# community-help
a
Hi everyone, if someone can help that would be great: how can I make a node forget the old IP addresses? It seems that deleting the folders at the end of https://github.com/typesense/typesense/issues/203#issuecomment-885464561 and restarting doesn't help.
k
Unfortunately we don't use Kubernetes ourselves, so we can't help much. However, note that the Raft clustering that Typesense uses expects a quorum to be present, i.e. 2/3 pods should be available at any time. So if pods are rotated wholesale, such problems can happen.
a
The problem is that when a single pod is replaced (on a 3-node cluster), the other two don't update the third IP address. For now we're just running it on a single node;
it seems that's the only way to make it work properly.
k
I see. I think many people are running it on k8s successfully, though.
a
Lucky them 😄 We've only had issues because of the old IPs not being updated.
Actually, it's the only issue we've faced; it was otherwise perfect.
k
You can also use DNS names. Have you tried that?
a
Yes, we use that. We have a StatefulSet and the config has typesense-0.typesense.svc.cluster.local, typesense-1.typesense, etc. with all the instances. I can see from the pods, using ping or dig, that the IP changes, but the logs show either:
can't do pre_vote as it is not in old-ip1,old-ip2,old-ip3
or, when a node tries to join, it tries to connect to the old IPs and they time out
because those pods don't exist anymore.
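For context, Typesense's --nodes flag points to a file containing a comma-separated list of host:peering_port:api_port entries (8107:8108 in the logs in this thread). A minimal sketch of supplying that list through a ConfigMap, with hypothetical StatefulSet/Service names in the default namespace:

apiVersion: v1
kind: ConfigMap
metadata:
  name: typesense-nodes
data:
  # single line, comma-separated: <host>:<peering_port>:<api_port>
  nodes: typesense-0.typesense.default.svc.cluster.local:8107:8108,typesense-1.typesense.default.svc.cluster.local:8107:8108,typesense-2.typesense.default.svc.cluster.local:8107:8108

The ConfigMap would be mounted into each pod and its path passed to the server via --nodes.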
d
Same problem here. When updating the StatefulSet (changing the liveness probe, for example), pods are restarted one after the other and their IPs change. The FQDNs stay the same, but the cluster cannot start up because:
I20210921 11:27:47.180541   164 node.cpp:1484] node default_group:10.91.1.147:8107:8108 term 3 start pre_vote
W20210921 11:27:47.180575   164 node.cpp:1494] node default_group:10.91.1.147:8107:8108 can't do pre_vote as it is not in 10.91.4.30:8107:8108,10.91.1.146:8107:8108,10.91.2.156:8107:8108
I20210921 11:27:49.052819   161 raft_server.cpp:544] Term: 3, last_index index: 265, committed_index: 0, known_applied_index: 263, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 518
W20210921 11:27:49.052857   161 raft_server.cpp:571] Multi-node with no leader: refusing to reset peers.
and this repeats forever
k
What endpoint / path are you using for the liveness and readiness checks?
d
Actually, I removed it because the cluster failed to converge; it kept restarting pods every 10 minutes
because the IPs are not the same as at the very first startup of the StatefulSet.
k
Use /health for readiness and /metrics.json for the liveness check. In a 3-pod configuration, only 1 pod can be rotated at a time. If the second pod is rotated too early, the cluster can be toast.
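A hypothetical sketch of those probes on the Typesense container, assuming the API listens on port 8108; the values are illustrative, and note that /metrics.json needs the admin API key (as mentioned further down in this thread):

readinessProbe:
  httpGet:
    path: /health
    port: 8108
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /metrics.json
    port: 8108
    httpHeaders:
      - name: X-TYPESENSE-API-KEY
        value: xyz   # placeholder admin API key
  initialDelaySeconds: 30
  periodSeconds: 10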
d
Ok, so it's missing a readiness probe?
k
cc @Carl -- maybe you can share how you are running Typesense on Kubernetes.
Those logs indicate the clustering state going for a toss. This only happens when more than 1 node leaves at the same time and new nodes also join. In a 3-node configuration, at least 2 nodes are needed for quorum. Most likely the pods are being rotated away too fast.
d
When I tried a readiness probe, 1 pod is started but never becomes ready, because the other ones are not started and the FQDNs do not resolve for pod-1.pod.svc.cluster.local and pod-2.pod.svc.cluster.local;
pod-1 and pod-2 never started.
k
According to https://github.com/typesense/typesense/issues/203#issuecomment-851027569 -- this should work after a couple of minutes. Unfortunately we don't run Kubernetes ourselves, so I have limited insight into the specifics.
d
E20210921 11:41:13.439110   161 raft_server.cpp:182] Unable to resolve host: typesense-poc-1.typesense-poc.infra.svc.cluster.local
E20210921 11:41:13.448516   161 raft_server.cpp:182] Unable to resolve host: typesense-poc-2.typesense-poc.infra.svc.cluster.local
E20210921 11:41:13.448570   161 configuration.cpp:43] Fail to parse typesense-poc-1.typesense-poc.infra.svc.cluster.local:8107:8108
E20210921 11:41:13.448576   161 raft_server.cpp:51] Failed to parse nodes configuration: `typesense-poc-0.typesense-poc.infra.svc.cluster.local:8107:8108,typesense-poc-1.typesense-poc.infra.svc.cluster.local:8107:8108,typesense-poc-2.typesense-poc.infra.svc.cluster.local:8107:8108` --  will retry shortly...
E20210921 11:41:43.464110   161 raft_server.cpp:182] Unable to resolve host: typesense-poc-1.typesense-poc.infra.svc.cluster.local
E20210921 11:41:43.475195   161 raft_server.cpp:182] Unable to resolve host: typesense-poc-2.typesense-poc.infra.svc.cluster.local
E20210921 11:41:43.475241   161 configuration.cpp:43] Fail to parse typesense-poc-1.typesense-poc.infra.svc.cluster.local:8107:8108
E20210921 11:41:43.475247   161 raft_server.cpp:51] Failed to parse nodes configuration: `typesense-poc-0.typesense-poc.infra.svc.cluster.local:8107:8108,typesense-poc-1.typesense-poc.infra.svc.cluster.local:8107:8108,typesense-poc-2.typesense-poc.infra.svc.cluster.local:8107:8108` --  will retry shortly...
E20210921 11:42:13.489486   161 raft_server.cpp:182] Unable to resolve host: typesense-poc-1.typesense-poc.infra.svc.cluster.local
E20210921 11:42:13.500676   161 raft_server.cpp:182] Unable to resolve host: typesense-poc-2.typesense-poc.infra.svc.cluster.local
E20210921 11:42:13.500722   161 configuration.cpp:43] Fail to parse typesense-poc-1.typesense-poc.infra.svc.cluster.local:8107:8108
E20210921 11:42:13.500726   161 raft_server.cpp:47] Giving up parsing nodes configuration: `typesense-poc-0.typesense-poc.infra.svc.cluster.local:8107:8108,typesense-poc-1.typesense-poc.infra.svc.cluster.local:8107:8108,typesense-poc-2.typesense-poc.infra.svc.cluster.local:8107:8108`
E20210921 11:42:13.500731   161 typesense_server_utils.cpp:256] Failed to start peering state
Here is what happens when the readiness probe is applied while creating the StatefulSet.
k
That DNS name should be resolvable, right?
typesense-poc-1.typesense-poc.infra.svc.cluster.local
--> did you exec into a pod to see if that DNS resolves?
d
k
If the process waited long enough, would the other pods launch, or are the pods launched only 1 at a time?
d
With the default strategy, 1 at a time.
c
Don't think we've encountered any of these issues, since we just run the one pod/node, so I can't help, sorry 😞 The only issue we encountered was the liveness probe timing out during a longer migration when updating Typesense, causing the pod to constantly restart, which is probably unrelated.
d
k
Adding publishNotReadyAddresses: true worked?
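(For reference: publishNotReadyAddresses is set on the headless Service that governs the StatefulSet, so that per-pod DNS records are published even before the pods pass readiness. A hypothetical sketch using the names visible in the logs; the selector label is an assumption:)

apiVersion: v1
kind: Service
metadata:
  name: typesense-poc
  namespace: infra
spec:
  clusterIP: None                  # headless Service backing the StatefulSet
  publishNotReadyAddresses: true   # publish pod DNS records before readiness
  selector:
    app: typesense-poc
  ports:
    - name: api
      port: 8108
    - name: peering
      port: 8107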
d
It seems so:
I20210921 13:20:52.996469   170 node.cpp:1484] node default_group:10.91.4.53:8107:8108 term 1 start pre_vote
W20210921 13:20:52.998214   173 node.cpp:1464] node default_group:10.91.4.53:8107:8108 request PreVote from 10.91.0.52:8107:8108 error: [E2][10.91.0.52:8107][E2]peer_id not exist
W20210921 13:20:52.998258   170 node.cpp:1464] node default_group:10.91.4.53:8107:8108 request PreVote from 10.91.2.166:8107:8108 error: [E2][10.91.2.166:8107][E2]peer_id not exist
I20210921 13:20:58.671039   173 node.cpp:1484] node default_group:10.91.4.53:8107:8108 term 1 start pre_vote
I20210921 13:20:58.673763   165 node.cpp:1435] node default_group:10.91.4.53:8107:8108 received PreVoteResponse from 10.91.0.52:8107:8108 term 1 granted 1
I20210921 13:20:58.673924   165 node.cpp:1549] node default_group:10.91.4.53:8107:8108 term 1 start vote and grant vote self
I20210921 13:20:58.677614   165 raft_meta.cpp:546] Saved single stable meta, path /usr/share/typesense/data/state/meta term 2 votedfor 10.91.4.53:8107:8108 time: 3280
W20210921 13:20:58.677717   170 node.cpp:1402] node default_group:10.91.4.53:8107:8108 received invalid PreVoteResponse from 10.91.2.166:8107:8108 ctx_version 5current_ctx_version 6
I20210921 13:20:58.680274   173 node.cpp:1348] node default_group:10.91.4.53:8107:8108 received RequestVoteResponse from 10.91.0.52:8107:8108 term 2 granted 1
I20210921 13:20:58.680315   173 node.cpp:1783] node default_group:10.91.4.53:8107:8108 term 2 become leader of group 10.91.0.52:8107:8108,10.91.4.53:8107:8108,10.91.2.166:8107:8108 
I20210921 13:20:58.680354   173 replicator.cpp:138] Replicator=1099511627789@10.91.0.52:8107:8108 is started, group default_group
I20210921 13:20:58.680768   173 replicator.cpp:138] Replicator=1138166333441@10.91.2.166:8107:8108 is started, group default_group
W20210921 13:20:58.681746   165 node.cpp:1315] node default_group:10.91.4.53:8107:8108 received invalid RequestVoteResponse from 10.91.2.166:8107:8108 ctx_version 1 current_ctx_version 2
I20210921 13:20:58.681857   173 log.cpp:108] Created new segment `/usr/share/typesense/data/state/log/log_inprogress_00000000000000000001' with fd=25
I20210921 13:20:58.683583   170 raft_server.h:254] Configuration of this group is 10.91.0.52:8107:8108,10.91.4.53:8107:8108,10.91.2.166:8107:8108
I20210921 13:20:58.683687   170 node.cpp:3142] node default_group:10.91.4.53:8107:8108 reset ConfigurationCtx, new_peers: 10.91.0.52:8107:8108,10.91.4.53:8107:8108,10.91.2.166:8107:8108, old_peers: 10.91.0.52:8107:8108,10.91.4.53:8107:8108,10.91.2.166:8107:8108
I20210921 13:20:58.684051   165 raft_server.h:237] Node becomes leader, term: 2
I20210921 13:21:00.731025   162 raft_server.cpp:544] Term: 2, last_index index: 1, committed_index: 1, known_applied_index: 1, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 0
I20210921 13:21:00.731228   173 raft_server.h:59] Peer refresh succeeded!
Let's try an update.
k
👍
d
Hmmm, this started quite well: the first node was restarted and joined the cluster, then Kubernetes waited for it to be ready and restarted the second pod, which did the same thing. But when the last one was restarted, typesense-2 failed abruptly:
I20210921 13:37:53.245733   162 batched_indexer.cpp:174] Running GC for aborted requests, req map size: 0
I20210921 13:38:00.361184   161 raft_server.cpp:544] Term: 2, last_index index: 7, committed_index: 7, known_applied_index: 7, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 0
I20210921 13:38:10.365317   161 raft_server.cpp:544] Term: 2, last_index index: 7, committed_index: 7, known_applied_index: 7, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 0
I20210921 13:38:20.368885   161 raft_server.cpp:544] Term: 2, last_index index: 7, committed_index: 7, known_applied_index: 7, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 0
I20210921 13:38:30.370760   161 raft_server.cpp:544] Term: 2, last_index index: 7, committed_index: 7, known_applied_index: 7, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 0
I20210921 13:38:31.142607   165 node.cpp:1549] node default_group:10.91.1.151:8107:8108 term 2 start vote and grant vote self
I20210921 13:38:31.142769   169 raft_server.h:263] Node stops following { leader_id=10.91.4.54:8107:8108, term=2, status=A follower's leader_id is reset to NULL as it begins to request_vote.}
W20210921 13:38:31.144012   172 node.cpp:1377] node default_group:10.91.1.151:8107:8108 received RequestVoteResponse from 10.91.4.54:8107:8108 error: [E111]Fail to connect Socket{id=384 addr=10.91.4.54:8107} (0x0x7f285a0f7c00): Connection refused
I20210921 13:38:31.146870   165 raft_meta.cpp:546] Saved single stable meta, path /usr/share/typesense/data/state/meta term 3 votedfor 10.91.1.151:8107:8108 time: 3212
I20210921 13:38:31.146903   165 node.cpp:1077] node default_group:10.91.1.151:8107:8108 received handle_timeout_now_request from 10.91.4.54:53018 at term=2
I20210921 13:38:31.151132   168 node.cpp:1348] node default_group:10.91.1.151:8107:8108 received RequestVoteResponse from 10.91.2.169:8107:8108 term 3 granted 1
I20210921 13:38:31.151156   168 node.cpp:1783] node default_group:10.91.1.151:8107:8108 term 3 become leader of group 10.91.4.54:8107:8108,10.91.1.151:8107:8108,10.91.2.169:8107:8108 
I20210921 13:38:31.151181   168 replicator.cpp:138] Replicator=2216203124741@10.91.4.54:8107:8108 is started, group default_group
I20210921 13:38:31.151351   168 replicator.cpp:138] Replicator=3311419785217@10.91.2.169:8107:8108 is started, group default_group
W20210921 13:38:31.151746   165 replicator.cpp:392] Group default_group fail to issue RPC to 10.91.4.54:8107:8108 _consecutive_error_times=1, [E111]Fail to connect Socket{id=8589934977 addr=10.91.4.54:8107} (0x0x7f285a0f7e00): Connection refused [R1][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R2][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R3][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977
I20210921 13:38:31.155787   172 raft_server.h:254] Configuration of this group is 10.91.4.54:8107:8108,10.91.1.151:8107:8108,10.91.2.169:8107:8108
I20210921 13:38:31.155813   172 node.cpp:3142] node default_group:10.91.1.151:8107:8108 reset ConfigurationCtx, new_peers: 10.91.4.54:8107:8108,10.91.1.151:8107:8108,10.91.2.169:8107:8108, old_peers: 10.91.4.54:8107:8108,10.91.1.151:8107:8108,10.91.2.169:8107:8108
I20210921 13:38:31.156275   165 raft_server.h:237] Node becomes leader, term: 3
I20210921 13:38:31.251830   165 socket.cpp:2201] Checking Socket{id=8589934977 addr=10.91.4.54:8107} (0x7f285a0f7e00)
W20210921 13:38:33.652561   171 replicator.cpp:392] Group default_group fail to issue RPC to 10.91.4.54:8107:8108 _consecutive_error_times=11, [E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R1][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R2][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R3][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977
W20210921 13:38:34.752591   165 socket.cpp:1193] Fail to wait EPOLLOUT of fd=24: Connection timed out [110]
W20210921 13:38:36.153350   168 replicator.cpp:292] Group default_group fail to issue RPC to 10.91.4.54:8107:8108 _consecutive_error_times=21, [E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R1][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R2][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R3][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977
W20210921 13:38:38.252923   171 socket.cpp:1193] Fail to wait EPOLLOUT of fd=24: Connection timed out [110]
W20210921 13:38:38.654515   171 replicator.cpp:292] Group default_group fail to issue RPC to 10.91.4.54:8107:8108 _consecutive_error_times=31, [E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R1][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R2][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R3][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977
I20210921 13:38:40.373698   161 raft_server.cpp:544] Term: 3, last_index index: 8, committed_index: 8, known_applied_index: 8, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 0
I20210921 13:38:40.373783   165 raft_server.h:59] Peer refresh succeeded!
W20210921 13:38:41.155587   172 replicator.cpp:292] Group default_group fail to issue RPC to 10.91.4.54:8107:8108 _consecutive_error_times=41, [E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R1][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R2][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R3][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977
W20210921 13:38:41.753235   165 socket.cpp:1193] Fail to wait EPOLLOUT of fd=24: Connection timed out [110]
W20210921 13:38:43.656654   172 replicator.cpp:392] Group default_group fail to issue RPC to 10.91.4.54:8107:8108 _consecutive_error_times=51, [E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R1][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R2][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R3][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977
W20210921 13:38:45.253569   165 socket.cpp:1193] Fail to wait EPOLLOUT of fd=24: Connection timed out [110]
W20210921 13:38:46.157712   171 replicator.cpp:392] Group default_group fail to issue RPC to 10.91.4.54:8107:8108 _consecutive_error_times=61, [E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R1][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R2][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R3][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977
W20210921 13:38:48.658804   165 replicator.cpp:392] Group default_group fail to issue RPC to 10.91.4.54:8107:8108 _consecutive_error_times=71, [E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R1][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R2][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R3][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977
W20210921 13:38:48.753927   165 socket.cpp:1193] Fail to wait EPOLLOUT of fd=24: Connection timed out [110]
E20210921 13:38:50.383044   161 raft_server.cpp:182] Unable to resolve host: typesense-poc-0.typesense-poc.infra.svc.cluster.local
E20210921 13:38:50.383810   161 configuration.cpp:43] Fail to parse typesense-poc-0.typesense-poc.infra.svc.cluster.local:8107:8108
I20210921 13:38:50.383844   161 raft_server.cpp:544] Term: 3, last_index index: 8, committed_index: 8, known_applied_index: 8, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 0
I20210921 13:38:50.383863   161 node.cpp:3029] node default_group:10.91.1.151:8107:8108 change_peers from 10.91.4.54:8107:8108,10.91.1.151:8107:8108,10.91.2.169:8107:8108 to , begin removing.
F20210921 13:38:50.383977   172 replicator.cpp:607] Check failed: !entry->peers->empty() log_index=9
*** Check failure stack trace: ***
    @          0x1592de2  google::LogMessage::Fail()
    @          0x1592d40  google::LogMessage::SendToLog()
    @          0x1592682  google::LogMessage::Flush()
    @          0x1595b5c  google::LogMessageFatal::~LogMessageFatal()
    @           0xc5754d  braft::Replicator::_prepare_entry()
    @           0xc5ffb5  braft::Replicator::_send_entries()
    @           0xc60a08  braft::Replicator::_continue_sending()
    @           0xca92ed  braft::LogManager::run_on_new_log()
    @           0xcff76f  bthread::TaskGroup::task_runner()
    @           0xe5a8c1  bthread_make_fcontext
E20210921 13:38:50.651902   172 backward.hpp:4199] Stack trace (most recent call last) in thread 172:
E20210921 13:38:50.651933   172 backward.hpp:4199] #12   Object "/opt/typesense-server", at 0xe5a8c0, in bthread_make_fcontext
E20210921 13:38:50.651943   172 backward.hpp:4199] #11   Object "/opt/typesense-server", at 0xcff76e, in bthread::TaskGroup::task_runner(long)
E20210921 13:38:50.651948   172 backward.hpp:4199] #10   Source "/opt/braft-c649789133566dc06e39ebd0c69a824f8e98993a/src/braft/log_manager.cpp", line 831, in run_on_new_log [0xca92ec]
E20210921 13:38:50.651952   172 backward.hpp:4199] #9    Source "/opt/braft-c649789133566dc06e39ebd0c69a824f8e98993a/src/braft/replicator.cpp", line 723, in _continue_sending [0xc60a07]
E20210921 13:38:50.651955   172 backward.hpp:4199] #8    Source "/opt/braft-c649789133566dc06e39ebd0c69a824f8e98993a/src/braft/replicator.cpp", line 649, in _send_entries [0xc5ffb4]
E20210921 13:38:50.651959   172 backward.hpp:4199] #7    Source "/opt/braft-c649789133566dc06e39ebd0c69a824f8e98993a/src/braft/replicator.cpp", line 617, in _prepare_entry [0xc5754c]
E20210921 13:38:50.651962   172 backward.hpp:4199] #6    Object "/opt/typesense-server", at 0x1595b5b, in google::LogMessageFatal::~LogMessageFatal()
E20210921 13:38:50.651965   172 backward.hpp:4199] #5    Object "/opt/typesense-server", at 0x1592681, in google::LogMessage::Flush()
E20210921 13:38:50.651969   172 backward.hpp:4199] #4    Object "/opt/typesense-server", at 0x1592d3f, in google::LogMessage::SendToLog()
E20210921 13:38:50.651975   172 backward.hpp:4199] #3    Object "/opt/typesense-server", at 0x1592de1, in google::LogMessage::Fail()
E20210921 13:38:50.651979   172 backward.hpp:4199] #2    Object "/opt/typesense-server", at 0x159a079, in google::DumpStackTraceAndExit()
E20210921 13:38:50.651983   172 backward.hpp:4199] #1    Object "/lib/x86_64-linux-gnu/libc-2.23.so", at 0x7f288a992039, in abort
E20210921 13:38:50.651988   172 backward.hpp:4199] #0    Object "/lib/x86_64-linux-gnu/libc-2.23.so", at 0x7f288a990438, in raise
Aborted (Signal sent by tkill() 1 0)
E20210921 13:38:50.939128   172 typesense_server.cpp:88] Typesense is terminating abruptly.
k
typesense-2 failed with the above logs when typesense-3 was restarted?
d
yes
when typesense-0 was restarted
k
I see
Unable to resolve host: typesense-poc-0.typesense-poc.infra.svc.cluster.local
so it seems like even typesense-0 is not getting resolved?
d
(the update goes from 2 to 0)
k
Ok got it.
Checking to see if we do retries on DNS elsewhere after the initial start.
d
running typesense/typesense:0.22.0.rcs6
k
Instead of running a rolling restart, can you try doing a manual "restart" by killing a pod and letting K8s replace it?
Identify the leader pod, and then you can do this sequence: follower, follower, leader.
After each pod terminates, wait for the leader's log to fully recover before attempting the next pod.
d
On a second attempt it succeeded.
Can I ensure the leader is typesense-0 before applying the update?
k
Technically, it should not matter which order you do it in. I suggested that only to help with debugging.
That crash log, though, looks like a glog issue I have seen before, so it might just be bad luck. Try a few more times to see what happens.
d
As I use Terraform to apply the update, I can just rely on the readiness probe to wait before switching.
k
Oh wait, it's not a glog issue, I see a fatal error:
F20210921 13:38:50.383977   172 replicator.cpp:607] Check failed: !entry->peers->empty() log_index=9
Manually controlling rotation is probably what we can try first to see if somehow the health readiness is not enough.
It looks like we are not doing retries on DNS during subsequent refreshes. I will have to fix that.
👍 1
a
Hey @Kishore Nallan, are the DNS retries something you may have addressed in an RC build?
Sorry, I see the comments on the GitHub issue now 👌
k
Yes, the crucial thing is the readiness and health check. You should just use a TCP port check.
Or use the metrics endpoint like I suggested, but that endpoint requires auth. It's better not to base Kubernetes pod rotation on the health check, because the /health endpoint returns unhealthy if a node lags behind on writes a bit, and that's not a reason for replacing the pod.
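A sketch of the TCP-based checks suggested here; the port and timing values are illustrative assumptions:

readinessProbe:
  tcpSocket:
    port: 8108
  periodSeconds: 10
livenessProbe:
  tcpSocket:
    port: 8108
  initialDelaySeconds: 30
  periodSeconds: 10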
a
The best way I've found to update a 3-node setup is to use the partition setting on the rollingUpdate. By setting it to 2 and applying an update, only the third ordinal Typesense node in the StatefulSet gets updated. Once all three nodes are in sync, I lower the setting to 1 and apply my change again, which only affects the second node. Lastly, 0 to finish up. It ensures the rollout goes one node at a time:
spec:
  serviceName: ts
  podManagementPolicy: Parallel
  replicas: 3
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2
k
👍 This is a good approach. Confirming that nodes finish catching up is crucial to ensuring that you don't accidentally end up killing 2 pods at the same time.