# community-help
a
Hi everyone, if someone can help that would be great: how can I make a node forget the old IP addresses? It seems that deleting the folders at the end of https://github.com/typesense/typesense/issues/203#issuecomment-885464561 and restarting doesn't help.
k
Unfortunately we don't use Kubernetes ourselves, so we can't help much. However, note that the Raft clustering that Typesense uses expects a quorum to be present, i.e. 2/3 pods should be available at any time. So if pods are rotated wholesale, such problems can happen.
a
The problem is that when a single pod is replaced (on a 3-node cluster), the other two don't update the third IP address. For now we're just running it on a single node;
it seems that's the only way to make it work properly.
k
I see. I think many people are running it on k8s successfully, though.
a
Lucky them 😄 We've only had issues because of the old IPs not being updated.
Actually, it's the only issue we've faced; it was otherwise perfect.
k
You can also use DNS names. Have you tried that?
a
Yes, we use that. We have a StatefulSet and the config has typesense-0.typesense.svc.cluster.local, typesense-1.typesense, etc. with all the instances. I can see from the pods, using ping or dig, that the IP changes, but the logs show either:
can't do pre_vote as it is not in old-ip1,old-ip2,old-ip3
or, when a node tries to join, it tries to connect to the old IPs and they time out
because those pods don't exist anymore.
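For context, Typesense's --nodes flag points to a file containing a comma-separated list of host:peering_port:api_port entries (8107:8108 in the logs in this thread). A minimal sketch of supplying that list through a ConfigMap, with hypothetical StatefulSet/Service names in the default namespace:

apiVersion: v1
kind: ConfigMap
metadata:
  name: typesense-nodes
data:
  # single line, comma-separated: <host>:<peering_port>:<api_port>
  nodes: typesense-0.typesense.default.svc.cluster.local:8107:8108,typesense-1.typesense.default.svc.cluster.local:8107:8108,typesense-2.typesense.default.svc.cluster.local:8107:8108

The ConfigMap would be mounted into each pod and its path passed to the server via --nodes.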
d
Same problem here. When updating the StatefulSet (changing the liveness probe, for example), pods are restarted one after the other and their IPs change. The FQDNs stay the same, but the cluster cannot start up because:
I20210921 11:27:47.180541   164 node.cpp:1484] node default_group:10.91.1.147:8107:8108 term 3 start pre_vote
W20210921 11:27:47.180575   164 node.cpp:1494] node default_group:10.91.1.147:8107:8108 can't do pre_vote as it is not in 10.91.4.30:8107:8108,10.91.1.146:8107:8108,10.91.2.156:8107:8108
I20210921 11:27:49.052819   161 raft_server.cpp:544] Term: 3, last_index index: 265, committed_index: 0, known_applied_index: 263, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 518
W20210921 11:27:49.052857   161 raft_server.cpp:571] Multi-node with no leader: refusing to reset peers.
and this repeats forever
k
What endpoint / path are you using for the liveness and readiness checks?
d
Actually, I removed it because the cluster failed to converge; it kept restarting pods every 10 minutes
because the IPs are not the same as at the very first startup of the StatefulSet.
k
Use /health for readiness and /metrics.json for the liveness check. In a 3-pod configuration, only 1 pod can be rotated at a time. If the second pod is rotated too early, the cluster can be toast.
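A hypothetical sketch of those probes on the Typesense container, assuming the API listens on port 8108; the values are illustrative, and note that /metrics.json needs the admin API key (as mentioned further down in this thread):

readinessProbe:
  httpGet:
    path: /health
    port: 8108
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /metrics.json
    port: 8108
    httpHeaders:
      - name: X-TYPESENSE-API-KEY
        value: xyz   # placeholder admin API key
  initialDelaySeconds: 30
  periodSeconds: 10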
d
Ok, so it's missing a readiness probe?
k
cc @Carl -- maybe you can share how you are running Typesense on Kubernetes.
Those logs indicate the clustering state going for a toss. This only happens when more than 1 node leaves at the same time and new nodes also join. In a 3-node configuration, at least 2 nodes are needed for quorum. Most likely the pods are being rotated away too fast.
d
When I tried a readiness probe, 1 pod is started but never becomes ready, because the other ones are not started and the FQDNs do not resolve for pod-1.pod.svc.cluster.local and pod-2.pod.svc.cluster.local;
pod-1 and pod-2 never started.
k
According to https://github.com/typesense/typesense/issues/203#issuecomment-851027569 -- this should work after a couple of minutes. Unfortunately we don't run Kubernetes ourselves, so I have limited insight into the specifics.
d
E20210921 11:41:13.439110   161 raft_server.cpp:182] Unable to resolve host: typesense-poc-1.typesense-poc.infra.svc.cluster.local
E20210921 11:41:13.448516   161 raft_server.cpp:182] Unable to resolve host: typesense-poc-2.typesense-poc.infra.svc.cluster.local
E20210921 11:41:13.448570   161 configuration.cpp:43] Fail to parse typesense-poc-1.typesense-poc.infra.svc.cluster.local:8107:8108
E20210921 11:41:13.448576   161 raft_server.cpp:51] Failed to parse nodes configuration: `typesense-poc-0.typesense-poc.infra.svc.cluster.local:8107:8108,typesense-poc-1.typesense-poc.infra.svc.cluster.local:8107:8108,typesense-poc-2.typesense-poc.infra.svc.cluster.local:8107:8108` --  will retry shortly...
E20210921 11:41:43.464110   161 raft_server.cpp:182] Unable to resolve host: typesense-poc-1.typesense-poc.infra.svc.cluster.local
E20210921 11:41:43.475195   161 raft_server.cpp:182] Unable to resolve host: typesense-poc-2.typesense-poc.infra.svc.cluster.local
E20210921 11:41:43.475241   161 configuration.cpp:43] Fail to parse typesense-poc-1.typesense-poc.infra.svc.cluster.local:8107:8108
E20210921 11:41:43.475247   161 raft_server.cpp:51] Failed to parse nodes configuration: `typesense-poc-0.typesense-poc.infra.svc.cluster.local:8107:8108,typesense-poc-1.typesense-poc.infra.svc.cluster.local:8107:8108,typesense-poc-2.typesense-poc.infra.svc.cluster.local:8107:8108` --  will retry shortly...
E20210921 11:42:13.489486   161 raft_server.cpp:182] Unable to resolve host: typesense-poc-1.typesense-poc.infra.svc.cluster.local
E20210921 11:42:13.500676   161 raft_server.cpp:182] Unable to resolve host: typesense-poc-2.typesense-poc.infra.svc.cluster.local
E20210921 11:42:13.500722   161 configuration.cpp:43] Fail to parse typesense-poc-1.typesense-poc.infra.svc.cluster.local:8107:8108
E20210921 11:42:13.500726   161 raft_server.cpp:47] Giving up parsing nodes configuration: `typesense-poc-0.typesense-poc.infra.svc.cluster.local:8107:8108,typesense-poc-1.typesense-poc.infra.svc.cluster.local:8107:8108,typesense-poc-2.typesense-poc.infra.svc.cluster.local:8107:8108`
E20210921 11:42:13.500731   161 typesense_server_utils.cpp:256] Failed to start peering state
Here is what happens when the readiness probe is applied while creating the StatefulSet.
k
That DNS name should be resolvable, right?
typesense-poc-1.typesense-poc.infra.svc.cluster.local
--> did you exec into a pod to see if that DNS resolves?
d
k
If the process waited long enough, would the other pods launch, or are the pods launched only 1 at a time?
d
With the default strategy, 1 at a time.
c
Don't think we've encountered any of these issues, since we just run the one pod/node, so I can't help, sorry 😞 The only issue we encountered was the liveness probe timing out during a longer migration when updating Typesense, causing the pod to constantly restart, which is probably unrelated.
d
k
Adding publishNotReadyAddresses: true worked?
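(For reference: publishNotReadyAddresses is set on the headless Service that governs the StatefulSet, so that per-pod DNS records are published even before the pods pass readiness. A hypothetical sketch using the names visible in the logs; the selector label is an assumption:)

apiVersion: v1
kind: Service
metadata:
  name: typesense-poc
  namespace: infra
spec:
  clusterIP: None                  # headless Service backing the StatefulSet
  publishNotReadyAddresses: true   # publish pod DNS records before readiness
  selector:
    app: typesense-poc
  ports:
    - name: api
      port: 8108
    - name: peering
      port: 8107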
d
It seems so:
I20210921 13:20:52.996469   170 node.cpp:1484] node default_group:10.91.4.53:8107:8108 term 1 start pre_vote
W20210921 13:20:52.998214   173 node.cpp:1464] node default_group:10.91.4.53:8107:8108 request PreVote from 10.91.0.52:8107:8108 error: [E2][10.91.0.52:8107][E2]peer_id not exist
W20210921 13:20:52.998258   170 node.cpp:1464] node default_group:10.91.4.53:8107:8108 request PreVote from 10.91.2.166:8107:8108 error: [E2][10.91.2.166:8107][E2]peer_id not exist
I20210921 13:20:58.671039   173 node.cpp:1484] node default_group:10.91.4.53:8107:8108 term 1 start pre_vote
I20210921 13:20:58.673763   165 node.cpp:1435] node default_group:10.91.4.53:8107:8108 received PreVoteResponse from 10.91.0.52:8107:8108 term 1 granted 1
I20210921 13:20:58.673924   165 node.cpp:1549] node default_group:10.91.4.53:8107:8108 term 1 start vote and grant vote self
I20210921 13:20:58.677614   165 raft_meta.cpp:546] Saved single stable meta, path /usr/share/typesense/data/state/meta term 2 votedfor 10.91.4.53:8107:8108 time: 3280
W20210921 13:20:58.677717   170 node.cpp:1402] node default_group:10.91.4.53:8107:8108 received invalid PreVoteResponse from 10.91.2.166:8107:8108 ctx_version 5current_ctx_version 6
I20210921 13:20:58.680274   173 node.cpp:1348] node default_group:10.91.4.53:8107:8108 received RequestVoteResponse from 10.91.0.52:8107:8108 term 2 granted 1
I20210921 13:20:58.680315   173 node.cpp:1783] node default_group:10.91.4.53:8107:8108 term 2 become leader of group 10.91.0.52:8107:8108,10.91.4.53:8107:8108,10.91.2.166:8107:8108 
I20210921 13:20:58.680354   173 replicator.cpp:138] Replicator=1099511627789@10.91.0.52:8107:8108 is started, group default_group
I20210921 13:20:58.680768   173 replicator.cpp:138] Replicator=1138166333441@10.91.2.166:8107:8108 is started, group default_group
W20210921 13:20:58.681746   165 node.cpp:1315] node default_group:10.91.4.53:8107:8108 received invalid RequestVoteResponse from 10.91.2.166:8107:8108 ctx_version 1 current_ctx_version 2
I20210921 13:20:58.681857   173 log.cpp:108] Created new segment `/usr/share/typesense/data/state/log/log_inprogress_00000000000000000001' with fd=25
I20210921 13:20:58.683583   170 raft_server.h:254] Configuration of this group is 10.91.0.52:8107:8108,10.91.4.53:8107:8108,10.91.2.166:8107:8108
I20210921 13:20:58.683687   170 node.cpp:3142] node default_group:10.91.4.53:8107:8108 reset ConfigurationCtx, new_peers: 10.91.0.52:8107:8108,10.91.4.53:8107:8108,10.91.2.166:8107:8108, old_peers: 10.91.0.52:8107:8108,10.91.4.53:8107:8108,10.91.2.166:8107:8108
I20210921 13:20:58.684051   165 raft_server.h:237] Node becomes leader, term: 2
I20210921 13:21:00.731025   162 raft_server.cpp:544] Term: 2, last_index index: 1, committed_index: 1, known_applied_index: 1, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 0
I20210921 13:21:00.731228   173 raft_server.h:59] Peer refresh succeeded!
Let's try an update.
k
👍
d
Hmmm, this started quite well: the first node was restarted and joined the cluster, then Kubernetes waited for it to be ready and restarted the second pod, which did the same thing. But when the last one was restarted, typesense-2 failed abruptly:
I20210921 13:37:53.245733   162 batched_indexer.cpp:174] Running GC for aborted requests, req map size: 0
I20210921 13:38:00.361184   161 raft_server.cpp:544] Term: 2, last_index index: 7, committed_index: 7, known_applied_index: 7, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 0
I20210921 13:38:10.365317   161 raft_server.cpp:544] Term: 2, last_index index: 7, committed_index: 7, known_applied_index: 7, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 0
I20210921 13:38:20.368885   161 raft_server.cpp:544] Term: 2, last_index index: 7, committed_index: 7, known_applied_index: 7, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 0
I20210921 13:38:30.370760   161 raft_server.cpp:544] Term: 2, last_index index: 7, committed_index: 7, known_applied_index: 7, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 0
I20210921 13:38:31.142607   165 node.cpp:1549] node default_group:10.91.1.151:8107:8108 term 2 start vote and grant vote self
I20210921 13:38:31.142769   169 raft_server.h:263] Node stops following { leader_id=10.91.4.54:8107:8108, term=2, status=A follower's leader_id is reset to NULL as it begins to request_vote.}
W20210921 13:38:31.144012   172 node.cpp:1377] node default_group:10.91.1.151:8107:8108 received RequestVoteResponse from 10.91.4.54:8107:8108 error: [E111]Fail to connect Socket{id=384 addr=10.91.4.54:8107} (0x0x7f285a0f7c00): Connection refused
I20210921 13:38:31.146870   165 raft_meta.cpp:546] Saved single stable meta, path /usr/share/typesense/data/state/meta term 3 votedfor 10.91.1.151:8107:8108 time: 3212
I20210921 13:38:31.146903   165 node.cpp:1077] node default_group:10.91.1.151:8107:8108 received handle_timeout_now_request from 10.91.4.54:53018 at term=2
I20210921 13:38:31.151132   168 node.cpp:1348] node default_group:10.91.1.151:8107:8108 received RequestVoteResponse from 10.91.2.169:8107:8108 term 3 granted 1
I20210921 13:38:31.151156   168 node.cpp:1783] node default_group:10.91.1.151:8107:8108 term 3 become leader of group 10.91.4.54:8107:8108,10.91.1.151:8107:8108,10.91.2.169:8107:8108 
I20210921 13:38:31.151181   168 replicator.cpp:138] Replicator=2216203124741@10.91.4.54:8107:8108 is started, group default_group
I20210921 13:38:31.151351   168 replicator.cpp:138] Replicator=3311419785217@10.91.2.169:8107:8108 is started, group default_group
W20210921 13:38:31.151746   165 replicator.cpp:392] Group default_group fail to issue RPC to 10.91.4.54:8107:8108 _consecutive_error_times=1, [E111]Fail to connect Socket{id=8589934977 addr=10.91.4.54:8107} (0x0x7f285a0f7e00): Connection refused [R1][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R2][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R3][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977
I20210921 13:38:31.155787   172 raft_server.h:254] Configuration of this group is 10.91.4.54:8107:8108,10.91.1.151:8107:8108,10.91.2.169:8107:8108
I20210921 13:38:31.155813   172 node.cpp:3142] node default_group:10.91.1.151:8107:8108 reset ConfigurationCtx, new_peers: 10.91.4.54:8107:8108,10.91.1.151:8107:8108,10.91.2.169:8107:8108, old_peers: 10.91.4.54:8107:8108,10.91.1.151:8107:8108,10.91.2.169:8107:8108
I20210921 13:38:31.156275   165 raft_server.h:237] Node becomes leader, term: 3
I20210921 13:38:31.251830   165 socket.cpp:2201] Checking Socket{id=8589934977 addr=10.91.4.54:8107} (0x7f285a0f7e00)
W20210921 13:38:33.652561   171 replicator.cpp:392] Group default_group fail to issue RPC to 10.91.4.54:8107:8108 _consecutive_error_times=11, [E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R1][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R2][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R3][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977
W20210921 13:38:34.752591   165 socket.cpp:1193] Fail to wait EPOLLOUT of fd=24: Connection timed out [110]
W20210921 13:38:36.153350   168 replicator.cpp:292] Group default_group fail to issue RPC to 10.91.4.54:8107:8108 _consecutive_error_times=21, [E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R1][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R2][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R3][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977
W20210921 13:38:38.252923   171 socket.cpp:1193] Fail to wait EPOLLOUT of fd=24: Connection timed out [110]
W20210921 13:38:38.654515   171 replicator.cpp:292] Group default_group fail to issue RPC to 10.91.4.54:8107:8108 _consecutive_error_times=31, [E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R1][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R2][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R3][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977
I20210921 13:38:40.373698   161 raft_server.cpp:544] Term: 3, last_index index: 8, committed_index: 8, known_applied_index: 8, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 0
I20210921 13:38:40.373783   165 raft_server.h:59] Peer refresh succeeded!
W20210921 13:38:41.155587   172 replicator.cpp:292] Group default_group fail to issue RPC to 10.91.4.54:8107:8108 _consecutive_error_times=41, [E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R1][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R2][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R3][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977
W20210921 13:38:41.753235   165 socket.cpp:1193] Fail to wait EPOLLOUT of fd=24: Connection timed out [110]
W20210921 13:38:43.656654   172 replicator.cpp:392] Group default_group fail to issue RPC to 10.91.4.54:8107:8108 _consecutive_error_times=51, [E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R1][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R2][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R3][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977
W20210921 13:38:45.253569   165 socket.cpp:1193] Fail to wait EPOLLOUT of fd=24: Connection timed out [110]
W20210921 13:38:46.157712   171 replicator.cpp:392] Group default_group fail to issue RPC to 10.91.4.54:8107:8108 _consecutive_error_times=61, [E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R1][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R2][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R3][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977
W20210921 13:38:48.658804   165 replicator.cpp:392] Group default_group fail to issue RPC to 10.91.4.54:8107:8108 _consecutive_error_times=71, [E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R1][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R2][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R3][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977
W20210921 13:38:48.753927   165 socket.cpp:1193] Fail to wait EPOLLOUT of fd=24: Connection timed out [110]
E20210921 13:38:50.383044   161 raft_server.cpp:182] Unable to resolve host: typesense-poc-0.typesense-poc.infra.svc.cluster.local
E20210921 13:38:50.383810   161 configuration.cpp:43] Fail to parse typesense-poc-0.typesense-poc.infra.svc.cluster.local:8107:8108
I20210921 13:38:50.383844   161 raft_server.cpp:544] Term: 3, last_index index: 8, committed_index: 8, known_applied_index: 8, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 0
I20210921 13:38:50.383863   161 node.cpp:3029] node default_group:10.91.1.151:8107:8108 change_peers from 10.91.4.54:8107:8108,10.91.1.151:8107:8108,10.91.2.169:8107:8108 to , begin removing.
F20210921 13:38:50.383977   172 replicator.cpp:607] Check failed: !entry->peers->empty() log_index=9
*** Check failure stack trace: ***
    @          0x1592de2  google::LogMessage::Fail()
    @          0x1592d40  google::LogMessage::SendToLog()
    @          0x1592682  google::LogMessage::Flush()
    @          0x1595b5c  google::LogMessageFatal::~LogMessageFatal()
    @           0xc5754d  braft::Replicator::_prepare_entry()
    @           0xc5ffb5  braft::Replicator::_send_entries()
    @           0xc60a08  braft::Replicator::_continue_sending()
    @           0xca92ed  braft::LogManager::run_on_new_log()
    @           0xcff76f  bthread::TaskGroup::task_runner()
    @           0xe5a8c1  bthread_make_fcontext
E20210921 13:38:50.651902   172 backward.hpp:4199] Stack trace (most recent call last) in thread 172:
E20210921 13:38:50.651933   172 backward.hpp:4199] #12   Object "/opt/typesense-server", at 0xe5a8c0, in bthread_make_fcontext
E20210921 13:38:50.651943   172 backward.hpp:4199] #11   Object "/opt/typesense-server", at 0xcff76e, in bthread::TaskGroup::task_runner(long)
E20210921 13:38:50.651948   172 backward.hpp:4199] #10   Source "/opt/braft-c649789133566dc06e39ebd0c69a824f8e98993a/src/braft/log_manager.cpp", line 831, in run_on_new_log [0xca92ec]
E20210921 13:38:50.651952   172 backward.hpp:4199] #9    Source "/opt/braft-c649789133566dc06e39ebd0c69a824f8e98993a/src/braft/replicator.cpp", line 723, in _continue_sending [0xc60a07]
E20210921 13:38:50.651955   172 backward.hpp:4199] #8    Source "/opt/braft-c649789133566dc06e39ebd0c69a824f8e98993a/src/braft/replicator.cpp", line 649, in _send_entries [0xc5ffb4]
E20210921 13:38:50.651959   172 backward.hpp:4199] #7    Source "/opt/braft-c649789133566dc06e39ebd0c69a824f8e98993a/src/braft/replicator.cpp", line 617, in _prepare_entry [0xc5754c]
E20210921 13:38:50.651962   172 backward.hpp:4199] #6    Object "/opt/typesense-server", at 0x1595b5b, in google::LogMessageFatal::~LogMessageFatal()
E20210921 13:38:50.651965   172 backward.hpp:4199] #5    Object "/opt/typesense-server", at 0x1592681, in google::LogMessage::Flush()
E20210921 13:38:50.651969   172 backward.hpp:4199] #4    Object "/opt/typesense-server", at 0x1592d3f, in google::LogMessage::SendToLog()
E20210921 13:38:50.651975   172 backward.hpp:4199] #3    Object "/opt/typesense-server", at 0x1592de1, in google::LogMessage::Fail()
E20210921 13:38:50.651979   172 backward.hpp:4199] #2    Object "/opt/typesense-server", at 0x159a079, in google::DumpStackTraceAndExit()
E20210921 13:38:50.651983   172 backward.hpp:4199] #1    Object "/lib/x86_64-linux-gnu/libc-2.23.so", at 0x7f288a992039, in abort
E20210921 13:38:50.651988   172 backward.hpp:4199] #0    Object "/lib/x86_64-linux-gnu/libc-2.23.so", at 0x7f288a990438, in raise
Aborted (Signal sent by tkill() 1 0)
E20210921 13:38:50.939128   172 typesense_server.cpp:88] Typesense is terminating abruptly.
k
typesense-2 failed with the above logs when typesense-3 was restarted?
d
yes
when typesense-0 was restarted
k
I see
Unable to resolve host: typesense-poc-0.typesense-poc.infra.svc.cluster.local
so it seems like even typesense-0 is not getting resolved?
d
(the update goes from 2 to 0)
k
Ok got it.
Checking to see if we do retries on DNS elsewhere after the initial start.
d
running typesense/typesense:0.22.0.rcs6
k
Instead of running a rolling restart, can you try doing a manual "restart" by killing a pod and letting K8s replace it?
Identify the leader pod, and then you can do this sequence: follower, follower, leader.
After each pod terminates, wait for the leader's log to fully recover before attempting the next pod.
d
On a second attempt it succeeded.
Can I ensure the leader is typesense-0 before applying the update?
k
Technically, it should not matter which order you do it in. I suggested that only to help with debugging.
That crash log, though, looks like a glog issue I have seen before, so it might just be bad luck. Try a few more times to see what happens.
d
As I use Terraform to apply the update, I can just rely on the readiness probe to wait before switching.
k
Oh wait, it's not a glog issue, I see a fatal error:
F20210921 13:38:50.383977   172 replicator.cpp:607] Check failed: !entry->peers->empty() log_index=9
Manually controlling rotation is probably what we can try first to see if somehow the health readiness is not enough.
It looks like we are not doing retries on DNS during subsequent refreshes. I will have to fix that.
👍 1
a
Hey @Kishore Nallan, are the DNS retries something you may have addressed in an RC build?
Sorry, I see the comments on the GitHub issue now 👌
k
Yes, the crucial thing is the readiness and health check. You should just use a TCP port check.
Or use the metrics endpoint like I suggested, but that endpoint requires auth. It's better not to base Kubernetes pod rotation on the health check, because the /health endpoint returns unhealthy if a node lags behind on writes a bit, and that's not a reason for replacing the pod.
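A sketch of the TCP-based checks suggested here; the port and timing values are illustrative assumptions:

readinessProbe:
  tcpSocket:
    port: 8108
  periodSeconds: 10
livenessProbe:
  tcpSocket:
    port: 8108
  initialDelaySeconds: 30
  periodSeconds: 10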
a
The best way I've found to update a 3-node setup is to use the partition setting on the rollingUpdate. By setting it to 2 and applying an update, only the third ordinal Typesense node in the StatefulSet gets updated. Once all three nodes are in sync, I lower the setting to 1 and apply my change again, which only affects the second node. Lastly, 0 to finish up. It ensures the rollout goes one node at a time:
spec:
  serviceName: ts
  podManagementPolicy: Parallel
  replicas: 3
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2
k
👍 This is a good approach. Confirming that nodes finish catching up is crucial to ensuring that you don't accidentally end up killing 2 pods at the same time.