Debugging and Recovery of a Stuck Typesense Cluster
TLDR: Charlie had a wedged staging cluster. Jason provided debugging and recovery steps, and Adrian added further insights. The issue turned out to be insufficient disk space; once Adrian increased the disk size, the cluster healed itself.
Jul 24, 2023 (2 months ago)
Charlie
04:18 PM
• typesense-0 is not healthy, but typesense-1 and -2 are (from curling /health)
• typesense-2 is the leader
• typesense-1 logs look OK. It's doing GC and getting prevotes from typesense-0.
• typesense-0 logs (note disrupted):
):I20230724 16:14:58.435681 271 node.cpp:1504] node default_group:192.168.146.44:8107:8108 received PreVoteResponse from 192.168.147.30:8107:8108 term 406 granted 0 rejected_by_lease 0 disrupted 0
I20230724 16:14:58.435974 273 node.cpp:1504] node default_group:192.168.146.44:8107:8108 received PreVoteResponse from 192.168.131.48:8107:8108 term 406 granted 0 rejected_by_lease 0 disrupted 1
• It appears that typesense-2 cannot issue RPCs to typesense-1. Logs for typesense-2:
W20230724 16:13:19.779845 332 replicator.cpp:397] Group default_group fail to issue RPC to 192.168.146.30:8107:8108 _consecutive_error_times=20521, [E112]Not connected to 192.168.146.30:8107 yet, server_id=42949673533 [R1][E112]Not connected to 192.168.146.30:8107 yet, server_id=42949673533 [R2][E112]Not connected to 192.168.146.30:8107 yet, server_id=42949673533 [R3][E112]Not connected to 192.168.146.30:8107 yet, server_id=42949673533
W20230724 16:13:19.785646 331 socket.cpp:1270] Fail to wait EPOLLOUT of fd=132: Connection timed out [110]
• search currently works
What should my next steps be? Given that the typesense-0 follower is not healthy, would I experience cluster downtime if I restart typesense-1?
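For reference, the per-node health check Charlie describes is an unauthenticated GET against each node's API port; a minimal sketch, assuming pod DNS names for this Kubernetes setup and the default API port 8108:

# Check each node individually (hostnames are assumptions; use your pods' addresses)
curl http://typesense-0.typesense:8108/health
curl http://typesense-1.typesense:8108/health
curl http://typesense-2.typesense:8108/health
# A healthy node answers {"ok":true}; an unhealthy one returns {"ok":false}.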
Jason
06:47 PM Could you also share the last 100 lines of logs from each of the nodes?
Jason
06:48 PM Yes, a 3-node cluster can only handle the failure of 1 node. If 2 nodes fail, the 3rd node goes into an error state.
The way to recover in this case is described here: https://typesense.org/docs/guide/high-availability.html#recovering-a-cluster-that-has-lost-quorum
The steps described there are also a good way to recover a cluster that has lost quorum for any of a variety of reasons.
Charlie
06:49 PM
Charlie
06:49 PM
Charlie
06:50 PM
Charlie
06:52 PM 192.168.147.30:8107:8108,192.168.131.48:8107:8108,192.168.146.44:8107:8108
Charlie
06:53 PM I don't know if it's possible to recover without introducing downtime, however.
Jason
06:53 PM 192.168.146.30:8107 appears in the typesense-2 logs, but it's not in the nodes file.
Jason
06:54 PM Since typesense-2 is the current leader, make it a single-node leader by leaving only its IP in the nodes file.
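As an illustration of that step, the nodes file is reduced to a single entry in peering_ip:peering_port:api_port format (the path below is an assumption; use whatever --nodes points at in your deployment):

# Keep only the current leader (typesense-2) in the nodes file; restore the full
# list later so the other nodes can rejoin.
echo "<typesense-2-ip>:8107:8108" > /usr/share/typesense/nodes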
Jason
06:55 PM
Jason
06:56 PM
Jason
06:56 PM
Jason
06:56 PM
Charlie
06:57 PM
Jason
06:59 PM Here’s how to address it with recent RC builds: https://github.com/typesense/typesense/issues/465#issuecomment-1487780839
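The linked comment is about the --reset-peers-on-error option available in recent RC builds; a hedged sketch of starting the server with it (the data dir, API key, and nodes path are placeholders, not values from this thread):

typesense-server --data-dir=/usr/share/typesense/data \
  --api-key=$TYPESENSE_API_KEY \
  --nodes=/usr/share/typesense/nodes \
  --reset-peers-on-error   # resets raft peers if the node ends up in an error state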
Jul 25, 2023 (2 months ago)
Charlie
04:24 PM Interestingly, we're on rc34 and were using --reset-peers-on-error when this issue occurred.
Jul 28, 2023 (2 months ago)
Adrian
03:24 AM
W20230728 03:23:58.060972 262 raft_server.cpp:586] Node with no leader. Resetting peers of size: 3
W20230728 03:23:58.060983 262 node.cpp:898] node default_group:192.168.147.231:8107:8108 is in state ERROR, can't reset_peer
Adrian
03:28 AM
Adrian
03:29 AM
Jason
03:33 AM If a node goes into state ERROR, then it requires a process restart, even if you add more resources to it while it's still running.
Adrian
03:34 AM typesense-0 keeps getting disrupted 1 prevote responses to it without it becoming a leader. What does this prevote response mean, and why could that node not become a follower, from what you know?
Adrian
03:34 AM > If a node goes into state ERROR, then it requires a process restart, even if you add more resources to it while it's still running.
Good to know!
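Since the nodes here are Kubernetes pods (typesense-0/1/2), the process restart Jason mentions would typically be done by deleting the wedged pod and letting the StatefulSet recreate it; a sketch, with the namespace assumed:

# The pod comes back with the same persistent volume, so the data directory is kept.
kubectl delete pod typesense-0 -n typesense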
Jason
03:36 AM
Kishore Nallan
12:02 PM Also, if memory/disk runs out, this can cause all kinds of weird issues.
It's difficult to know what happened here, but did you try what Jason suggested above to recover the cluster (making the functioning node the sole node by modifying its nodes file)?
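Given that the root cause here turned out to be a full disk (see the TLDR), a quick resource check on each node is a cheap first step; a sketch, with the data-dir path assumed:

# Free space on the volume backing --data-dir (path is an assumption)
df -h /usr/share/typesense/data
# Free memory on the node
free -m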
Adrian
02:02 PM For all voluntary rotations this is upheld, but we cannot guarantee random failures don't occur, of course.
Could the force reset of peers be done in more situations, instead of just on an error during startup? In our cluster we never change the number of nodes, and really no nodes enter or leave; they just restart (with a different IP but the same persisted state). I would need to think about it more, but I don't think a force reset should ever cause a problem.
Adrian
02:02 PM
Kishore Nallan
02:04 PM POST /operations/reset_peers
Must run on each node.
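A sketch of calling that endpoint, repeated against every node as Kishore notes (host and key are placeholders):

curl -X POST -H "x-typesense-api-key: $TYPESENSE_API_KEY" http://localhost:8108/operations/reset_peers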
Adrian
02:08 PM
• There is one leader and 2 followers
• All nodes have a long write queue
• For some reason the leader cannot make progress writing
I see this error in its logs. Do you know what it means? And would you suggest picking that node as the single-node leader?
W20230728 14:06:25.658550 262 node.cpp:843] [default_group:192.168.146.211:8107:8108 ] Refusing concurrent configuration changing
E20230728 14:06:25.658622 337 raft_server.h:62] Peer refresh failed, error: Doing another configuration change
Kishore Nallan
02:11 PM
Adrian
02:36 PM
Jul 31, 2023 (2 months ago)
Adrian
02:12 PM
I20230731 14:04:47.230753 261 node.cpp:943] node default_group:192.168.131.161:8107:8108 starts to do snapshot
E20230731 14:04:47.235330 337 raft_server.cpp:989] Timed snapshot failed, error: Is saving another snapshot, code: 16
E20230731 14:04:50.235631 261 raft_server.cpp:640] 25166 queued writes > healthy read lag of 1000
E20230731 14:04:50.235687 261 raft_server.cpp:652] 25166 queued writes > healthy write lag of 500
I20230731 14:04:57.236423 261 raft_server.cpp:562] Term: 501, last_index index: 9001043, committed_index: 9001043, known_applied_index: 9001027, applying_index: 0, queued_writes: 25166, pending_queue_size: 0, local_sequence: 72315653
[… the "queued writes > healthy read lag of 1000" / "healthy write lag of 500" errors, the raft status lines, and the periodic "Running GC for aborted requests" messages repeat every few seconds for several minutes, with queued_writes stuck at 25166 and committed_index at 9001043 …]
Adrian
02:13 PM
curl -H "x-typesense-api-key: $TYPESENSE_API_KEY" localhost:8108/debug
{"state":4,"version":"0.25.0.rc34"}%
Kishore Nallan
02:13 PM
Kishore Nallan
02:14 PM
Adrian
02:14 PM
Adrian
02:15 PM
Kishore Nallan
02:15 PM state: 4 is follower, not leader.
Adrian
02:15 PM
Kishore Nallan
02:15 PM state: 1
Adrian
02:16 PM
Adrian
02:16 PM
Kishore Nallan
02:17 PM If you edit the nodes file to a single IP, it should automatically reset the peers in 60 seconds. However, here the state I think is fully corrupted somehow.
Kishore Nallan
02:17 PM
Adrian
02:18 PM I20230731 14:04:47.230753 261 node.cpp:943] node default_group:192.168.131.161:8107:8108 starts to do snapshot
Adrian
02:19 PM
Adrian
02:19 PM
Adrian
02:22 PM
> E20230731 14:21:47.378849 337 raft_server.h:62] Peer refresh failed, error: Doing another configuration change
So I guess it is stuck doing a config change? Do you know why that could be?
Kishore Nallan
02:29 PM
Kishore Nallan
02:30 PM
Adrian
04:21 PM
Adrian
04:21 PM
> Update to disk failed. Will restore old document
Adrian
10:38 PM
Adrian
10:39 PM
Aug 01, 2023 (2 months ago)
Kishore Nallan
12:42 AM