#community-help

Debugging and Recovery of a Stuck Typesense Cluster

TLDR Charlie had a wedged staging cluster. Jason and Kishore Nallan provided debugging and recovery steps, and Charlie's coworker Adrian dug into the details. The root cause turned out to be insufficient disk space; once Adrian increased the disk size, the cluster healed itself.

Solved
Jul 24, 2023 (2 months ago)
Charlie
04:18 PM
Hello, I've gotten my staging cluster wedged. I'd like to understand how to debug it and get it healthy with minimal downtime.
• typesense-0 is not healthy, but typesense-1 and -2 are (from curling /health)
• typesense-2 is the leader
• typesense-1 logs look OK. It's doing GC and getting prevotes from typesense-0.
• typesense-0 logs (note the disrupted flag):
I20230724 16:14:58.435681   271 node.cpp:1504] node default_group:192.168.146.44:8107:8108 received PreVoteResponse from 192.168.147.30:8107:8108 term 406 granted 0 rejected_by_lease 0 disrupted 0
I20230724 16:14:58.435974   273 node.cpp:1504] node default_group:192.168.146.44:8107:8108 received PreVoteResponse from 192.168.131.48:8107:8108 term 406 granted 0 rejected_by_lease 0 disrupted 1

• It appears that typesense-2 cannot issue RPCs to typesense-1. Logs for typesense-2:
W20230724 16:13:19.779845   332 replicator.cpp:397] Group default_group fail to issue RPC to 192.168.146.30:8107:8108 _consecutive_error_times=20521, [E112]Not connected to 192.168.146.30:8107 yet, server_id=42949673533 [R1][E112]Not connected to 192.168.146.30:8107 yet, server_id=42949673533 [R2][E112]Not connected to 192.168.146.30:8107 yet, server_id=42949673533 [R3][E112]Not connected to 192.168.146.30:8107 yet, server_id=42949673533
W20230724 16:13:19.785646   331 socket.cpp:1270] Fail to wait EPOLLOUT of fd=132: Connection timed out [110]

• Search currently works.
What should my next steps be? Given that the typesense-0 follower is not healthy, if I restart typesense-1, would I experience cluster downtime?
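For reference, the per-node health status mentioned above comes from each node's /health endpoint; a quick way to check all three (hostnames and the default API port 8108 are assumptions here) is:
curl http://typesense-0:8108/health
curl http://typesense-1:8108/health
curl http://typesense-2:8108/health
# a healthy node responds with {"ok":true}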
Jason
06:47 PM
Could you share the contents of your nodes file?

And also the last 100 lines of logs from each of the nodes?
Jason
06:48 PM
> if I restart typesense-1, would I experience cluster downtime?
Yes, a 3-node cluster can only handle a failure of 1 node. If 2 nodes fail, then the 3rd node goes into an error state.

The way to recover in this case is described here: https://typesense.org/docs/guide/high-availability.html#recovering-a-cluster-that-has-lost-quorum

The steps described there would also be a good way to recover a cluster that has lost quorum for a variety of reasons.
Charlie
06:52 PM
The nodes file is similar across all nodes, except with a different order of IPs:
192.168.147.30:8107:8108,192.168.131.48:8107:8108,192.168.146.44:8107:8108
Charlie
06:53 PM
Interestingly, our cluster is still functioning in its current state. Even though node 2 is struggling to talk to node 1, they are both marked as healthy.

I don't know if it's possible to recover without introducing downtime, however.
Jason
06:53 PM
I see the IP address 192.168.146.30:8107 in the typesense-2 logs, but it's not in the nodes file
Jason
06:54 PM
Here’s the best course of action:

Since typesense-2 is the current leader, make it a single-node cluster by leaving only its IP in the nodes file
Jason
06:55 PM
Without having to restart, that node should eventually log “Peer Refresh Succeeded”
Jason
06:56 PM
Once you see that, stop the Typesense process on the other two nodes, then clear the data dir on them, and then add one node back into the cluster
Jason
06:56 PM
Once it is fully caught up with the data, then add the other node back into the cluster
Jason
06:56 PM
With this, you should be able to recover without downtime
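A rough sketch of that sequence (the file paths, pod names, and IP placeholder below are illustrative, not taken from the thread):
# 1. On typesense-2 (current leader): shrink the cluster to just itself
echo "<typesense-2-ip>:8107:8108" > /etc/typesense/nodes
#    then watch its logs for "Peer Refresh Succeeded" (no restart needed)

# 2. On typesense-0 and typesense-1: stop the Typesense process, then clear the data dir
rm -rf /var/lib/typesense/data/*

# 3. Add one node back by listing both IPs in the nodes files and starting it;
#    wait until it has fully caught up, then repeat for the last node.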
Charlie
06:57 PM
Thank you Jason! I will report back in a few hours. It sounds like the root cause may be that a pod got restarted, leading to a change in IPs, and the node list is not being picked up correctly by the Typesense leader.
Jason
06:59 PM
Ah yup, on k8s this is a known issue.

Here’s how to address it with recent RC builds: https://github.com/typesense/typesense/issues/465#issuecomment-1487780839
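For context, the fix discussed in that issue is the --reset-peers-on-error server flag; a minimal invocation might look like this (the other paths and options shown are illustrative):
typesense-server \
  --data-dir=/var/lib/typesense/data \
  --api-key=$TYPESENSE_API_KEY \
  --nodes=/etc/typesense/nodes \
  --reset-peers-on-error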
Jul 25, 2023 (2 months ago)
Charlie
04:24 PM
Appreciate those details.
Interestingly, we're on rc34 and were using --reset-peers-on-error when this issue occurred.
Jul 28, 2023 (2 months ago)
Adrian
03:24 AM
hey Jason are you familiar with this message?
W20230728 03:23:58.060972   262 raft_server.cpp:586] Node with no leader. Resetting peers of size: 3
W20230728 03:23:58.060983   262 node.cpp:898] node default_group:192.168.147.231:8107:8108 is in state ERROR, can't reset_peer
Adrian
03:28 AM
Unfortunately the staging cluster is still wedged. We rolled out an increase in the memory limit, because while this problem was going on we started hitting memory limits and getting OOM kills
Adrian
03:29 AM
I would like to understand the root cause of how the cluster could get into a broken state like this. As my coworker Charlie mentioned, we are on rc34 and had --reset-peers-on-error set, so I would have expected the cluster to be able to recover on its own
Jason
03:33 AM
Once there's an OOM or out-of-disk condition and a node goes into state ERROR, it requires a process restart, even if you add more resources to it while it's still running
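On k8s, that restart usually just means recreating the pod; for example (the pod and StatefulSet names here are assumptions):
kubectl delete pod typesense-0            # the StatefulSet controller recreates it
# or cycle all nodes one at a time:
kubectl rollout restart statefulset typesense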
Adrian
03:34 AM
It seems like in the original wedged cluster the crux of the problem was that typesense-2 could not reset peers, because it was the leader and did not have quorum. This makes sense, but I don't understand why it couldn't form quorum with typesense-0, as it seemed to be in contact with this peer but was sending disrupted 1 prevote responses to it without it becoming a leader. What does this prevote response mean, and why could that node not become a follower, from what you know?
Adrian
03:34 AM
> Once there’s an OOM or out of disk and a node goes into state ERROR, then it requires a process restart even if you add more resources to it while it’s still running
good to know!
Jason
03:36 AM
I’ll have to defer to Kishore Nallan on this ^


Kishore Nallan
12:02 PM
Raft clustering works on the guarantee of quorum. With k8s, if you are unable to guarantee that 2/3 nodes are always up and that rotations happen only in a rolling fashion, then it's going to cause a lot of pain, in spite of the recent reset-peers flag (it helps, but does not guard against split-brain).

Also, if the memory/disk runs out, this can cause all kinds of weird issues.

It's difficult to know what happened here, but did you try what Jason suggested above to recover the cluster (by making the functioning node the sole node by modifying its nodes file)?
Adrian
02:02 PM
> guarantee that 2/3 nodes are always up and rotations happen only in a rolling fashion
For all voluntary rotations this is upheld, but we cannot guarantee random failures don't occur, of course.

Could the force reset of peers be done in more situations, instead of just on an error during startup? In our cluster we never change the number of nodes and really no nodes enter or leave; they just restart (with a different IP but the same persisted state). I would need to think about it more, but I don't think a force reset should ever cause a problem.
Adrian
02:02 PM
In terms of the current state of the cluster, there are no healthy nodes, but I will pick one and make it a single-node leader
Kishore Nallan
02:04 PM
There is also an API for reset peers

POST /operations/reset_peers

Must run on each node.
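Following the curl style used later in the thread, that call would look something like this on each node:
curl -X POST -H "x-typesense-api-key: $TYPESENSE_API_KEY" http://localhost:8108/operations/reset_peers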


Adrian
02:08 PM
The current state is:
• There is one leader and 2 followers
• All nodes have a long write queue
• For some reason the leader cannot make progress writing
I see this error in its logs. Do you know what it means? And would you suggest picking that node as the single-node leader?
W20230728 14:06:25.658550   262 node.cpp:843] [default_group:192.168.146.211:8107:8108 ] Refusing concurrent configuration changing
E20230728 14:06:25.658622   337 raft_server.h:62] Peer refresh failed, error: Doing another configuration change
Kishore Nallan
02:11 PM
Yes pick that. The error means that a previous config change is still pending.


Adrian
02:36 PM
Is there a way to tell what the current configuration change is and why it's stuck?
Jul 31, 2023 (2 months ago)
Adrian
02:12 PM
I had to modify our deployment so I could manually override the nodes list without restarting a pod. I have now done so for one node, making its node list length 1, consisting only of the pod's own IP address, but it does not seem to be becoming the leader. Do I have to pick a node that already thinks it's the leader? Or will this node eventually start acting as I expect?
I20230731 14:04:47.230753   261 node.cpp:943] node default_group:192.168.131.161:8107:8108 starts to do snapshot
E20230731 14:04:47.235330   337 raft_server.cpp:989] Timed snapshot failed, error: Is saving another snapshot, code: 16
E20230731 14:04:50.235631   261 raft_server.cpp:640] 25166 queued writes > healthy read lag of 1000
E20230731 14:04:50.235687   261 raft_server.cpp:652] 25166 queued writes > healthy write lag of 500
I20230731 14:04:57.236423   261 raft_server.cpp:562] Term: 501, last_index index: 9001043, committed_index: 9001043, known_applied_index: 9001027, applying_index: 0, queued_writes: 25166, pending_queue_size: 0, local_sequence: 72315653
E20230731 14:04:59.236770   261 raft_server.cpp:640] 25166 queued writes > healthy read lag of 1000
E20230731 14:04:59.236827   261 raft_server.cpp:652] 25166 queued writes > healthy write lag of 500
I20230731 14:05:07.237604   261 raft_server.cpp:562] Term: 501, last_index index: 9001043, committed_index: 9001043, known_applied_index: 9001027, applying_index: 0, queued_writes: 25166, pending_queue_size: 0, local_sequence: 72315653
E20230731 14:05:08.237756   261 raft_server.cpp:640] 25166 queued writes > healthy read lag of 1000
E20230731 14:05:08.237812   261 raft_server.cpp:652] 25166 queued writes > healthy write lag of 500
I20230731 14:05:17.238811   261 raft_server.cpp:562] Term: 501, last_index index: 9001043, committed_index: 9001043, known_applied_index: 9001027, applying_index: 0, queued_writes: 25166, pending_queue_size: 0, local_sequence: 72315653
E20230731 14:05:17.238858   261 raft_server.cpp:640] 25166 queued writes > healthy read lag of 1000
E20230731 14:05:17.238868   261 raft_server.cpp:652] 25166 queued writes > healthy write lag of 500
E20230731 14:05:26.240017   261 raft_server.cpp:640] 25166 queued writes > healthy read lag of 1000
E20230731 14:05:26.240113   261 raft_server.cpp:652] 25166 queued writes > healthy write lag of 500
I20230731 14:05:27.240295   261 raft_server.cpp:562] Term: 501, last_index index: 9001043, committed_index: 9001043, known_applied_index: 9001027, applying_index: 0, queued_writes: 25166, pending_queue_size: 0, local_sequence: 72315653
I20230731 14:05:27.674870   262 batched_indexer.cpp:284] Running GC for aborted requests, req map size: 15634
E20230731 14:05:35.241026   261 raft_server.cpp:640] 25166 queued writes > healthy read lag of 1000
E20230731 14:05:35.241086   261 raft_server.cpp:652] 25166 queued writes > healthy write lag of 500
I20230731 14:05:37.241359   261 raft_server.cpp:562] Term: 501, last_index index: 9001043, committed_index: 9001043, known_applied_index: 9001027, applying_index: 0, queued_writes: 25166, pending_queue_size: 0, local_sequence: 72315653
E20230731 14:05:44.242317   261 raft_server.cpp:640] 25166 queued writes > healthy read lag of 1000
E20230731 14:05:44.242376   261 raft_server.cpp:652] 25166 queued writes > healthy write lag of 500
I20230731 14:05:47.242801   261 raft_server.cpp:562] Term: 501, last_index index: 9001043, committed_index: 9001043, known_applied_index: 9001027, applying_index: 0, queued_writes: 25166, pending_queue_size: 0, local_sequence: 72315653
E20230731 14:05:53.243332   261 raft_server.cpp:640] 25166 queued writes > healthy read lag of 1000
E20230731 14:05:53.243386   261 raft_server.cpp:652] 25166 queued writes > healthy write lag of 500
I20230731 14:05:57.243854   261 raft_server.cpp:562] Term: 501, last_index index: 9001043, committed_index: 9001043, known_applied_index: 9001027, applying_index: 0, queued_writes: 25166, pending_queue_size: 0, local_sequence: 72315653
E20230731 14:06:02.244403   261 raft_server.cpp:640] 25166 queued writes > healthy read lag of 1000
E20230731 14:06:02.244460   261 raft_server.cpp:652] 25166 queued writes > healthy write lag of 500
I20230731 14:06:07.244976   261 raft_server.cpp:562] Term: 501, last_index index: 9001043, committed_index: 9001043, known_applied_index: 9001027, applying_index: 0, queued_writes: 25166, pending_queue_size: 0, local_sequence: 72315653
E20230731 14:06:11.245385   261 raft_server.cpp:640] 25166 queued writes > healthy read lag of 1000
E20230731 14:06:11.245450   261 raft_server.cpp:652] 25166 queued writes > healthy write lag of 500
I20230731 14:06:17.247084   261 raft_server.cpp:562] Term: 501, last_index index: 9001043, committed_index: 9001043, known_applied_index: 9001027, applying_index: 0, queued_writes: 25166, pending_queue_size: 0, local_sequence: 72315653
E20230731 14:06:20.247385   261 raft_server.cpp:640] 25166 queued writes > healthy read lag of 1000
E20230731 14:06:20.247442   261 raft_server.cpp:652] 25166 queued writes > healthy write lag of 500
I20230731 14:06:27.248173   261 raft_server.cpp:562] Term: 501, last_index index: 9001043, committed_index: 9001043, known_applied_index: 9001027, applying_index: 0, queued_writes: 25166, pending_queue_size: 0, local_sequence: 72315653
I20230731 14:06:28.688766   262 batched_indexer.cpp:284] Running GC for aborted requests, req map size: 15634
E20230731 14:06:29.248378   261 raft_server.cpp:640] 25166 queued writes > healthy read lag of 1000
E20230731 14:06:29.248430   261 raft_server.cpp:652] 25166 queued writes > healthy write lag of 500
I20230731 14:06:37.249398   261 raft_server.cpp:562] Term: 501, last_index index: 9001043, committed_index: 9001043, known_applied_index: 9001027, applying_index: 0, queued_writes: 25166, pending_queue_size: 0, local_sequence: 72315653
E20230731 14:06:38.249603   261 raft_server.cpp:640] 25166 queued writes > healthy read lag of 1000
E20230731 14:06:38.249656   261 raft_server.cpp:652] 25166 queued writes > healthy write lag of 500
I20230731 14:06:47.250573   261 raft_server.cpp:562] Term: 501, last_index index: 9001043, committed_index: 9001043, known_applied_index: 9001027, applying_index: 0, queued_writes: 25166, pending_queue_size: 0, local_sequence: 72315653
E20230731 14:06:47.250609   261 raft_server.cpp:640] 25166 queued writes > healthy read lag of 1000
E20230731 14:06:47.250617   261 raft_server.cpp:652] 25166 queued writes > healthy write lag of 500
E20230731 14:06:56.251624   261 raft_server.cpp:640] 25166 queued writes > healthy read lag of 1000
E20230731 14:06:56.251673   261 raft_server.cpp:652] 25166 queued writes > healthy write lag of 500
I20230731 14:06:57.251860   261 raft_server.cpp:562] Term: 501, last_index index: 9001043, committed_index: 9001043, known_applied_index: 9001027, applying_index: 0, queued_writes: 25166, pending_queue_size: 0, local_sequence: 72315653
E20230731 14:07:05.252596   261 raft_server.cpp:640] 25166 queued writes > healthy read lag of 1000
E20230731 14:07:05.252655   261 raft_server.cpp:652] 25166 queued writes > healthy write lag of 500
I20230731 14:07:07.252905   261 raft_server.cpp:562] Term: 501, last_index index: 9001043, committed_index: 9001043, known_applied_index: 9001027, applying_index: 0, queued_writes: 25166, pending_queue_size: 0, local_sequence: 72315653
E20230731 14:07:14.253587   261 raft_server.cpp:640] 25166 queued writes > healthy read lag of 1000
E20230731 14:07:14.253648   261 raft_server.cpp:652] 25166 queued writes > healthy write lag of 500
I20230731 14:07:17.253998   261 raft_server.cpp:562] Term: 501, last_index index: 9001043, committed_index: 9001043, known_applied_index: 9001027, applying_index: 0, queued_writes: 25166, pending_queue_size: 0, local_sequence: 72315653
E20230731 14:07:23.254606   261 raft_server.cpp:640] 25166 queued writes > healthy read lag of 1000
E20230731 14:07:23.254665   261 raft_server.cpp:652] 25166 queued writes > healthy write lag of 500
I20230731 14:07:27.255800   261 raft_server.cpp:562] Term: 501, last_index index: 9001043, committed_index: 9001043, known_applied_index: 9001027, applying_index: 0, queued_writes: 25166, pending_queue_size: 0, local_sequence: 72315653
I20230731 14:07:29.695968   262 batched_indexer.cpp:284] Running GC for aborted requests, req map size: 15634
E20230731 14:07:32.257431   261 raft_server.cpp:640] 25166 queued writes > healthy read lag of 1000
E20230731 14:07:32.257488   261 raft_server.cpp:652] 25166 queued writes > healthy write lag of 500
I20230731 14:07:37.258703   261 raft_server.cpp:562] Term: 501, last_index index: 9001043, committed_index: 9001043, known_applied_index: 9001027, applying_index: 0, queued_writes: 25166, pending_queue_size: 0, local_sequence: 72315653
E20230731 14:07:41.259912   261 raft_server.cpp:640] 25166 queued writes > healthy read lag of 1000
E20230731 14:07:41.259964   261 raft_server.cpp:652] 25166 queued writes > healthy write lag of 500
I20230731 14:07:47.260658   261 raft_server.cpp:562] Term: 501, last_index index: 9001043, committed_index: 9001043, known_applied_index: 9001027, applying_index: 0, queued_writes: 25166, pending_queue_size: 0, local_sequence: 72315653
E20230731 14:07:50.260986   261 raft_server.cpp:640] 25166 queued writes > healthy read lag of 1000
E20230731 14:07:50.261049   261 raft_server.cpp:652] 25166 queued writes > healthy write lag of 500
I20230731 14:07:57.261734   261 raft_server.cpp:562] Term: 501, last_index index: 9001043, committed_index: 9001043, known_applied_index: 9001027, applying_index: 0, queued_writes: 25166, pending_queue_size: 0, local_sequence: 72315653
E20230731 14:07:59.261977   261 raft_server.cpp:640] 25166 queued writes > healthy read lag of 1000
E20230731 14:07:59.262039   261 raft_server.cpp:652] 25166 queued writes > healthy write lag of 500
I20230731 14:08:07.262941   261 raft_server.cpp:562] Term: 501, last_index index: 9001043, committed_index: 9001043, known_applied_index: 9001027, applying_index: 0, queued_writes: 25166, pending_queue_size: 0, local_sequence: 72315653
E20230731 14:08:08.263092   261 raft_server.cpp:640] 25166 queued writes > healthy read lag of 1000
E20230731 14:08:08.263175   261 raft_server.cpp:652] 25166 queued writes > healthy write lag of 500
I20230731 14:08:17.264170   261 raft_server.cpp:562] Term: 501, last_index index: 9001043, committed_index: 9001043, known_applied_index: 9001027, applying_index: 0, queued_writes: 25166, pending_queue_size: 0, local_sequence: 72315653
E20230731 14:08:17.264204   261 raft_server.cpp:640] 25166 queued writes > healthy read lag of 1000
E20230731 14:08:17.264210   261 raft_server.cpp:652] 25166 queued writes > healthy write lag of 500
E20230731 14:08:26.265041   261 raft_server.cpp:640] 25166 queued writes > healthy read lag of 1000
E20230731 14:08:26.265100   261 raft_server.cpp:652] 25166 queued writes > healthy write lag of 500
I20230731 14:08:27.266731   261 raft_server.cpp:562] Term: 501, last_index index: 9001043, committed_index: 9001043, known_applied_index: 9001027, applying_index: 0, queued_writes: 25166, pending_queue_size: 0, local_sequence: 72315653
I20230731 14:08:30.702734   262 batched_indexer.cpp:284] Running GC for aborted requests, req map size: 15634
E20230731 14:08:35.267469   261 raft_server.cpp:640] 25166 queued writes > healthy read lag of 1000
E20230731 14:08:35.267518   261 raft_server.cpp:652] 25166 queued writes > healthy write lag of 500
I20230731 14:08:37.267751   261 raft_server.cpp:562] Term: 501, last_index index: 9001043, committed_index: 9001043, known_applied_index: 9001027, 
Adrian
02:13 PM
curl -H "x-typesense-api-key: $TYPESENSE_API_KEY" localhost:8108/debug
{"state":4,"version":"0.25.0.rc34"}%
Kishore Nallan
02:13 PM
Wait for it to finish replaying the queued writes
Kishore Nallan
02:14 PM
Is that counter moving?
Adrian
02:14 PM
nope
Adrian
02:15 PM
I was assuming it would have to be the leader in order for the queue to progress
Kishore Nallan
02:15 PM
state: 4 is follower, not leader.
Adrian
02:15 PM
yup
Kishore Nallan
02:15 PM
Leader is state: 1
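So a quick way to find which node (if any) currently considers itself the leader is to hit /debug on each of them (hostnames are assumptions here):
for h in typesense-0 typesense-1 typesense-2; do
  curl -s -H "x-typesense-api-key: $TYPESENSE_API_KEY" "http://$h:8108/debug"; echo
done
# state 1 = leader, state 4 = follower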
Adrian
02:16 PM
yup
Adrian
02:16 PM
> I have now done so for one node, making its node list length 1, consisting only of the pod's own IP address, but it does not seem to be becoming the leader.
Kishore Nallan
02:17 PM
Once you reduce the nodes file to a single IP, it should automatically reset the peers in 60 seconds. However, here I think the state is fully corrupted somehow.
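One way to watch for that (the pod name is an assumption) is to follow the node's logs for the peer-refresh messages Jason mentioned earlier:
kubectl logs -f typesense-2 | grep -i "peer refresh"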
Kishore Nallan
02:17 PM
I don't think there is any other choice but to restart.
Adrian
02:18 PM
I think the peers did reset based on this
I20230731 14:04:47.230753 261 node.cpp:943] node default_group:192.168.131.161:8107:8108 starts to do snapshot
Adrian
02:19 PM
it seems to have the correct list of the single IP address
Adrian
02:19 PM
and when you say restart do you mean just the node? or wipe the entire cluster?
Adrian
02:22 PM
ah I am seeing this log message again
> E20230731 14:21:47.378849 337 raft_server.h:62] Peer refresh failed, error: Doing another configuration change
so I guess it is stuck doing a config change? Do you know why that could be?
Kishore Nallan
02:29 PM
Yes a previous change is preventing further progress.
Kishore Nallan
02:30 PM
Not sure how it could have got into that state, but it typically happens when node IPs change too fast, like pods coming up and going offline.
Adrian
04:21 PM
So even after a restart, the nodes are ending up in the same state, stuck in the middle of a config change
Adrian
04:21 PM
I am seeing this log message. Could this be related to the problem?
> Update to disk failed. Will restore old document
Adrian
10:38 PM
ah so it looks like disk space was the only issue
Adrian
10:39 PM
after increasing disk size the cluster healed itself. I did not need to manually modify the nodes list
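For completeness, the disk pressure itself could have been confirmed from inside a pod with something like the following (the pod name and data-dir mount path are assumptions):
kubectl exec -it typesense-0 -- df -h /var/lib/typesense/data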


Aug 01, 2023 (2 months ago)
Kishore Nallan
12:42 AM
Good to know, I was wondering why it was being so stubborn!
