Typesense Cluster Operations and Recovery
TLDR Adrian asked why Typesense stops serving reads after quorum loss and why recovery seems to require manual intervention in a high-availability setup. Kishore Nallan clarified the design decisions and noted that the v0.25 RC recovers automatically in specific scenarios.
Mar 28, 2023 (8 months ago)
Adrian
1. It makes sense to me that Typesense stops accepting writes when quorum is lost, but why does it stop accepting reads? Given that reads are "served by the node that receives it" during normal operation, I don't see why reads could not continue to be served by any nodes still running once quorum is lost. The reads should be no more out of date than is already possible in normal operation, unless I am missing something here.
2. Why can the cluster not recover from quorum loss without manual intervention? If quorum is lost but the down nodes then come back online, I would expect a normal election to be possible and a new leader to be elected. The documentation cites the risk of a split brain, but afaik this is not possible in raft, since any write requires an ack from a majority of nodes, so there can be only one active leader at a time.
I appreciate any input on these points!
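The majority-ack argument in question 2 can be checked with a little arithmetic (a toy illustration of the raft quorum-overlap property, not anything from Typesense's codebase):

```python
def quorum(n: int) -> int:
    # A raft quorum is a strict majority of the n voting nodes.
    return n // 2 + 1

# Two disjoint majorities cannot exist in the same cluster, so two
# leaders can never both be committing writes: any two quorums overlap.
for n in (3, 5, 7):
    assert quorum(n) + quorum(n) > n

print("any two quorums overlap; at most one leader can commit writes")
```

This is the standard reason raft avoids split brain: a second "leader" that cannot reach a majority can never commit anything.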
Mar 29, 2023 (8 months ago)
Kishore Nallan 12:29 AM
2. The cluster does recover from quorum loss automatically. If you mean the issues we have seen on Kubernetes, those are different in nature. The raft nodes are identified by their IP address, so when all 3 pods are replaced and come back up with new IP addresses, the new pods think they are misconfigured because their IP addresses do not match what's in the internal cluster state. I've just done some work to overcome this issue automatically now. Please try this out and let me know if that helps: https://typesense-community.slack.com/archives/C01P749MET0/p1680002564216939?thread_ts=1678739509.970809&cid=C01P749MET0
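The mismatch Kishore describes can be sketched as a simple membership check. This is purely illustrative (the entry format mimics a Typesense nodes-file line; the check itself is not Typesense's actual implementation):

```python
def known_peer(nodes_entries, my_ip):
    # Toy version of the situation described above: a restarted pod
    # compares its own IP against the peer list persisted in the raft
    # cluster state. If its new IP is absent, it considers itself
    # misconfigured. (Illustrative only, not Typesense's real logic.)
    return any(entry.split(":")[0] == my_ip for entry in nodes_entries)

# Persisted cluster state still lists the old pod IPs...
state = ["10.0.0.5:8107:8108", "10.0.0.6:8107:8108", "10.0.0.7:8107:8108"]

# ...but after a rolling replacement the pod came back with a new IP:
print(known_peer(state, "10.0.1.9"))   # False: node thinks it is misconfigured
print(known_peer(state, "10.0.0.5"))   # True: an unchanged IP would still match
```

This is why the problem only bites when all pods are replaced at once: with no surviving member holding a quorum, nobody can legitimize the new addresses.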
Adrian 01:36 AM
2. Makes sense and great to hear! I will try it out. But does that mean this paragraph from the documentation is not accurate?
> If a Typesense cluster loses more than (N-1)/2 nodes at the same time, the cluster becomes unstable because it loses quorum and the remaining node(s) cannot safely build consensus on which node is the leader. To avoid a potential split brain issue, Typesense then stops accepting writes and reads until some manual verification and intervention is done.
This makes me think manual intervention is required in the case of quorum loss, but what you said about it being automatic fits with what I would expect.
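The (N-1)/2 threshold in the quoted docs follows directly from the majority rule; a quick sanity check (assuming the usual raft strict-majority quorum):

```python
def quorum_size(n: int) -> int:
    # A strict majority of n voting nodes.
    return n // 2 + 1

def has_quorum(n: int, lost: int) -> bool:
    # Quorum survives as long as the remaining nodes form a majority.
    return (n - lost) >= quorum_size(n)

for n in (3, 5, 7):
    max_safe_loss = (n - 1) // 2
    assert has_quorum(n, max_safe_loss)          # losing up to (N-1)/2 nodes is tolerable
    assert not has_quorum(n, max_safe_loss + 1)  # losing more than (N-1)/2 breaks quorum
```

So for a 3-node cluster, losing 1 node is fine, but losing 2 simultaneously breaks quorum.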
Kishore Nallan 03:01 AM
The auto recovery is specifically for the scenario where the nodes come back up with different IPs and try to form the cluster again. This previously required manual intervention in v0.24, but with the v0.25 RC (which is still a beta build) we can handle it automatically. It still needs more testing.