# community-help
s
I'm getting the following error:
[Errno 503] Not Ready or Lagging
This happens in my development environment, on GKE, with 3 pods. Combined CPU usage is 0.01 and RAM usage is 0.17 GB. Nothing is being indexed and nobody else is using the search at the moment, but we get this error when searching (and when trying to index new documents). The documentation says to add more resources, but I don't think we're short on them right now. Restarting the pods results in the same issue. Does anybody have any ideas where to look?
Found this in the logs:
Multi-node with no leader: refusing to reset peers.
j
Looks like the cluster has lost quorum because a majority of the nodes have failed. You want to re-establish the cluster by following these instructions: https://typesense.org/docs/guide/high-availability.html#recovering-a-cluster-that-has-lost-quorum
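For reference, the gist of that recovery procedure is to force one node to run as a single-node cluster so it can elect itself leader, then re-add the others. A minimal sketch of the first step on k8s, assuming the nodes file is mounted from a ConfigMap and the pods belong to a StatefulSet named typesense (all names, namespace, and ports below are placeholders to adjust for your setup):

```yaml
# Quorum recovery, step 1 (sketch): shrink the nodes file to a single node
# so that node can become leader on its own. The nodes file format is
# host:peering_port:api_port, comma-separated when there are multiple nodes.
# Hostname, ConfigMap name, and key are assumptions for illustration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: typesense-nodes
data:
  nodes: "typesense-0.typesense.default.svc.cluster.local:8107:8108"
# Step 2: restart typesense-0 and wait until it reports healthy as a single-node leader.
# Step 3: restore all three entries in the nodes file and bring the other pods back up.
```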
s
Thanks, that did the trick!
But is there a way we can prevent that from happening?
j
In a 3-node cluster, you want to make sure that at least 2 nodes are always active. Typesense uses RAFT for consensus, which can only tolerate a max of 1 node failure in a 3-node cluster and a max of 2 node failures in a 5-node cluster. If the failed nodes recover, the cluster self-heals.
But if more nodes than that fail, the cluster blocks further reads/writes to avoid a split brain, and then requires manual intervention.
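(The underlying RAFT math: a cluster of N nodes needs a quorum of ⌊N/2⌋ + 1 healthy nodes to elect a leader and accept writes, so a 3-node cluster needs 2 healthy nodes and tolerates 1 failure, and a 5-node cluster needs 3 and tolerates 2.)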
s
Yes, I understand that, but I'm more interested in learning why a node would fail in the first place. This happened on our dev/staging environment, which runs on overpowered servers with almost no load and very little data indexed. We are currently going to production with Typesense and I want to absorb as much knowledge as I can while our search cluster grows.
j
Usual failure reasons are running out of memory (OOM), disk, CPU, etc. (the Typesense logs and syslog should have more info around this). With k8s, it usually ends up being because k8s is reallocating resources and shuts down multiple Typesense nodes at the same time.
s
Just checked the logs: 2 nodes were shut down, and it looks like the second was shut down while the node that was restarting was trying to connect to it. Is setting maxUnavailable to 2 (in a 3-node cluster) the solution, or do you have some other advice?
j
Here's a long thread with some general advice on running on k8s: https://github.com/typesense/typesense/issues/465#issuecomment-1689460898
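Not from that thread specifically, but one common k8s-side guardrail is a PodDisruptionBudget, so that voluntary disruptions (node drains, rollouts, autoscaler moves) never evict more than one Typesense pod at a time; in a 3-node cluster that means maxUnavailable of 1, not 2. A sketch, assuming the pods carry the label app: typesense (adjust the selector to your deployment):

```yaml
# PodDisruptionBudget sketch: allow at most one Typesense pod to be
# voluntarily evicted at a time, so the remaining two keep quorum.
# The app: typesense label is an assumption for illustration.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: typesense-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: typesense
```

Note this only covers voluntary evictions; involuntary failures (OOM kills, node crashes) still come down to the resource-level causes mentioned above.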