# community-help
s
I'm getting the following error:
[Errno 503] Not Ready or Lagging
This happens in my development environment, on GKE, with 3 pods. Combined CPU usage is 0.01 and RAM usage is 0.17 GB. Nothing is being indexed and nobody else is using the search at the moment, but we get this error when searching (and when trying to index new documents). The documentation says to add more resources, but I don't think we're short on them right now. Restarting the pods results in the same issue. Does anybody have any ideas where to look?
Found this in the logs:
Multi-node with no leader: refusing to reset peers.
j
Looks like the cluster has lost quorum because a majority of the nodes have failed. You want to re-establish the cluster by following these instructions: https://typesense.org/docs/guide/high-availability.html#recovering-a-cluster-that-has-lost-quorum
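For reference, the gist of that recovery procedure is to force one node to run as a single-node cluster so it can elect itself leader, then re-add the others. A minimal sketch of the first step on k8s, assuming the nodes file is mounted from a ConfigMap and the pods belong to a StatefulSet named typesense (all names, namespace, and ports below are placeholders to adjust for your setup):

```yaml
# Quorum recovery, step 1 (sketch): shrink the nodes file to a single node
# so that node can become leader on its own. The nodes file format is
# host:peering_port:api_port, comma-separated when there are multiple nodes.
# Hostname, ConfigMap name, and key are assumptions for illustration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: typesense-nodes
data:
  nodes: "typesense-0.typesense.default.svc.cluster.local:8107:8108"
# Step 2: restart typesense-0 and wait until it reports healthy as a single-node leader.
# Step 3: restore all three entries in the nodes file and bring the other pods back up.
```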
s
Thanks, that did the trick!
But is there a way we can prevent that from happening?
j
In a 3-node cluster, you want to make sure that at least 2 nodes are always active. Typesense uses RAFT for consensus, which can only tolerate a max of 1 node failure in a 3-node cluster and a max of 2 node failures in a 5-node cluster. If the failed nodes recover, the cluster self-heals.
But if more nodes than that fail, the cluster blocks further reads/writes to avoid a split brain, and then requires manual intervention.
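(The underlying RAFT math: a cluster of N nodes needs a quorum of ⌊N/2⌋ + 1 healthy nodes to elect a leader and accept writes, so a 3-node cluster needs 2 healthy nodes and tolerates 1 failure, and a 5-node cluster needs 3 and tolerates 2.)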
s
Yes, I understand that, but I'm more interested in learning why a node would fail in the first place. This happened on our dev/staging environment, which runs on overpowered servers with almost no load and very little data indexed. We are currently going to production with Typesense and I want to absorb as much knowledge as I can while our search cluster grows.
j
Usual failure reasons are running out of memory (OOM), disk, CPU, etc. (the Typesense logs and syslog should have more info around this). With k8s, it usually ends up being because k8s is reallocating resources and shuts down multiple Typesense nodes at the same time.
s
Just checked the logs: 2 nodes were shut down, and it looks like the second was shut down while the node that was restarting was trying to connect to it. Is setting maxUnavailable to 2 (in a 3-node cluster) the solution, or do you have some other advice?
j
Here's a long thread with some general advice on running on k8s: https://github.com/typesense/typesense/issues/465#issuecomment-1689460898
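Not from that thread specifically, but one common k8s-side guardrail is a PodDisruptionBudget, so that voluntary disruptions (node drains, rollouts, autoscaler moves) never evict more than one Typesense pod at a time; in a 3-node cluster that means maxUnavailable of 1, not 2. A sketch, assuming the pods carry the label app: typesense (adjust the selector to your deployment):

```yaml
# PodDisruptionBudget sketch: allow at most one Typesense pod to be
# voluntarily evicted at a time, so the remaining two keep quorum.
# The app: typesense label is an assumption for illustration.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: typesense-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: typesense
```

Note this only covers voluntary evictions; involuntary failures (OOM kills, node crashes) still come down to the resource-level causes mentioned above.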