# community-help
s
Hello! We had an issue yesterday on our production server where we were getting 503 errors. Typesense was responding with error 503 (Not ready or lagging) for approximately 3 minutes (from 2025-07-23 14:21:23.933000 UTC to 2025-07-23 14:24:45.066000 UTC). We saw issues with creating API keys, making search queries, and writing updates.
1. Did you see any logs or issues with our [yuke...] server during that time?
2. Are you able to help us identify a root cause for the issue?
3. Is there any logging we could access or troubleshooting we could have done?
Hi Typesense team, just following up on this.
k
Hi Scott. We do see a restart of the Typesense server at that point in time. However, we don't have a clear stack trace showing what caused this restart. Very rarely, a process restart can happen suddenly without a stack trace (e.g. memory corruption that makes capturing a stack trace impossible). We will have to keep a lookout, and if it happens again, we will have to add more logging to try and capture the state.
s
@Kishore Nallan would we have been protected if we were in HA mode with several nodes? What is the failover threshold?
k
Yes, with HA it's unlikely that an obscure error like this will take out multiple nodes. Client will automatically fail over.
s
We need to build the client to fail over? I thought you handled routing. I remember @Jason Bosco once explained that the target host for a client is set based on the client DNS. Will your failover logic move them to a new host as soon as they get a 503? Or after a certain number of failures? Will it move all traffic off right away (shift all clients to a new host as soon as one encounters a problem?) or only point a user to a new host if they get an error? Asked another way: many of our users will have their DNS routed to host A. If one user gets a 503 and needs to be redirected, do you proactively shift all other users you have mapped to host A? Would you automatically spin up a new node and kill the old one, or do you give it some time to start responding again?
k
If your client is using the load balanced DNS this automatically happens. We recommend configuring clients using both load balanced DNS and the individual host names. The official clients can also quarantine bad hosts if individual hosts are used.
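For reference, a minimal sketch of what "both load balanced DNS and the individual host names" could look like as a configuration dict for the official Python client. The hostnames and API key are placeholders, `build_client_config` is a hypothetical helper, and the timeout and retry values are illustrative guesses rather than recommended settings. The key names (`nearest_node`, `nodes`, and so on) follow the Python client's documented configuration as far as I know, so please check them against the client docs for your version.

```python
def build_client_config(api_key, lb_host, node_hosts, port="443", protocol="https"):
    """Build a config dict in the shape passed to typesense.Client(...).

    build_client_config is a hypothetical helper for this sketch only.
    """
    def node(host):
        return {"host": host, "port": port, "protocol": protocol}

    return {
        "api_key": api_key,
        # Load balanced hostname, intended to be tried first.
        "nearest_node": node(lb_host),
        # Individual hostnames, used as fallbacks.
        "nodes": [node(h) for h in node_hosts],
        # Illustrative values, not recommendations.
        "connection_timeout_seconds": 2,
        "num_retries": 3,
        "retry_interval_seconds": 0.1,
        "healthcheck_interval_seconds": 60,
    }


# Placeholder hostnames; substitute the ones shown for your own cluster.
config = build_client_config(
    api_key="YOUR_API_KEY",
    lb_host="xxx.example.net",
    node_hosts=["xxx-1.example.net", "xxx-2.example.net", "xxx-3.example.net"],
)

# With the typesense package installed, this dict would be used as:
#   import typesense
#   client = typesense.Client(config)
```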
s
What are the failover thresholds for the load balanced DNS hostname?
j
30s. During that window, the client libraries use the fallback individual hostnames in a round-robin fashion.
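To make the behaviour described above concrete, here is a small standalone sketch of "try the load balanced host first, fall back to individual hosts round-robin, and quarantine hosts that fail". This is not the official client's implementation. `FailoverSketch`, `send`, and the host names are all hypothetical, and the 30 second quarantine is only borrowed from the number mentioned in this thread as an illustrative default.

```python
import time


class FailoverSketch:
    """Illustrative failover logic only; not how any official client is written."""

    def __init__(self, lb_host, node_hosts, quarantine_seconds=30.0, clock=time.monotonic):
        self.lb_host = lb_host
        self.node_hosts = list(node_hosts)
        self.quarantine_seconds = quarantine_seconds
        self.clock = clock
        self.unhealthy_until = {}  # host -> time when it may be retried
        self.next_index = 0        # round-robin cursor over node_hosts

    def _is_healthy(self, host):
        return self.clock() >= self.unhealthy_until.get(host, float("-inf"))

    def _candidates(self):
        # Prefer the load balanced host, then individual hosts in round-robin order.
        if self._is_healthy(self.lb_host):
            yield self.lb_host
        count = len(self.node_hosts)
        start = self.next_index
        for offset in range(count):
            host = self.node_hosts[(start + offset) % count]
            if self._is_healthy(host):
                yield host

    def request(self, send):
        """Call send(host) on each candidate until one succeeds."""
        last_error = None
        for host in self._candidates():
            try:
                result = send(host)
            except Exception as exc:
                # Quarantine the failing host and move on to the next one.
                self.unhealthy_until[host] = self.clock() + self.quarantine_seconds
                last_error = exc
                continue
            if host in self.node_hosts:
                # Advance the cursor so the next fallback starts at the following node.
                self.next_index = (self.node_hosts.index(host) + 1) % len(self.node_hosts)
            return host, result
        raise RuntimeError("all hosts unavailable") from last_error


# Usage with a fake transport in which the load balanced host returns errors.
def fake_send(host):
    if host == "lb.example.net":
        raise ConnectionError("503 Not ready or lagging")
    return "ok"


failover = FailoverSketch("lb.example.net", ["n1.example.net", "n2.example.net"])
served_by, _ = failover.request(fake_send)
```

Note that in this sketch only the client that actually hits an error changes hosts; other clients are unaffected until they see a failure themselves.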
s
@Jason Bosco We had another incident like this yesterday, from about 3:10 to 3:13 PM EDT. Very similar symptoms: CPU peaked at 100% and was at 90%+ during the window, search was unresponsive, and we got 503 (Not ready or lagging) errors. We log all pubsub messages to track writes to Typesense, and there was no unusual spike in incoming data. But we did see a small bump in pending writes at the start of the incident. Can you see anything that was happening on our cluster during this window? I've upgraded our CPU for now as a mitigation in case we're CPU bound for some reason. Customers notice as soon as search is down, so this is becoming an issue. What can we do to track down what is causing this?
j
This time a stack trace was logged thankfully. I've filed an internal bug report for this. @Alan Martini will keep you posted on progress as we dig into it. In the meantime, if uptime is critical I would recommend enabling HA on this cluster.