# community-help
s
New issue happening today in v28. New collection in an HA cluster. When we create a scoped API key, it isn't getting replicated to all nodes. I can use /keys/{id} to check each node individually, and my key is missing from node 2 of 3. Any idea what is going on and how we can prevent this for users?
When I do /keys/ to get a list, I see a couple of scoped search keys have been created. But the last 15 that were successfully created on nodes 1 and 3 aren't in the list for node 2.
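To spot this kind of drift systematically, the per-node key lists can be diffed. A minimal sketch (the node names and key IDs below are hypothetical; in practice each set would come from a `GET /keys` request sent to each node hostname directly, bypassing the load balancer):

```python
def find_missing_keys(keys_per_node):
    """Given {node: set of key IDs}, return {node: IDs that exist on
    other nodes but are missing on this one}."""
    all_ids = set().union(*keys_per_node.values())
    return {
        node: sorted(all_ids - ids)
        for node, ids in keys_per_node.items()
        if all_ids - ids
    }

# Hypothetical snapshot: node2 never received keys 14 and 15.
snapshot = {
    "node1": {1, 2, 14, 15},
    "node2": {1, 2},
    "node3": {1, 2, 14, 15},
}
print(find_missing_keys(snapshot))  # {'node2': [14, 15]}
```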
Using the /stats.json endpoint, we see that pending writes are stuck at 1606. Could this be blocking even API keys from syncing to this node?
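One way to turn this observation into an automated check: sample the stats endpoint on each node twice, a few minutes apart, and flag a node whose pending-write count is nonzero and not draining. A sketch (the field name `pending_write_batches` is an assumption; check what your `/stats.json` response actually contains):

```python
def write_queue_stuck(earlier_stats, later_stats, field="pending_write_batches"):
    """Return True if pending writes are nonzero and not draining
    between two samples of a node's stats payload."""
    before = earlier_stats.get(field, 0)
    after = later_stats.get(field, 0)
    return after > 0 and after >= before

# Stuck: the count holds at 1606 across samples.
print(write_queue_stuck({"pending_write_batches": 1606},
                        {"pending_write_batches": 1606}))  # True
# Healthy: the queue drained to zero.
print(write_queue_stuck({"pending_write_batches": 1606},
                        {"pending_write_batches": 0}))     # False
```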
When I look at our admin dashboard, you can see that at 1:40 EDT I started the full update of our collection. This took 23 minutes to run. But the lime green node never reaches the same RAM usage as the other 2, and these 1606 pending writes climbed during that window and have remained. Our CPU never spiked above 50%-60% and we have plenty of RAM still. So this doesn't seem like it was starved for resources.
In searching for workarounds, this article appears to recommend including individual node hostnames as a fallback for the frontend client to use, if the load balancer hostname becomes unresponsive: https://typesense.org/docs/guide/high-availability.html#when-using-typesense-cloud-or-a-load-balancer Am I reading that correctly? Even if we do this, how do we purge and rebuild a bad node if we get in this state?
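For reference, the fallback the linked article describes looks roughly like this in the Typesense Python client's configuration shape: the load balancer hostname goes in `nearest_node`, and the individual node hostnames go in `nodes` so the client can fail over to them. Hostnames and the API key below are placeholders:

```python
# Sketch of a client configuration with per-node fallbacks.
# All hostnames and the key are placeholders -- substitute your own.
client_config = {
    # Load balancer endpoint, tried first.
    "nearest_node": {"host": "xxx.a1.typesense.net", "port": 443, "protocol": "https"},
    # Individual nodes, used as fallbacks if the nearest node is unresponsive.
    "nodes": [
        {"host": "xxx-1.a1.typesense.net", "port": 443, "protocol": "https"},
        {"host": "xxx-2.a1.typesense.net", "port": 443, "protocol": "https"},
        {"host": "xxx-3.a1.typesense.net", "port": 443, "protocol": "https"},
    ],
    "api_key": "SEARCH_ONLY_KEY",  # placeholder
    "connection_timeout_seconds": 2,
}
```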
Some further detail: for search keys not synced to the bad node, I get 401 Unauthorized. If I use a key that has already been synced to that node, I just don't get any response. This is true even of our master admin key: I can use it to hit /keys/, but if I try to query, Postman hangs indefinitely. So we are certain we have a node in a bad state. But I'm at a loss on how to proceed.
I'm also noticing a pattern in CPU usage. Even though we don't have any users hitting this yet, every 5 minutes we have some activity. Is this the cluster trying to recover maybe? This is our production environment, but we haven't turned it on for users yet as we've been working through performance issues and bugs.
j
> Using the /stats.json endpoint, we see that pending writes are stuck at 1606. Could this be blocking even API keys from syncing to this node?
This is the core of the issue. Somehow there seems to be some bad write that got into the write queue on that node which is stalling the rest of the writes. Historically this happens when we have a missing validation for some shape of malformed data and end up accepting the write instead of throwing an error up front. So we need to take a look at the raw data on disk to see what's stalling the write queue. Would you be ok if we took a snapshot of the data on disk and loaded it into a debug cluster to take a closer look?
s
@Jason Bosco yes, that should be fine. That’s a cluster internal to the Typesense team that is dropped when you’re done?
j
That's correct
For now, though, we were able to clear that node out, and when it resynced data from the other nodes, it ended up syncing fine
s
Ok. Are you able to look into what happened still? What should our plan be if something like this occurs again?
j
Yeah, we'll look into it still
In the meantime, if it happens again, a rolling restart should fix the issue as well: https://cloud-help-center.typesense.org/article/33-restart-typesense-cloud-cluster
s
Excellent, thank you.
@Jason Bosco We had this happen again where a node's data was out of sync. But it took a long time to detect because not many documents were impacted. We found it because an end user complained of missing data. I'm not even sure when it began. Is there some way to detect that this has occurred, so we know a rolling restart is needed?
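One way to catch this kind of silent divergence is to compare per-collection document counts across nodes (each node's `GET /collections` response, queried via its individual hostname, includes a `num_documents` field per collection). A hedged sketch with hypothetical node names and counts; note that counts naturally differ for a moment during active indexing, so compare while writes are quiet or alert only when a gap persists across samples:

```python
def divergent_collections(counts_per_node):
    """Given {node: {collection: num_documents}}, return the collections
    whose document counts disagree across nodes, with per-node counts."""
    collections = set()
    for counts in counts_per_node.values():
        collections.update(counts)
    return {
        c: {node: counts.get(c) for node, counts in counts_per_node.items()}
        for c in collections
        if len({counts.get(c) for counts in counts_per_node.values()}) > 1
    }

# Hypothetical snapshot: node2 is quietly missing 600 documents.
sample = {
    "node1": {"products": 120000},
    "node2": {"products": 119400},
    "node3": {"products": 120000},
}
print(divergent_collections(sample))
# {'products': {'node1': 120000, 'node2': 119400, 'node3': 120000}}
```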