# community-help
s
New issue happening today in v28. New collection in an HA cluster. When we create a scoped API key, it isn't getting replicated to all nodes. I can use /keys/{id} to check each node individually, and my key is missing from node 2 of 3. Any idea what is going on and how we can prevent this for users?
When I do /keys/ to get a list, I see a couple of scoped search keys have been created. But the last 15 that were successfully created on nodes 1 and 3 aren't in the list for node 2.
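To spot this kind of drift systematically, the per-node key lists can be diffed. A minimal sketch (the node names and key IDs below are hypothetical; in practice each set would come from a `GET /keys` request sent to each node hostname directly, bypassing the load balancer):

```python
def find_missing_keys(keys_per_node):
    """Given {node: set of key IDs}, return {node: IDs that exist on
    other nodes but are missing on this one}."""
    all_ids = set().union(*keys_per_node.values())
    return {
        node: sorted(all_ids - ids)
        for node, ids in keys_per_node.items()
        if all_ids - ids
    }

# Hypothetical snapshot: node2 never received keys 14 and 15.
snapshot = {
    "node1": {1, 2, 14, 15},
    "node2": {1, 2},
    "node3": {1, 2, 14, 15},
}
print(find_missing_keys(snapshot))  # {'node2': [14, 15]}
```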
Using the /stats.json endpoint, we see that pending writes are stuck at 1606. Could this be blocking even API keys from syncing to this node?
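One way to turn this observation into an automated check: sample the stats endpoint on each node twice, a few minutes apart, and flag a node whose pending-write count is nonzero and not draining. A sketch (the field name `pending_write_batches` is an assumption; check what your `/stats.json` response actually contains):

```python
def write_queue_stuck(earlier_stats, later_stats, field="pending_write_batches"):
    """Return True if pending writes are nonzero and not draining
    between two samples of a node's stats payload."""
    before = earlier_stats.get(field, 0)
    after = later_stats.get(field, 0)
    return after > 0 and after >= before

# Stuck: the count holds at 1606 across samples.
print(write_queue_stuck({"pending_write_batches": 1606},
                        {"pending_write_batches": 1606}))  # True
# Healthy: the queue drained to zero.
print(write_queue_stuck({"pending_write_batches": 1606},
                        {"pending_write_batches": 0}))     # False
```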
When I look at our admin dashboard, you can see that at 1:40 EDT I started the full update of our collection. This took 23 minutes to run. But the lime green node never reaches the same RAM usage as the other 2, and these 1606 pending writes climbed during that window and have remained. Our CPU never spiked above 50%-60% and we have plenty of RAM still. So this doesn't seem like it was starved for resources.
In searching for workarounds, this article appears to recommend including individual node hostnames as a fallback for the frontend client to use, if the load balancer hostname becomes unresponsive: https://typesense.org/docs/guide/high-availability.html#when-using-typesense-cloud-or-a-load-balancer Am I reading that correctly? Even if we do this, how do we purge and rebuild a bad node if we get in this state?
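For reference, the fallback the linked article describes looks roughly like this in the Typesense Python client's configuration shape: the load balancer hostname goes in `nearest_node`, and the individual node hostnames go in `nodes` so the client can fail over to them. Hostnames and the API key below are placeholders:

```python
# Sketch of a client configuration with per-node fallbacks.
# All hostnames and the key are placeholders -- substitute your own.
client_config = {
    # Load balancer endpoint, tried first.
    "nearest_node": {"host": "xxx.a1.typesense.net", "port": 443, "protocol": "https"},
    # Individual nodes, used as fallbacks if the nearest node is unresponsive.
    "nodes": [
        {"host": "xxx-1.a1.typesense.net", "port": 443, "protocol": "https"},
        {"host": "xxx-2.a1.typesense.net", "port": 443, "protocol": "https"},
        {"host": "xxx-3.a1.typesense.net", "port": 443, "protocol": "https"},
    ],
    "api_key": "SEARCH_ONLY_KEY",  # placeholder
    "connection_timeout_seconds": 2,
}
```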
Some further detail: for search keys not synced to the bad node, I get 401 Unauthorized. If I use a key that has already been synced to that node, I just don't get any response. This is true even of our master admin key: I can use it to hit /keys/, but if I try to query, Postman hangs indefinitely. So we are certain we have a node in a bad state. But I'm at a loss on how to proceed.
I'm also noticing a pattern in CPU usage. Even though we don't have any users hitting this yet, every 5 minutes we have some activity. Is this the cluster trying to recover maybe? This is our production environment, but we haven't turned it on for users yet as we've been working through performance issues and bugs.
j
> Using the /stats.json endpoint, we see that pending writes are stuck at 1606. Could this be blocking even API keys from syncing to this node?
This is the core of the issue. Somehow there seems to be some bad write that got into the write queue on that node which is stalling the rest of the writes. Historically this happens when we have a missing validation for some shape of malformed data and end up accepting the write instead of throwing an error up front. So we need to take a look at the raw data on disk to see what's stalling the write queue. Would you be ok if we took a snapshot of the data on disk and loaded it into a debug cluster to take a closer look?
s
@Jason Bosco yes, that should be fine. That’s a cluster internal to the Typesense team that is dropped when you’re done?
j
That's correct
For now, though, we were able to clear that node out, and when it resynced data from the other nodes, it ended up syncing fine
s
Ok. Are you able to look into what happened still? What should our plan be if something like this occurs again?
j
Yeah, we'll look into it still
In the meantime, if it happens again, a rolling restart should fix the issue as well: https://cloud-help-center.typesense.org/article/33-restart-typesense-cloud-cluster
s
Excellent, thank you.
@Jason Bosco We had this happen again where a node's data was out of sync. But it took a long time to detect because not many documents were impacted. We found it because an end user complained of missing data. I'm not even sure when it began. Is there some way to detect that this has occurred, so we know a rolling restart is needed?
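One way to catch this kind of silent divergence is to compare per-collection document counts across nodes (each node's `GET /collections` response, queried via its individual hostname, includes a `num_documents` field per collection). A hedged sketch with hypothetical node names and counts; note that counts naturally differ for a moment during active indexing, so compare while writes are quiet or alert only when a gap persists across samples:

```python
def divergent_collections(counts_per_node):
    """Given {node: {collection: num_documents}}, return the collections
    whose document counts disagree across nodes, with per-node counts."""
    collections = set()
    for counts in counts_per_node.values():
        collections.update(counts)
    return {
        c: {node: counts.get(c) for node, counts in counts_per_node.items()}
        for c in collections
        if len({counts.get(c) for counts in counts_per_node.values()}) > 1
    }

# Hypothetical snapshot: node2 is quietly missing 600 documents.
sample = {
    "node1": {"products": 120000},
    "node2": {"products": 119400},
    "node3": {"products": 120000},
}
print(divergent_collections(sample))
# {'products': {'node1': 120000, 'node2': 119400, 'node3': 120000}}
```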