Hey! We're seeing issues with our Typesense Cloud search cluster | typesense #community-help

Jonatan Svennberg

12/11/2024, 11:01 AM

Hey! We're seeing issues with our Typesense Cloud search cluster being pinned at 100% CPU. It has been for quite some time. Is there any way to 1) see why this happens? 2) reboot the instance or something in an attempt to recover? The CPU spike seems to coincide with a peculiar drop in memory usage.

Jonatan Svennberg

12/11/2024, 11:05 AM

Seems like CPU usage finally came down. Any way to get some insight into what caused it?

Kishore Nallan

12/11/2024, 11:06 AM

Likely a bad query. Ping me your cluster ID and I can check the logs.

Kishore Nallan

12/11/2024, 11:07 AM

Actually this looks more like a restart, check if your node ran out of memory.

Jonatan Svennberg

12/11/2024, 11:08 AM

The memory usage seems to be comfortably below the 8 GB limit. Maybe the spike was short enough to not show up in the charts, tho? 🤷

Kishore Nallan

12/11/2024, 11:10 AM

I just checked and the node indeed seems to have abruptly restarted. And usually we produce a stack trace but in this case, I don't find anything in the logs. Have not seen this before.

Kishore Nallan

12/11/2024, 11:15 AM

Seems to have happened again

Kishore Nallan

12/11/2024, 11:16 AM

We might have to try and get you upgraded to the latest v28 RC build, since it contains some stability fixes. Can we do that now?

Joel Ödlund

12/11/2024, 11:27 AM

@Jonatan Svennberg are we ready to upgrade right now and have some downtime?

Joel Ödlund

12/11/2024, 11:29 AM

Yes, let's do the upgrade as soon as possible.

Kishore Nallan

12/11/2024, 11:33 AM

Can I go ahead?

Jonatan Svennberg

12/11/2024, 11:34 AM

If we're having outages anyway, we might as well!

Kishore Nallan

12/11/2024, 11:38 AM

Ok I've kick started it.

Jonatan Svennberg

12/11/2024, 11:53 AM

Seems like the upgrade is well under way. The new node is exhibiting the same CPU spike on startup tho. I don't really understand why?

Kishore Nallan

12/11/2024, 11:56 AM

Data is indexed at the start.

Jonatan Svennberg

12/11/2024, 11:57 AM

I wouldn't expect it to take this long tho. Maybe there are old versions of the collections lingering that haven't been cleared out since you changed our re-indexing setup @Joel Ödlund? 🤔

Joel Ödlund

12/11/2024, 12:04 PM

There are several collections in there. Since recently, we keep a copy of each collection to allow reindexing without an outage and to enable rollbacks. There are also extra collections for ML etc.

Kishore Nallan

12/11/2024, 12:04 PM

This is how Typesense works. On a restart, the data on disk has to be reindexed into memory. For production use cases, to avoid downtime, we do a rolling rotation of HA clusters.
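Since a restarted node has to rebuild its in-memory index before it can serve searches, a deploy script may want to wait for it to report healthy. The following is a minimal sketch, assuming the node is unreachable or answers `{"ok": false}` on Typesense's `GET /health` endpoint until loading has finished. The helper names `node_is_healthy` and `wait_until_healthy`, the attempt count, and the delay are illustrative and not from the thread.

```python
import json
import time
import urllib.error
import urllib.request


def node_is_healthy(base_url, timeout=2.0):
    """Return True if GET {base_url}/health answers with {"ok": true}."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            body = json.load(resp)
            return isinstance(body, dict) and body.get("ok") is True
    except (urllib.error.URLError, OSError, ValueError):
        # Unreachable, non-2xx status, timeout, or invalid JSON: treat as not ready.
        return False


def wait_until_healthy(check, attempts=30, delay_seconds=10.0, sleep=time.sleep):
    """Call check() up to `attempts` times, sleeping between failed tries.

    Returns the 1-based attempt number that succeeded, or None if none did.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay_seconds)
    return None
```

A caller would pass something like `lambda: node_is_healthy("https://<your-node-hostname>")` as `check`. Keeping the check injectable makes the retry loop easy to exercise without a live cluster.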

Joel Ödlund

12/11/2024, 12:09 PM

I did not expect all data to be reindexed at start, though. Would it not make sense to have a disk representation as well, for quicker recoveries? Especially since reindexing is consuming OpenAI credits.

Kishore Nallan

12/11/2024, 12:10 PM

Already generated vectors (once the server has done a snapshot) will not require OpenAI calls.

Kishore Nallan

12/11/2024, 12:10 PM

Snapshot runs every hour.

Kishore Nallan

12/11/2024, 12:10 PM

And before an upgrade / rotation.

Jonatan Svennberg

12/11/2024, 12:38 PM

@Kishore Nallan will an HA cluster upgrade cause downtime?

Kishore Nallan

12/11/2024, 1:05 PM

No, nodes are upgraded in a rolling fashion

Jonatan Svennberg

12/11/2024, 1:06 PM

Even when going from 1 -> 3 nodes?

Kishore Nallan

12/11/2024, 3:38 PM

If you use the load-balanced endpoint before starting the config change, then there won't be downtime. Otherwise, if you're using the -1 hostname, there will be downtime at the end, since that node is also rotated when enabling HA.
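To illustrate the difference between the load-balanced endpoint and the `-1` hostname, here is a minimal sketch of a client configuration that points at both. It assumes the usual Typesense client config shape (`api_key`, `nearest_node`, `nodes`) and assumes hostnames of the form `<cluster-id>.a1.typesense.net` (load balanced) and `<cluster-id>-N.a1.typesense.net` (individual nodes). The cluster ID, domain, and the `build_client_config` helper are hypothetical; use the hostnames shown in your own cluster dashboard.

```python
def build_client_config(cluster_id, api_key, node_count=3, domain="a1.typesense.net"):
    """Build a config dict in the shape the Typesense client libraries accept.

    Hostname patterns here are assumptions for illustration only.
    """

    def node(host):
        return {"host": host, "port": "443", "protocol": "https"}

    return {
        "api_key": api_key,
        # Load-balanced endpoint: tried first, keeps working while nodes rotate.
        "nearest_node": node(f"{cluster_id}.{domain}"),
        # Individual nodes (-1, -2, ...): used as fallbacks by the client.
        "nodes": [node(f"{cluster_id}-{i}.{domain}") for i in range(1, node_count + 1)],
        "connection_timeout_seconds": 2,
    }


config = build_client_config("xyz123", "search-only-key")
```

With the Python client installed, the resulting dict could be passed to `typesense.Client(config)`. A client configured with only the `-1` hostname has no fallback while that node is being rotated, which matches the downtime described above.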
