Hey! We're seeing issues with our Typesense Cloud ...
# community-help
j
Hey! We're seeing issues with our Typesense Cloud search cluster being pinned at 100% CPU. It has been for quite some time. Is there any way to 1) see why this happens? 2) reboot the instance or something in an attempt to recover? The CPU spike seems to coincide with a peculiar drop in memory usage.
Seems like CPU usage finally came down. Any way to get some insight into what caused it?
k
Likely a bad query. Ping me your cluster ID and I can check the logs.
Actually this looks more like a restart, check if your node ran out of memory.
j
The memory usage seems to be comfortably below the 8 GB limit. Maybe the spike was short enough to not show in the charts tho? 🤷
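For anyone wanting to cross-check the dashboard charts, here is a minimal sketch that reads a node's `/metrics.json` endpoint and computes CPU and memory usage. The host and API key are placeholders, and the field names are the ones I'd expect from the Typesense metrics API (values arrive as strings), so treat them as assumptions and verify against your server version. A short-lived spike can still slip between polls.

```python
import json
import urllib.request


def fetch_metrics(host: str, api_key: str) -> dict:
    """Fetch raw metrics from a Typesense node (host and key are placeholders)."""
    req = urllib.request.Request(
        f"https://{host}/metrics.json",
        headers={"X-TYPESENSE-API-KEY": api_key},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)


def summarize(metrics: dict) -> dict:
    """Reduce the metrics payload to CPU percent and fraction of memory used.

    Assumes the payload carries these three keys as numeric strings.
    """
    cpu = float(metrics["system_cpu_active_percentage"])
    used = int(metrics["system_memory_used_bytes"])
    total = int(metrics["system_memory_total_bytes"])
    return {"cpu_percent": cpu, "memory_used_fraction": used / total}
```

Calling `summarize(fetch_metrics("xyz-1.a1.typesense.net", "YOUR_ADMIN_KEY"))` every few seconds and logging the result would give finer resolution than an averaged chart.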
k
I just checked and the node indeed seems to have abruptly restarted. And usually we produce a stack trace but in this case, I don't find anything in the logs. Have not seen this before.
Seems to have happened again
We might have to try and get you upgraded to the latest v28 RC build since it contains some stability fixes. Can we do that now?
j
@Jonatan Svennberg are we ready to upgrade right now and have some downtime?
Yes, let's do the upgrade as soon as possible.
k
Can I go ahead?
j
If we're having outages anyway, we might as well!
k
Ok, I've kick-started it.
j
Seems like the upgrade is well under way. The new node is exhibiting the same CPU spike on startup tho. I don't really understand why?
k
Data is indexed at the start.
j
I wouldn't expect it to take this long tho. Maybe there are old versions of the collections lingering that haven't been cleared out since you changed our re-indexing setup @Joel Ödlund? 🤔
j
There are several collections in there. Since recently, we keep a copy of the collection to allow reindexing without outage, and to enable rollbacks. There are also extra collections for ML etc.
k
This is how Typesense works. On a restart, the data on disk has to be reindexed in memory. For production use cases, to avoid downtime, we do a rolling rotation of HA clusters.
j
I did not expect all data to be reindexed at start though. Would it not make sense to have a disk representation as well for quicker recoveries? Especially since reindexing is consuming OpenAI credits.
k
Already generated vectors (once the server has done a snapshot) will not require OpenAI calls.
Snapshot runs every hour.
And before an upgrade / rotation.
j
@Kishore Nallan will a HA cluster upgrade cause downtime?
k
No, nodes are upgraded in a rolling fashion
j
Even when going from 1 -> 3 nodes?
k
If you use the load balanced endpoint before starting the config change then there won’t be downtime. Otherwise, if you're using the -1 hostname, there will be downtime at the end since that node is also rotated when enabling HA.
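To make the endpoint distinction concrete, here is a rough sketch of a client configuration that puts the load balanced endpoint first and lists the individual node hostnames as fallbacks. The cluster ID `xyz`, the `a1.typesense.net` domain, and the helper name are hypothetical placeholders; the dict shape follows what the official `typesense` Python client accepts, as far as I know.

```python
def cloud_client_config(cluster_id: str, api_key: str,
                        node_count: int = 3,
                        domain: str = "a1.typesense.net") -> dict:
    """Build a client config dict for an HA cluster (hypothetical helper).

    The load balanced hostname (no numeric suffix) goes in `nearest_node`;
    the per-node hostnames (-1, -2, -3) go in `nodes` as fallbacks.
    """
    nodes = [
        {"host": f"{cluster_id}-{i}.{domain}", "port": 443, "protocol": "https"}
        for i in range(1, node_count + 1)
    ]
    return {
        "nearest_node": {
            "host": f"{cluster_id}.{domain}",
            "port": 443,
            "protocol": "https",
        },
        "nodes": nodes,
        "api_key": api_key,
        "connection_timeout_seconds": 2,
    }


# Usage with the official client (requires `pip install typesense`):
# import typesense
# client = typesense.Client(cloud_client_config("xyz", "YOUR_SEARCH_KEY"))
```

With a config like this in place before the 1 -> 3 change, requests should not depend on the -1 node alone while it is being rotated.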