# community-help
v
Hi team, need some help figuring out what's happening with our Typesense Cloud cluster. I am noticing that one of our nodes is going down, and CPU and RAM usage is spiking to 100%, causing errors. This is the second time it has happened in two days. From our metrics we didn't see any spike in requests or anything like that. From the screenshots below you can see that there is an interval where metrics for one of the nodes are not available — did it go down at that time? Any help debugging this is appreciated, as this cluster is serving production traffic. Thanks!
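For anyone debugging similar symptoms: a minimal sketch of polling a node's `/health` and `/metrics.json` endpoints (both are documented Typesense API endpoints; the hostname and API key below are placeholders):

```python
import requests

# Placeholders: substitute your own cluster hostname and admin API key.
NODE_URL = "https://xxx.a1.typesense.net"
HEADERS = {"X-TYPESENSE-API-KEY": "YOUR_ADMIN_API_KEY"}

# /health reports whether the node is up; /metrics.json exposes CPU and
# memory usage figures you can scrape into your own monitoring.
health = requests.get(f"{NODE_URL}/health", headers=HEADERS, timeout=5)
print("health:", health.json())  # e.g. {"ok": true}

metrics = requests.get(f"{NODE_URL}/metrics.json", headers=HEADERS, timeout=5)
for key, value in metrics.json().items():
    if "cpu" in key or "memory" in key:
        print(key, "=", value)
```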
k
Please share cluster ID in DM
Looks like a period of heavy queries caused a high CPU spike. It has recovered now and looks stable.
Trying to see if we can identify which queries.
v
Okay, that would be very helpful. According to our metrics at least, the traffic at that time was above average, but we have hit that level of traffic multiple times before without anything going down.
k
Actually, on closer inspection, that heavy query spiked the RAM, causing an OOM and a process restart.
In v28.0 GA we have added some code to log queries that cause spikes like this (provided the spike lasts for at least a few seconds). But this isn't available on the version this cluster is on, so there is nothing in the logs.
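Relatedly, Typesense also exposes a slow-request log that can be toggled at runtime through the `/config` endpoint; a short sketch (the 2000 ms threshold is illustrative, and this logs slow requests generally — it is not the v28 spike-logging mentioned above):

```python
import requests

NODE_URL = "https://xxx.a1.typesense.net"  # placeholder cluster hostname
HEADERS = {"X-TYPESENSE-API-KEY": "YOUR_ADMIN_API_KEY"}

# Log any request that takes longer than 2 seconds to the Typesense log.
# The threshold is illustrative; set it to -1 to disable logging again.
resp = requests.post(
    f"{NODE_URL}/config",
    headers=HEADERS,
    json={"log-slow-requests-time-ms": 2000},
)
print(resp.json())  # expect {"success": true}
```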
v
Okay, we will update the cluster version then and let you know when it happens again. cc: @Nishant Khurana
👍 1
Hi @Kishore Nallan, we are seeing this again. It looks like the nodes simultaneously went down and got back up again a couple of minutes later. This is the second time it happened today
k
I'll check and get back to you
v
Hi Kishore, any updates?
k
Please post the timestamp range in which this happened in UTC
v
It happened 6:30 PM IST yesterday, so that's 1:00 PM UTC yesterday (April 1st)
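(IST is UTC+5:30, so 18:30 IST is 13:00 UTC. A quick check with Python's standard zoneinfo, with the year assumed for illustration:)

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# 6:30 PM IST on April 1st (year assumed); IST is UTC+5:30.
ist = datetime(2025, 4, 1, 18, 30, tzinfo=ZoneInfo("Asia/Kolkata"))
print(ist.astimezone(ZoneInfo("UTC")))  # 2025-04-01 13:00:00+00:00
```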
k
It was a sudden OOM and nodes restarted because of that. Because of the sudden spike and probably because the queries didn't even finish, no bad queries were logged.
In a recent v29 RC build we have made a huge improvement to the way group_by is done.
That change will keep memory bounded, and should probably help here.
We had to make this change because group-by was simply too memory-hungry at times. The side-effect of the new approach is that we can no longer return the exact value of `found` (the total count of docs across all the unique values) when `group_by` is used. So, we return an approximate value using an approximation algorithm. This algorithm uses constant memory to approximate the `found` value within ±2% of the actual value.
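The thread doesn't name the algorithm, but constant-memory cardinality estimation with a small bounded error is the territory of sketches like HyperLogLog. A toy illustration of the idea — not Typesense's actual implementation:

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog: estimates the number of distinct values seen
    using a constant 2**p bytes of state, regardless of input size."""

    def __init__(self, p: int = 14):
        self.p = p                       # 2**14 = 16384 registers -> ~0.8% std error
        self.m = 1 << p
        self.registers = bytearray(self.m)

    def add(self, value: str) -> None:
        # 64-bit hash of the value (e.g. a group key).
        h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                   # top p bits pick a register
        w = h & ((1 << (64 - self.p)) - 1)         # remaining 64-p bits
        rank = (64 - self.p) - w.bit_length() + 1  # leading zeros + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        e = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if e <= 2.5 * self.m and zeros:            # small-range correction
            e = self.m * math.log(self.m / zeros)
        return e


hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"group-{i}")
print(round(hll.estimate()))  # close to 100000, typically within a percent or two
```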
v
Okay, thank you. So your hypothesis is that it went OOM because of some query with a group_by, right? Until we move to v29, is there anything we can do to prevent it from happening on v28?
k
Yes, I suspect a bad query blowing up memory. You'd have to use the RC build, if that works for you.
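One stopgap worth considering until the upgrade — a suggestion of this write-up rather than something prescribed in the thread — is to cap query runtime with the documented `search_cutoff_ms` search parameter, so a pathological `group_by` query returns partial results instead of running unbounded (this bounds time, and may or may not curb a memory spike in every case). The collection and field names below are placeholders:

```python
import typesense  # official Typesense Python client

client = typesense.Client({
    "nodes": [{"host": "xxx.a1.typesense.net", "port": 443, "protocol": "https"}],
    "api_key": "YOUR_SEARCH_API_KEY",  # placeholder
    "connection_timeout_seconds": 5,
})

# search_cutoff_ms asks the server to stop searching after the given budget
# and return whatever it has found so far (results are marked as partial).
results = client.collections["products"].documents.search({
    "q": "shoe",
    "query_by": "name",
    "group_by": "brand",
    "search_cutoff_ms": 500,  # illustrative budget
})
print(results.get("search_cutoff"), results.get("found"))
```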