# community-help
v
Hi team, need some help figuring out what's happening with our Typesense Cloud cluster. I am noticing that one of our nodes is going down, and CPU and RAM usage is spiking to 100%, causing errors. This is the second time it has happened in two days. From our metrics we didn't see any spike in requests or anything like that. From the screenshots below you can see that there is an interval where metrics for one of the nodes are not available — did it go down at that time? Any help debugging this is appreciated, as this cluster is serving production traffic. Thanks!
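For anyone debugging similar symptoms: a minimal sketch of polling a node's `/health` and `/metrics.json` endpoints (both are documented Typesense API endpoints; the hostname and API key below are placeholders):

```python
import requests

# Placeholders: substitute your own cluster hostname and admin API key.
NODE_URL = "https://xxx.a1.typesense.net"
HEADERS = {"X-TYPESENSE-API-KEY": "YOUR_ADMIN_API_KEY"}

# /health reports whether the node is up; /metrics.json exposes CPU and
# memory usage figures you can scrape into your own monitoring.
health = requests.get(f"{NODE_URL}/health", headers=HEADERS, timeout=5)
print("health:", health.json())  # e.g. {"ok": true}

metrics = requests.get(f"{NODE_URL}/metrics.json", headers=HEADERS, timeout=5)
for key, value in metrics.json().items():
    if "cpu" in key or "memory" in key:
        print(key, "=", value)
```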
k
Please share cluster ID in DM
Looks like a period of heavy queries caused a high CPU spike. It has recovered now and looks stable.
Trying to see if we can identify which queries.
v
Okay, that would be very helpful. According to our metrics at least, the traffic at that time was above average, but we have hit that level of traffic multiple times before without anything going down.
k
Actually, on closer inspection, that heavy query spiked the RAM, causing an OOM and a process restart.
In v28.0 GA we have added some code to log queries that cause spikes like this (provided the spike lasts for at least a few seconds). But this isn't available on the version this cluster is on, so there is nothing in the logs.
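Relatedly, Typesense also exposes a slow-request log that can be toggled at runtime through the `/config` endpoint; a short sketch (the 2000 ms threshold is illustrative, and this logs slow requests generally — it is not the v28 spike-logging mentioned above):

```python
import requests

NODE_URL = "https://xxx.a1.typesense.net"  # placeholder cluster hostname
HEADERS = {"X-TYPESENSE-API-KEY": "YOUR_ADMIN_API_KEY"}

# Log any request that takes longer than 2 seconds to the Typesense log.
# The threshold is illustrative; set it to -1 to disable logging again.
resp = requests.post(
    f"{NODE_URL}/config",
    headers=HEADERS,
    json={"log-slow-requests-time-ms": 2000},
)
print(resp.json())  # expect {"success": true}
```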
v
Okay, we will update the cluster version then and let you know when it happens again. cc: @Nishant Khurana
👍 1
Hi @Kishore Nallan, we are seeing this again. It looks like the nodes simultaneously went down and got back up again a couple of minutes later. This is the second time it happened today
k
I'll check and get back to you
v
Hi Kishore, any updates?
k
Please post the timestamp range in which this happened in UTC
v
It happened 6:30 PM IST yesterday, so that's 1:00 PM UTC yesterday (April 1st)
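(IST is UTC+5:30, so 18:30 IST is 13:00 UTC. A quick check with Python's standard zoneinfo, with the year assumed for illustration:)

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# 6:30 PM IST on April 1st (year assumed); IST is UTC+5:30.
ist = datetime(2025, 4, 1, 18, 30, tzinfo=ZoneInfo("Asia/Kolkata"))
print(ist.astimezone(ZoneInfo("UTC")))  # 2025-04-01 13:00:00+00:00
```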
k
It was a sudden OOM and nodes restarted because of that. Because of the sudden spike and probably because the queries didn't even finish, no bad queries were logged.
In a recent v29 RC build we have made a huge improvement to the way group_by is done.
That change will keep memory bounded, and should probably help here.
We had to make this change because group-by was simply too memory-hungry at times. The side-effect of the new approach is that we can no longer return the exact value of `found` (the total count of docs across all the unique values) when `group_by` is used. So, we return an approximate value using an approximation algorithm. This algorithm uses constant memory to approximate the `found` value within ±2% of the actual value.
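The thread doesn't name the algorithm, but constant-memory cardinality estimation with a small bounded error is the territory of sketches like HyperLogLog. A toy illustration of the idea — not Typesense's actual implementation:

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog: estimates the number of distinct values seen
    using a constant 2**p bytes of state, regardless of input size."""

    def __init__(self, p: int = 14):
        self.p = p                       # 2**14 = 16384 registers -> ~0.8% std error
        self.m = 1 << p
        self.registers = bytearray(self.m)

    def add(self, value: str) -> None:
        # 64-bit hash of the value (e.g. a group key).
        h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                   # top p bits pick a register
        w = h & ((1 << (64 - self.p)) - 1)         # remaining 64-p bits
        rank = (64 - self.p) - w.bit_length() + 1  # leading zeros + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        e = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if e <= 2.5 * self.m and zeros:            # small-range correction
            e = self.m * math.log(self.m / zeros)
        return e


hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"group-{i}")
print(round(hll.estimate()))  # close to 100000, typically within a percent or two
```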
v
Okay, thank you. So your hypothesis is that it went OOM because of some query with a group_by, right? Until we move to v29, is there anything we can do to prevent it from happening on v28?
k
Yes, I suspect a bad query blowing up memory. You'd have to use the RC build, if that works for you.
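One stopgap worth considering until the upgrade — a suggestion of this write-up rather than something prescribed in the thread — is to cap query runtime with the documented `search_cutoff_ms` search parameter, so a pathological `group_by` query returns partial results instead of running unbounded (this bounds time, and may or may not curb a memory spike in every case). The collection and field names below are placeholders:

```python
import typesense  # official Typesense Python client

client = typesense.Client({
    "nodes": [{"host": "xxx.a1.typesense.net", "port": 443, "protocol": "https"}],
    "api_key": "YOUR_SEARCH_API_KEY",  # placeholder
    "connection_timeout_seconds": 5,
})

# search_cutoff_ms asks the server to stop searching after the given budget
# and return whatever it has found so far (results are marked as partial).
results = client.collections["products"].documents.search({
    "q": "shoe",
    "query_by": "name",
    "group_by": "brand",
    "search_cutoff_ms": 500,  # illustrative budget
})
print(results.get("search_cutoff"), results.get("found"))
```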