#community-help

Resolving Typesense Cloud Cluster Issue with Cron Job

TLDR Issei reported an unhealthy Typesense Cloud cluster. With help from Jason and Kishore Nallan, the cause was traced to an hourly cron job that paged through about 900,000 records via the search endpoint, with runs sometimes overlapping. The agreed fix was to avoid overlapping runs and move the job to the documents/export endpoint, which gained filter_by and include_fields support in the v0.21 RC build.

Apr 17, 2021 (32 months ago)
Issei
03:38 AM
Currently, my Typesense Cloud cluster is unhealthy. What can I do?
Issei
03:39 AM
😢
Jason
03:39 AM
Hmm, could you share your cluster ID?
Issei
03:39 AM
ID: bw8oxjzkvrs3l562p
Jason
03:40 AM
Taking a look...
Jason
03:49 AM
May I know how many collections you had on this cluster?
Issei
03:51 AM
This cluster has 1 collection: clips
Jason
04:06 AM
Still investigating... But it looks like the Typesense process might have deadlocked for some reason. For now I've restarted it, but I'm still looking into whether it was indeed a deadlock or something else.

Cluster should be back up in about 10 mins
Jason
04:11 AM
Issei it's back up now.
Jason
04:11 AM
Still investigating root cause
Issei
04:12 AM
Thank you, Jason 👍 Let me know if you find out what’s causing this.

Jason
04:15 AM
Do you happen to have a cron that runs every hour on your side?
Issei
04:16 AM
Yes, I have it. Is that what was causing the load?
Issei
04:18 AM
Currently, I have stopped that cron.
Jason
04:18 AM
Is there anything special about the cron run on 4/16 at 0200hrs UTC? It seems like the load was particularly heavy at the time and after that there was sustained load on the cluster...
Jason
04:28 AM
Also, may I know what Typesense API endpoints the cron job hits?
Issei
04:28 AM
Every hour after that time, a cron was executed to retrieve 900,000 records, 250 records per page.
Jason
04:30 AM
Was this a new change after 4/16 0200hrs UTC, or did the same cron job run before this as well?
Issei
04:30 AM
The cron was requesting /documents/search
Issei
04:33 AM
I changed the cron job at 4/16 0142 UTC. This modified cron was first run at 4/16 0200 UTC and hourly thereafter.
Jason
04:34 AM
Ah ok. Could you share the search parameters you're using for each API call to Typesense, or even the code snippet you're using to call the Typesense search endpoint? Feel free to DM me this, if it's sensitive to share in the main channel
Issei
04:41 AM
Requested with parameters like this:
// Page through every matching record, 250 per request.
while (true) {
    $page++;
    $params = [
        'q' => '*',
        'filter_by' => 'vc:>100',
        'include_fields' => 'id,b.l,at',
        'page' => $page,
        'per_page' => 250,
    ];
    ...
}
Issei
04:44 AM
Before 4/16 0142 UTC:
'filter_by' => 'vc:>1000',
Jason
04:44 AM
Cool, I was going to ask.
Jason
04:45 AM
May I know the total number of records that were returned with 'filter_by' => 'vc:>1000', before the change?
Jason
04:45 AM
a ballpark is fine
Jason
04:46 AM
Trying to see if the number of records in the search results might have caused an issue
Issei
04:47 AM
That’s about 200,000 records.
Issei
04:49 AM
Right now

'filter_by' => 'vc:>1000'
{
  "facet_counts": [],
  "found": 207397,

'filter_by' => 'vc:>100'
{
  "facet_counts": [],
  "found": 927449,
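For scale, the two `found` counts above translate into very different numbers of search requests per cron run at 250 records per page. A rough back-of-the-envelope sketch (the helper name `pagesNeeded` is made up for illustration):

```php
<?php
// Hypothetical helper: how many paged search requests it takes to walk
// through every matching record at a given page size.
function pagesNeeded(int $found, int $perPage): int
{
    return intdiv($found + $perPage - 1, $perPage); // ceiling division
}

$perPage = 250;
$before = pagesNeeded(207397, $perPage); // filter_by vc:>1000
$after  = pagesNeeded(927449, $perPage); // filter_by vc:>100

echo "vc:>1000 => {$before} requests per run\n";
echo "vc:>100  => {$after} requests per run\n";
```

That is about 830 sequential search requests per run before the change and about 3,710 after it, roughly 4.5 times as many.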
Jason
04:53 AM
Could you manually run the cron job now with vc:>100? I'd like to see if we can replicate the issue.
Issei
04:56 AM
Sure. Hold on a second.

Jason
04:56 AM
If we can indeed replicate the issue, this could potentially be due to a related bug with numeric filtering in v0.19 that we fixed in v0.20. We can then upgrade you to the latest v0.20 RC build and see if that filter fix also fixes this issue.
Issei
04:57 AM
The cron is running now

Jason
05:09 AM
So far so good on my end
Jason
05:13 AM
Do you know how many records have been paged through so far?
Issei
05:15 AM
about 330,000 records
Issei
05:18 AM
I sent a simple search request to the cluster, but the response seems a little slow.
Jason
05:19 AM
Yeah, I just noticed that the health endpoint is also slowing down in response times.
Issei
05:19 AM
/documents/search?q=hello&query_by=t&per_page=10
Jason
05:19 AM
So the node is slowly starting to get saturated
Jason
05:25 AM
Also, do you have the cron running one search at a time, or is there any parallelization involved?
Issei
05:30 AM
The cron runs one search at a time. Not parallel.
Jason
05:33 AM
Got it. Question: may I know if you're essentially trying to export a subset of the data with this cron job? Does the order of records returned matter for your use case?
Issei
05:34 AM
Oops, this cron job was executed at 04:57 and 05:00 UTC.
Jason
05:35 AM
Oh so there are two instances running in parallel?
Jason
05:35 AM
I wonder if this happened previously as well... Maybe the cron job from one hour overlapped with the cron job from the next hour, if it took that long to complete?
Issei
05:36 AM
Yes, a manual execution and a scheduled execution are both running 😢
Issei
05:39 AM
I will change the schedule.

> Does the order of records returned matter for your use case?
I don’t care in this case.
Jason
05:42 AM
If the goal is to export a filtered subset of the data, then I wonder if we should add a feature that lets you pass a filter_by param to the documents/export endpoint. That might be better suited to your use case.
Jason
05:44 AM
The search endpoint is optimized for quick sorted+filtered results for text-based search, not for bulk exports of data. We meant the documents/export endpoint to be that...
Jason
05:44 AM
Would that work for you?
Issei
05:45 AM
That’s nice 😄
Jason
05:48 AM
Ok cool! I see that you also use the include_fields param in your search query. Do you use or see a need for any other search param to be supported in the export endpoint?
Issei
05:54 AM
filter_by and include_fields are enough for me!
Jason
05:55 AM
Ok great. I'll add this to the backlog. In the meantime, could you check how long the current cron job takes to run and make sure two cron jobs don't overlap?
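One common way to keep two runs from overlapping, whether scheduled or manual, is to have the job take a non-blocking exclusive file lock at startup and exit if it cannot get it. A minimal sketch, assuming the cron job is a PHP script; the function name `acquireCronLock` and the lock file path are made up for illustration:

```php
<?php
// Sketch: take an exclusive, non-blocking lock so that a second run
// exits immediately instead of overlapping with the first one.
// Returns the open file handle on success, or null if the lock is busy.
function acquireCronLock(string $lockPath)
{
    $handle = fopen($lockPath, 'c'); // create if missing, do not truncate
    if ($handle === false) {
        return null;
    }
    if (!flock($handle, LOCK_EX | LOCK_NB)) {
        fclose($handle);
        return null; // another run currently holds the lock
    }
    return $handle; // keep this open for the lifetime of the job
}

// Assumed lock location; pick a path that suits your host.
$lock = acquireCronLock(sys_get_temp_dir() . '/clips-export.lock');
if ($lock === null) {
    echo "Previous run still in progress, skipping this one.\n";
    exit(0);
}

// ... run the export here ...

flock($lock, LOCK_UN);
fclose($lock);
```

The lock is also released when the process exits, so a crashed run should not block later ones.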
Issei
06:03 AM
Thank you! Yes, I already changed the schedule.

Jun 28, 2021 (29 months ago)
Kishore Nallan
03:47 AM
Issei We've added support for filter_by and include_fields in this RC build: https://github.com/typesense/typesense/issues/283#issuecomment-869189177

If you can give it a spin locally, and want to upgrade, I can upgrade your cloud cluster.
Jul 05, 2021 (29 months ago)
Issei
03:43 AM
That’s great! Could you upgrade my cluster?
Issei
03:45 AM
By the way, my cluster is unhealthy now.
Kishore Nallan
03:48 AM
It looks like the CPU is pegged to 100% now so it's showing up as unhealthy in the dashboard. Here's the last 2 days of CPU profile.
Kishore Nallan
03:49 AM
Last 30 mins. Hopefully, once we upgrade and you switch to the export endpoint, this CPU issue will go away.
Kishore Nallan
03:51 AM
We're going to announce general availability of 0.21 in a day or so. Can we upgrade your cluster once the GA is out?
Jason
04:44 PM
Just got a page that the cluster is unresponsive. Looks like the CPU has been pegged for long enough to affect availability. I've gone ahead and upgraded the cluster to the latest RC build for v0.21.

Issei Could you update your integration to use the documents/export endpoint as described here: https://github.com/typesense/typesense/issues/283#issuecomment-869189177
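As a rough idea of what the switch could look like, here is a sketch that builds an export request with the same filter_by and include_fields values as the search-based job and parses the JSON Lines response (one document per line). The host name, the helper names, and the sample response are placeholders, not taken from the thread; the linked GitHub comment is the reference for the actual parameters.

```php
<?php
// Sketch only: host, API key and sample data are placeholders.

// Build the export URL for a collection with a filter and a field list.
function buildExportUrl(string $host, string $collection, array $params): string
{
    return sprintf(
        'https://%s/collections/%s/documents/export?%s',
        $host,
        rawurlencode($collection),
        http_build_query($params)
    );
}

// Parse a JSON Lines body: one JSON document per non-empty line.
function parseJsonLines(string $body): array
{
    $documents = [];
    foreach (explode("\n", $body) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue; // skip blank lines, e.g. after a trailing newline
        }
        $documents[] = json_decode($line, true, 512, JSON_THROW_ON_ERROR);
    }
    return $documents;
}

// Fetch the export in one request (not called here; needs a real cluster).
function fetchExport(string $url, string $apiKey): string
{
    $context = stream_context_create([
        'http' => ['header' => "X-TYPESENSE-API-KEY: {$apiKey}\r\n"],
    ]);
    $body = file_get_contents($url, false, $context);
    if ($body === false) {
        throw new RuntimeException("Export request failed: {$url}");
    }
    return $body;
}

$url = buildExportUrl('your-cluster.example.com', 'clips', [
    'filter_by' => 'vc:>100',
    'include_fields' => 'id,b.l,at',
]);
echo $url, "\n";

// Made-up sample response, just to show the parsing step.
$sample = "{\"id\":\"1\",\"at\":1618538400}\n{\"id\":\"2\",\"at\":1618542000}\n";
print_r(parseJsonLines($sample));
```

With around 900,000 matching records, reading the whole response into memory as `fetchExport` does may be too heavy; processing the stream line by line would likely be a better fit for a real job.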
Jul 06, 2021 (29 months ago)
Issei
01:13 AM
Thank you guys! I’m checking now…
Issei
02:19 AM
I need a few hours to integrate the new export endpoint, so I've stopped the cron job we discussed for now.
