Currently, my Typesense Cloud cluster is unhealthy...
# community-help
i
Currently, my Typesense Cloud cluster is unhealthy. What can I do?
😢
j
Hmm, could you share your cluster ID?
i
ID:
bw8oxjzkvrs3l562p
j
Taking a look...
May I know how many collections you had on this cluster?
i
This cluster has 1 collection:
clips
j
Still investigating... It looks like the Typesense process might be deadlocked for some reason. For now I've restarted it, but I'm still looking into whether it was indeed a deadlock or something else. The cluster should be back up in about 10 mins.
@Issei Yuki it's back up now.
Still investigating root cause
i
Thank you, Jason 👍 Let me know if you find out what’s causing this.
👍 1
j
Do you happen to have a cron that runs every hour on your side?
i
Yes, I have one. Is that what was causing the load?
I've stopped that cron for now.
j
Is there anything special about the cron run on 4/16 at 0200hrs UTC? The load was particularly heavy at that time, and there was sustained load on the cluster afterward...
Also, may I know what Typesense API endpoints the cron job hits?
i
Every hour after that time, the cron ran to retrieve 900,000 records, 250 records per page.
j
Was this a new change after 4/16 0200hrs UTC, or did the same cron job run before this as well?
i
The cron was requesting `/documents/search`.
I changed the cron job at 4/16 0142 UTC. The modified cron first ran at 4/16 0200 UTC and hourly thereafter.
j
Ah ok. Could you share the search parameters you're using for each API call to Typesense, or even the code snippet you're using to call the Typesense search endpoint? Feel free to DM me this, if it's sensitive to share in the main channel
i
The requests used parameters like this:
```php
while (true) {
    $page++;
    $params = [
        'q' => '*',
        'filter_by' => 'vc:>100',
        'include_fields' => 'id,b.l,at',
        'page' => $page,
        'per_page' => 250,
    ];
    // ...
}
```
Before 4/16 0142 UTC, the filter was:
```php
'filter_by' => 'vc:>1000',
```
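(For context, here is a minimal sketch of how the elided body of that loop might look end to end, assuming the official typesense-php client. The client setup, host, API key, and stop condition are assumptions for illustration; only the search parameters and the `clips` collection come from this thread.)

```php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use Typesense\Client;

// Placeholder host and API key; substitute your own cluster details.
$client = new Client([
    'api_key' => 'xyz',
    'nodes'   => [
        ['host' => 'xxx.a1.typesense.net', 'port' => '443', 'protocol' => 'https'],
    ],
]);

$page = 0;
while (true) {
    $page++;
    $params = [
        'q' => '*',
        'filter_by' => 'vc:>100',
        'include_fields' => 'id,b.l,at',
        'page' => $page,
        'per_page' => 250,
    ];
    $results = $client->collections['clips']->documents->search($params);

    foreach ($results['hits'] as $hit) {
        // ... process $hit['document'] ...
    }

    // Stop after consuming the last partial (or empty) page.
    if (count($results['hits']) < 250) {
        break;
    }
}
```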
j
Cool, I was going to ask.
May I know the total number of records that were returned with `'filter_by' => 'vc:>1000'`, before the change? A ballpark is fine.
Trying to see if the number of records in the search results might have caused an issue.
i
That’s about 200,000 records.
Right now, with `'filter_by' => 'vc:>1000'`:
```json
{
  "facet_counts": [],
  "found": 207397,
  ...
```
And with `'filter_by' => 'vc:>100'`:
```json
{
  "facet_counts": [],
  "found": 927449,
  ...
```
j
Could you manually run the cron job now with `vc:>100`? I'd like to see if we can replicate the issue.
i
Sure. Hold on a second.
👍 1
j
If we can indeed replicate the issue, this could potentially be due to a bug with numeric filtering in v0.19 that we fixed in v0.20. We can then upgrade you to the latest v0.20 RC build and see if that filter fix also resolves this issue.
i
The cron is running now
👍 1
j
So far so good on my end
Do you know how many records have been paged through so far?
i
about 330,000 records
I sent a simple search request to the cluster, but the response seems a little slow.
j
Yeah, I just noticed that the health endpoint is also slowing down in response times.
i
`/documents/search?q=hello&query_by=t&per_page=10`
j
So the node is slowly starting to get saturated
Also, do you have the cron running one search at a time, or is there any parallelization involved?
i
The cron runs one search at a time. Not parallel.
j
Got it. Question: may I know if you're essentially trying to export a subset of the data with this cron job? Does the order of records returned matter for your use case?
i
Oops, this cron job was executed at 04:57 and 05:00 UTC.
j
Oh so there are two instances running in parallel?
I wonder if this happened previously as well... Maybe the cron job from one hour overlapped with the cron from the next hour, if the cron took that long to complete?
i
Yes, there were a manual execution and a scheduled execution 😢
I will change the schedule.
> Does the order of records returned matter for your use case?
I don’t care in this case.
j
If the goal is to export a subset of filtered data, then I wonder if we should add a feature that allows you to pass a `filter_by` param to the `documents/export` endpoint; that might be better suited for your use case.
The search endpoint is optimized for quick sorted+filtered results for text-based search, not for bulk exports of data. The `documents/export` endpoint is meant for that...
Would that work for you?
i
That’s nice 😄
j
Ok cool! I see that you also use the `include_fields` param in your search query. Do you use, or see a need for, any other search param to be supported in the export endpoint?
i
`filter_by` and `include_fields` are enough for me!
j
Ok great. I'll add this to the backlog. In the meantime, could you check how long the current cron job takes to run and make sure two cron jobs don't overlap?
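(A common way to guarantee non-overlapping runs, independent of the schedule, is a non-blocking exclusive lock taken at the top of the cron script. A minimal sketch; the lock-file path is arbitrary.)

```php
<?php
// Guard against overlapping cron runs: take a non-blocking exclusive lock
// on a well-known file and exit immediately if another run still holds it.
$lock = fopen('/tmp/typesense-export-cron.lock', 'c'); // arbitrary path
if ($lock === false || !flock($lock, LOCK_EX | LOCK_NB)) {
    exit(0); // a previous run is still in progress; skip this one
}

// ... run the paginated export here ...

flock($lock, LOCK_UN);
fclose($lock);
```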
i
Thank you! Yes, I already changed the schedule.
👍 1
k
@Issei Yuki We've added support for filter_by and include_fields in this RC build: https://github.com/typesense/typesense/issues/283#issuecomment-869189177 If you can give it a spin locally, and want to upgrade, I can upgrade your cloud cluster.
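(For reference, a hypothetical sketch of what the cron could look like on that build: a single `documents/export` call instead of the hourly pagination loop. The host and API key are placeholders; the `filter_by` and `include_fields` values are the ones from the cron above.)

```php
<?php
// One export call replacing the paginated search loop.
$host   = 'https://xxx.a1.typesense.net'; // placeholder Cloud host
$apiKey = 'xyz';                          // placeholder API key
$query  = http_build_query([
    'filter_by'      => 'vc:>100',
    'include_fields' => 'id,b.l,at',
]);

$ch = curl_init("$host/collections/clips/documents/export?$query");
curl_setopt($ch, CURLOPT_HTTPHEADER, ["X-TYPESENSE-API-KEY: $apiKey"]);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$jsonl = curl_exec($ch); // response is JSONL: one document per line
curl_close($ch);

// A production version would stream the response instead of buffering
// ~900k lines in memory; this just shows the request shape.
foreach (explode("\n", trim($jsonl)) as $line) {
    $doc = json_decode($line, true);
    // ... handle $doc ...
}
```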
i
That’s great! Could you upgrade my cluster?
By the way, my cluster is unhealthy now.
k
It looks like the CPU is pegged at 100% now, so it's showing up as unhealthy in the dashboard. Here's the last 2 days of the CPU profile.
And the last 30 mins. Hopefully once we upgrade and you switch to using the export endpoint, this CPU issue will go away.
We're going to announce general availability of 0.21 in a day or so. Can we upgrade your cluster once the GA is out?
j
Just got a page that the cluster is unresponsive. Looks like the CPU has been pegged for a sufficient amount of time to affect availability. I've gone ahead and upgraded the cluster to the latest RC build for v0.21. @Issei Yuki Could you update your integration to use the documents/export endpoint as described here: https://github.com/typesense/typesense/issues/283#issuecomment-869189177
i
Thank you guys! I’m checking now…
I need a few hours to integrate the new export endpoint, so I've stopped the cron job we discussed for now.
👍 1