Currently, my Typesense Cloud cluster is unhealthy...
# community-help
i
Currently, my Typesense Cloud cluster is unhealthy. What can I do?
😢
j
Hmm, could you share your cluster ID?
i
ID:
bw8oxjzkvrs3l562p
j
Taking a look...
May I know how many collections you had on this cluster?
i
This cluster has 1 collection:
clips
j
Still investigating... It looks like the Typesense process might be deadlocked for some reason. For now I've restarted it, but I'm still looking into whether it was indeed a deadlock or something else. The cluster should be back up in about 10 mins.
@Issei Yuki it's back up now.
Still investigating root cause
i
Thank you, Jason 👍 Let me know if you find out what’s causing this.
👍 1
j
Do you happen to have a cron that runs every hour on your side?
i
Yes, I have one. Is that what was causing the load?
I've stopped that cron for now.
j
Is there anything special about the cron run on 4/16 at 0200hrs UTC? The load was particularly heavy at that time, and there was sustained load on the cluster afterward...
Also, may I know what Typesense API endpoints the cron job hits?
i
Every hour after that time, the cron ran to retrieve 900,000 records, 250 records per page.
j
Was this a new change after 4/16 0200hrs UTC, or did the same cron job run before this as well?
i
The cron was requesting `/documents/search`.
I changed the cron job at 4/16 0142 UTC. The modified cron first ran at 4/16 0200 UTC and hourly thereafter.
j
Ah ok. Could you share the search parameters you're using for each API call to Typesense, or even the code snippet you're using to call the Typesense search endpoint? Feel free to DM me this, if it's sensitive to share in the main channel
i
The requests used parameters like this:
```php
while (true) {
    $page++;
    $params = [
        'q' => '*',
        'filter_by' => 'vc:>100',
        'include_fields' => 'id,b.l,at',
        'page' => $page,
        'per_page' => 250,
    ];
    // ...
}
```
Before 4/16 0142 UTC, the filter was:
```php
'filter_by' => 'vc:>1000',
```
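(For context, here is a minimal sketch of how the elided body of that loop might look end to end, assuming the official typesense-php client. The client setup, host, API key, and stop condition are assumptions for illustration; only the search parameters and the `clips` collection come from this thread.)

```php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use Typesense\Client;

// Placeholder host and API key; substitute your own cluster details.
$client = new Client([
    'api_key' => 'xyz',
    'nodes'   => [
        ['host' => 'xxx.a1.typesense.net', 'port' => '443', 'protocol' => 'https'],
    ],
]);

$page = 0;
while (true) {
    $page++;
    $params = [
        'q' => '*',
        'filter_by' => 'vc:>100',
        'include_fields' => 'id,b.l,at',
        'page' => $page,
        'per_page' => 250,
    ];
    $results = $client->collections['clips']->documents->search($params);

    foreach ($results['hits'] as $hit) {
        // ... process $hit['document'] ...
    }

    // Stop after consuming the last partial (or empty) page.
    if (count($results['hits']) < 250) {
        break;
    }
}
```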
j
Cool, I was going to ask.
May I know the total number of records that were returned with `'filter_by' => 'vc:>1000'`, before the change? A ballpark is fine.
Trying to see if the number of records in the search results might have caused an issue.
i
That’s about 200,000 records.
Right now, with `'filter_by' => 'vc:>1000'`:
```json
{
  "facet_counts": [],
  "found": 207397,
  ...
```
And with `'filter_by' => 'vc:>100'`:
```json
{
  "facet_counts": [],
  "found": 927449,
  ...
```
j
Could you manually run the cron job now with `vc:>100`? I'd like to see if we can replicate the issue.
i
Sure. Hold on a second.
👍 1
j
If we can indeed replicate the issue, this could potentially be due to a bug with numeric filtering in v0.19 that we fixed in v0.20. We can then upgrade you to the latest v0.20 RC build and see if that filter fix also resolves this issue.
i
The cron is running now
👍 1
j
So far so good on my end
Do you know how many records have been paged through so far?
i
about 330,000 records
I sent a simple search request to the cluster, but the response seems a little slow.
j
Yeah, I just noticed that the health endpoint is also slowing down in response times.
i
`/documents/search?q=hello&query_by=t&per_page=10`
j
So the node is slowly starting to get saturated
Also, do you have the cron running one search at a time, or is there any parallelization involved?
i
The cron runs one search at a time. Not parallel.
j
Got it. Question: may I know if you're essentially trying to export a subset of the data with this cron job? Does the order of records returned matter for your use case?
i
Oops, this cron job was executed at 04:57 and 05:00 UTC.
j
Oh so there are two instances running in parallel?
I wonder if this happened previously as well... Maybe the cron job from one hour overlapped with the cron from the next hour, if the cron took that long to complete?
i
Yes, there were a manual execution and a scheduled execution 😢
I will change the schedule.
> Does the order of records returned matter for your use case?
I don’t care in this case.
j
If the goal is to export a subset of filtered data, then I wonder if we should add a feature that allows you to pass a `filter_by` param to the `documents/export` endpoint; that might be better suited for your use case.
The search endpoint is optimized for quick sorted+filtered results for text-based search, not for bulk exports of data. The `documents/export` endpoint is meant for that...
Would that work for you?
i
That’s nice 😄
j
Ok cool! I see that you also use the `include_fields` param in your search query. Do you use, or see a need for, any other search param to be supported in the export endpoint?
i
`filter_by` and `include_fields` are enough for me!
j
Ok great. I'll add this to the backlog. In the meantime, could you check how long the current cron job takes to run and make sure two cron jobs don't overlap?
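(A common way to guarantee non-overlapping runs, independent of the schedule, is a non-blocking exclusive lock taken at the top of the cron script. A minimal sketch; the lock-file path is arbitrary.)

```php
<?php
// Guard against overlapping cron runs: take a non-blocking exclusive lock
// on a well-known file and exit immediately if another run still holds it.
$lock = fopen('/tmp/typesense-export-cron.lock', 'c'); // arbitrary path
if ($lock === false || !flock($lock, LOCK_EX | LOCK_NB)) {
    exit(0); // a previous run is still in progress; skip this one
}

// ... run the paginated export here ...

flock($lock, LOCK_UN);
fclose($lock);
```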
i
Thank you! Yes, I already changed the schedule.
👍 1
k
@Issei Yuki We've added support for filter_by and include_fields in this RC build: https://github.com/typesense/typesense/issues/283#issuecomment-869189177 If you can give it a spin locally, and want to upgrade, I can upgrade your cloud cluster.
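(For reference, a hypothetical sketch of what the cron could look like on that build: a single `documents/export` call instead of the hourly pagination loop. The host and API key are placeholders; the `filter_by` and `include_fields` values are the ones from the cron above.)

```php
<?php
// One export call replacing the paginated search loop.
$host   = 'https://xxx.a1.typesense.net'; // placeholder Cloud host
$apiKey = 'xyz';                          // placeholder API key
$query  = http_build_query([
    'filter_by'      => 'vc:>100',
    'include_fields' => 'id,b.l,at',
]);

$ch = curl_init("$host/collections/clips/documents/export?$query");
curl_setopt($ch, CURLOPT_HTTPHEADER, ["X-TYPESENSE-API-KEY: $apiKey"]);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$jsonl = curl_exec($ch); // response is JSONL: one document per line
curl_close($ch);

// A production version would stream the response instead of buffering
// ~900k lines in memory; this just shows the request shape.
foreach (explode("\n", trim($jsonl)) as $line) {
    $doc = json_decode($line, true);
    // ... handle $doc ...
}
```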
i
That’s great! Could you upgrade my cluster?
By the way, my cluster is unhealthy now.
k
It looks like the CPU is pegged at 100% now, so it's showing up as unhealthy in the dashboard. Here's the last 2 days of the CPU profile.
And the last 30 mins. Hopefully once we upgrade and you switch to using the export endpoint, this CPU issue will go away.
We're going to announce general availability of 0.21 in a day or so. Can we upgrade your cluster once the GA is out?
j
Just got a page that the cluster is unresponsive. Looks like the CPU has been pegged for a sufficient amount of time to affect availability. I've gone ahead and upgraded the cluster to the latest RC build for v0.21. @Issei Yuki Could you update your integration to use the documents/export endpoint as described here: https://github.com/typesense/typesense/issues/283#issuecomment-869189177
i
Thank you guys! I’m checking now…
I need a few hours to integrate the new export endpoint, so I've stopped the cron job we discussed for now.
👍 1