Hi our cluster w HA just went down after trying to add a few typesense #community-help

Anirudh Atodaria

12/16/2024, 10:22 PM

Hi, our cluster (w/HA) just went down after trying to add a few fields (one of them being an auto facet field -- https://typesense.org/docs/27.1/api/collections.html#with-auto-schema-detection). Initially we got the error "422 other operation in progress" (or something similar). After this, all the nodes went down (it started working again after 15 mins, with the added fields). Was this because we tried to add a few fields? (I can imagine this being the case, as we have a large number of documents.) In addition to that, we keep seeing high CPU usage (even after upgrading to 8vCPU). (While we do see a higher number of searches per second, I'm not sure it should affect the CPU this much.) Is there a way to look at the logs to figure out what some of the requests are? (We aren't fetching the embedding field; however, some/many of our documents are quite large. Could that be a reason?) prev convo: https://typesense-community.slack.com/archives/C01P749MET0/p1733959104496539

Jason Bosco

12/16/2024, 10:28 PM

It looks like you were already pretty close to exhausting RAM prior to adding the new fields. And then when the new fields were added, that exhausted all available RAM on two of the nodes and so the OS ended up killing the Typesense process to preserve itself. This then caused the cluster to go down. You'd need sufficient RAM to hold your entire dataset in RAM (since Typesense is an in-memory datastore), otherwise without sufficient RAM behavior becomes unpredictable even in an HA cluster. To solve for this, you want to upgrade to the next RAM tier by going to Cluster Configuration > Modify

Anirudh Atodaria

12/16/2024, 10:33 PM

Thank you for the quick response! We've only utilized ~50-55% of the RAM as of now! (This is from the last 15 mins -- I do see higher usage throughout the day for some reason.) We usually sit around 50-60% memory usage. 😕 In any case, high memory usage explains the issue.

Jason Bosco

12/16/2024, 10:35 PM

Searches (especially vector searches and auto-embedding) require additional RAM depending on the query. So you'd need more head-room than what's available to handle both the data and the searches

Anirudh Atodaria

12/16/2024, 10:35 PM

Also, do you think having large documents has any effect on CPU usage when performing search?

Jason Bosco

12/16/2024, 10:36 PM

Looking at recent slow queries that took more than 30s, almost all of them use auto-embedding. Auto-embedding requires running the ML model at search time and all the models are CPU resource intensive to run. So that's what's causing the huge spikes

🫡 1

👍 1

Anirudh Atodaria

12/16/2024, 10:36 PM

Ah I see, we're seeing a bit higher search volume if I'm not wrong, and since we use vector search, it's using up more RAM. Thank you for clearing that up. 🙏

👍 1

Jason Bosco

12/16/2024, 10:39 PM

So yeah, long story short - for your type of queries (auto-embedding + vector searches) and volume of data 32GB 16vCPUs would be better suited

👍 1

Jason Bosco

12/16/2024, 10:39 PM

This guide talks about the system requirements: https://typesense.org/docs/guide/system-requirements.html

👍 1

Anirudh Atodaria

12/16/2024, 10:42 PM

That makes sense. I'll go through the docs as well. Thank you again for your time and efforts 🙏 🙌

👍 1

Anirudh Atodaria

12/17/2024, 6:35 PM

Hi, we queued up the upgrade this morning (about 2 hours ago), and it hasn't gone through and the cluster keeps going down (it seems that search traffic increased as well and might be causing issues). Not sure, but something might be wrong?

Anirudh Atodaria

12/17/2024, 7:06 PM

I think it's going down because of high RAM usage. However, I'm not sure if that is affecting the config upgrade we have in progress.

Anirudh Atodaria

12/17/2024, 10:24 PM

The upgrade went through (16vCPU, 32GB RAM). However, the traffic went up significantly as well and CPU usage hit 100% again immediately after the upgrade. We ended up upgrading to 16vCPU, 64GB RAM, GPU and Faster Disk. (We couldn't get GPU acceleration for either the 16vCPU + 32GB RAM or the 32vCPU + 64GB RAM config for some reason.)

Jason Bosco

12/17/2024, 11:03 PM

Yeah, looks like not all nodes in the cluster were healthy when the previous config change was triggered, so it failed an internal checkpoint and we had to step in to jump-start it.

📝 1

🙌 1

Jason Bosco

12/17/2024, 11:03 PM

GPU acceleration is only available in some RAM / CPU configurations, and only in select regions

👍 1

Anirudh Atodaria

12/17/2024, 11:23 PM

okay, we just went all out: 48vCPU -- RN Typesense is not returning anything and the traffic is good for us. While the search volume is higher and we're using vector search, is it expected? (With the amount of traffic we have, with having 16vCPU and GPU acceleration?) (You probably know this already but we're on v28.0.rc27) Sorry for the vague questions here, just want to make sure that there is not something unexpected.

Jason Bosco

12/17/2024, 11:29 PM

I'm actually surprised that GPU acceleration didn't help. Which suggests to me that something else is causing high CPU usage, besides the embedding generation. We'll take a closer look and I'll keep you posted

Anirudh Atodaria

12/17/2024, 11:30 PM

Thank you so much, that'd be super helpful 🫡 🙌

Jason Bosco

12/17/2024, 11:58 PM

Ok found one issue that will definitely cause high CPU. There are semi-regular queries like this:

{
  "searches": [
    {
      "filter_by": "productId: [about 200 IDs]",
      "per_page": 250,
      "q": "*",
      "group_by": "productId",
      "group_limit": 1,
      "query_by": "name",
      "query_by_weights": "1",
      "collection": "variants_v3"
    }
  ]
}

The search_time itself is about 12ms for this query. BUT, each document is about 300KB (and embeddings are a tiny part of the docs). So fetching 250 documents results in a payload size of 75MB PER API call. Fetching this from disk and then compressing it to send it over the wire is what is causing the high I/O and hence high CPU usage. The way to solve this (besides adding a high number of CPU cores) is to reduce the payload that is sent over the wire to just the fields required to display the results, using the exclude_fields / include_fields parameters. For example, you definitely want to exclude the embedding field. I also see a large array field called offers, which seems to exist at the top level and is also repeated inside the variants field (at least from what I can tell). There's also a priceHistory field which seems to be super large. These seem to be the bulk of the document. Do you need these for display purposes on the search results page?
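
For anyone following along, here is a rough Python sketch of the suggestion above: the same multi_search body with an exclude_fields entry added, plus the back-of-the-envelope payload arithmetic. The helper names (build_search, payload_mb) are made up for illustration, "embedding" is an assumed name for the vector field (the thread never states it), and whether offers and priceHistory can really be dropped depends on what the results page needs.

```python
import json

# Fields the thread flags as large. "embedding" is an assumed field name;
# dropping "offers" and "priceHistory" assumes the UI does not display them.
HEAVY_FIELDS = ["embedding", "offers", "priceHistory"]


def build_search(product_ids, exclude_fields):
    """Build a multi_search body like the one in the thread, minus heavy fields."""
    return {
        "searches": [
            {
                "collection": "variants_v3",
                "q": "*",
                "query_by": "name",
                "filter_by": "productId: [" + ", ".join(product_ids) + "]",
                "group_by": "productId",
                "group_limit": 1,
                "per_page": 250,
                # Typesense takes a comma-separated list of field names here.
                "exclude_fields": ",".join(exclude_fields),
            }
        ]
    }


def payload_mb(docs_per_call, kb_per_doc):
    """Approximate uncompressed response size in MB (1 MB = 1000 KB)."""
    return docs_per_call * kb_per_doc / 1000


body = build_search(["p1", "p2"], HEAVY_FIELDS)
print(json.dumps(body, indent=2))
print(payload_mb(250, 300))  # 75.0, the 75MB per call mentioned above
```

If only a handful of fields are needed for display, an include_fields entry listing just those fields is the more restrictive alternative.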

Anirudh Atodaria

12/18/2024, 12:18 AM

yeah some of them are required but not all I think, we'll review this once and update the queries accordingly. Thanks a ton for the detailed breakdown, this was very helpful 🙏

👍 1
