Hi, our cluster (w/HA) just went down after trying...
# community-help
a
Hi, our cluster (w/ HA) just went down after we tried to add a few fields (one of them an auto facet field -- https://typesense.org/docs/27.1/api/collections.html#with-auto-schema-detection). Initially we got an error along the lines of "422 other operation in progress". After this, all the nodes went down (it started working again after 15 mins, with the added fields). Was this because we tried to add a few fields? (I can imagine this being the case, as we have a large number of documents.) In addition, we keep seeing high CPU usage (even after upgrading to 8 vCPU). (While we do see a higher number of searches per second, I'm not sure it should affect the CPU this much.) Is there a way to look at the logs to figure out what some of the requests are? (We aren't fetching the embedding field, but some/many of our documents are quite large; could that be a reason?) prev convo: https://typesense-community.slack.com/archives/C01P749MET0/p1733959104496539
j
It looks like you were already pretty close to exhausting RAM prior to adding the new fields. And then when the new fields were added, that exhausted all available RAM on two of the nodes and so the OS ended up killing the Typesense process to preserve itself. This then caused the cluster to go down. You'd need sufficient RAM to hold your entire dataset in RAM (since Typesense is an in-memory datastore), otherwise without sufficient RAM behavior becomes unpredictable even in an HA cluster. To solve for this, you want to upgrade to the next RAM tier by going to Cluster Configuration > Modify
a
Thank you for the quick response! We've only utilized ~50-55% of the RAM as of now! (This is from the last 15 mins -- I do see higher usage throughout the day for some reason.) We usually sit around 50-60% memory usage. 😕 In any case, high memory usage explains the issue.
j
Searches (especially vector searches and auto-embedding) require additional RAM depending on the query. So you'd need more head-room than what's available to handle both the data and the searches
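A rough way to sanity-check head-room is to compare used memory against total memory. The sketch below assumes numbers shaped like the Typesense `/metrics.json` response (byte counts as strings); the sample values and the 30% threshold are made up for illustration and are not a recommendation from this thread.

```python
def memory_headroom(metrics: dict) -> float:
    """Return the fraction of system RAM that is still free (0.0 to 1.0)."""
    total = float(metrics["system_memory_total_bytes"])
    used = float(metrics["system_memory_used_bytes"])
    return (total - used) / total


# Hypothetical sample: a 32 GiB node with 24 GiB in use.
sample = {
    "system_memory_total_bytes": str(32 * 1024**3),
    "system_memory_used_bytes": str(24 * 1024**3),
}

MIN_HEADROOM = 0.30  # assumption: illustrative threshold, tune for your workload

headroom = memory_headroom(sample)
if headroom < MIN_HEADROOM:
    print(f"Only {headroom:.0%} of RAM is free; searches may not have enough room")
```

Since usage varies through the day, checking this at peak traffic is more telling than a quiet 15-minute window.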
a
Also, do you think having large documents has any effect on CPU usage when performing a search?
j
Looking at recent slow queries that took more than 30s, almost all of them use auto-embedding. Auto-embedding requires running the ML model at search time and all the models are CPU resource intensive to run. So that's what's causing the huge spikes
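On the earlier question about finding such requests in the logs: on a self-hosted node, slow-request logging can be switched on through the `/config` endpoint with the `log-slow-requests-time-ms` setting. The sketch below only builds that request; the base URL, API key, and 2000 ms threshold are placeholders, and whether this endpoint is exposed on a managed cluster is something to confirm.

```python
import json
import urllib.request


def build_slow_log_request(base_url: str, api_key: str, threshold_ms: int) -> urllib.request.Request:
    """Build (but do not send) a POST /config request enabling slow-request logging."""
    body = json.dumps({"log-slow-requests-time-ms": threshold_ms}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/config",
        data=body,
        method="POST",
        headers={
            "X-TYPESENSE-API-KEY": api_key,
            "Content-Type": "application/json",
        },
    )


# Placeholder host and key; replace with your own node and an admin API key.
req = build_slow_log_request("http://localhost:8108", "xyz", 2000)

# To actually apply the setting:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```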
a
Ah I see, we're seeing a somewhat higher search volume if I'm not wrong, and since we use vector search, it's using up more RAM. Thank you for clearing that up. 🙏
j
So yeah, long story short - for your type of queries (auto-embedding + vector searches) and volume of data, 32GB RAM with 16 vCPUs would be better suited
This guide talks about the system requirements: https://typesense.org/docs/guide/system-requirements.html
a
That makes sense. I'll go through the docs as well. Thank you again for your time and efforts 🙏 🙌
Hi, we queued up the upgrade this morning (about 2 hours ago), but it hasn't gone through and the cluster keeps going down (it seems that search traffic increased as well and might be causing issues). Not sure, but something might be wrong?
I think it's going down because of high RAM usage. However, I'm not sure if that is affecting the config upgrade we have in progress.
The upgrade went through (16vCPU, 32GB RAM). However, the traffic went up significantly as well, and CPU usage hit 100% again immediately after the upgrade. We ended up upgrading to 16vCPU, 64GB RAM, GPU and Faster Disk. (We couldn't get GPU acceleration for either the 16vCPU + 32GB RAM or the 32vCPU + 64GB RAM config for some reason.)
j
Yeah, looks like not all nodes in the cluster were healthy when the previous config change was triggered, so it failed an internal checkpoint and we had to step in to jump-start it.
GPU acceleration is only available in some RAM / CPU configurations, and only in select regions
a
okay, we just went all out
48vCPU
-- Right now Typesense is not returning anything and the traffic is good for us. While the search volume is higher and we're using vector search, is this expected (given the amount of traffic we have, with 16vCPU and GPU acceleration)? (You probably know this already, but we're on v28.0.rc27.) Sorry for the vague questions here, I just want to make sure that nothing unexpected is going on.
j
I'm actually surprised that GPU acceleration didn't help. Which suggests to me that something else is causing high CPU usage, besides the embedding generation. We'll take a closer look and I'll keep you posted
a
Thank you so much, that'd be super helpful 🫡 🙌
j
Ok found one issue that will definitely cause high CPU. There are semi-regular queries like this:
{
  "searches": [
    {
      "filter_by": "productId: [about 200 IDs]",
      "per_page": 250,
      "q": "*",
      "group_by": "productId",
      "group_limit": 1,
      "query_by": "name",
      "query_by_weights": "1",
      "collection": "variants_v3"
    }
  ]
}
The search_time itself is about 12ms for this query. BUT, each document is about 300KB (and embeddings are a tiny part of the docs). So fetching 250 documents results in a payload size of 75MB PER API call. Fetching this from disk and then compressing it to send it over the wire is what is causing the high I/O and hence the high CPU usage. The way to solve this (besides adding a high number of CPU cores) is to reduce the payload that is sent over the wire to just the fields required to display the results, using the exclude_fields or include_fields parameter. For example, you definitely want to exclude the embedding field. I also see a large array field called offers, which seems to exist at the top level and is also repeated inside the variants field (at least from what I can tell). There's also a priceHistory field, which seems to be super large. These seem to make up the bulk of the document. Do you need these for display purposes on the search results page?
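As a minimal sketch of that change, the same search can carry an exclude_fields (or include_fields) value. The field name `embedding` is an assumption, since the actual embedding field name isn't shown in this thread, and which of offers and priceHistory can really be dropped depends on what the results page needs. The ID list stays a placeholder, as in the query above.

```python
import json

# The search from the query above, unchanged.
original_search = {
    "filter_by": "productId: [about 200 IDs]",
    "per_page": 250,
    "q": "*",
    "group_by": "productId",
    "group_limit": 1,
    "query_by": "name",
    "query_by_weights": "1",
    "collection": "variants_v3",
}


def with_trimmed_fields(search: dict, include=None, exclude=None) -> dict:
    """Return a copy of the search with include_fields / exclude_fields set."""
    trimmed = dict(search)  # copy so the original search is left untouched
    if include:
        trimmed["include_fields"] = ",".join(include)
    if exclude:
        trimmed["exclude_fields"] = ",".join(exclude)
    return trimmed


# "embedding" is a hypothetical field name; offers and priceHistory come from the thread.
slim_search = with_trimmed_fields(
    original_search, exclude=["embedding", "offers", "priceHistory"]
)
payload = {"searches": [slim_search]}
print(json.dumps(payload, indent=2))

# Back-of-envelope size of the untrimmed response, using the figures quoted above.
DOC_SIZE_KB = 300
PER_PAGE = 250
full_payload_mb = DOC_SIZE_KB * PER_PAGE / 1000
print(f"Untrimmed response: about {full_payload_mb:.0f}MB per call")
```

If only a handful of fields are displayed, passing them via `include=[...]` instead is usually the tighter option, since newly added large fields are then left out by default.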
a
Yeah, some of them are required, but not all I think. We'll review this and update the queries accordingly. Thanks a ton for the detailed breakdown, this was very helpful 🙏