# community-help
g
I am running into an issue with our production instance. We have a process to re-index search. Basically, it deletes the collection, recreates it, then writes all of the records to Typesense. The search records mostly consist of books that are chunked into snippets. Each book can have thousands of records. We currently have about 550 books. I ran the process to re-index the records, and it appears to have crashed our cluster. Can someone please help me?
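For context, the re-index flow described (delete the collection, recreate it, then import every chunked record) might look roughly like the sketch below. The collection name, snippet size, and record fields are assumptions for illustration, not taken from the actual setup; the Typesense calls are shown as comments since they depend on the real schema.

```python
import textwrap

def chunk_book(book_id: str, text: str, snippet_chars: int = 200) -> list[dict]:
    """Split one book's text into snippet records for import.

    Each book can yield thousands of these records, which is why a full
    re-index of ~550 books produces a very large import.
    """
    snippets = textwrap.wrap(text, snippet_chars)
    return [
        {"id": f"{book_id}-{i}", "book_id": book_id, "snippet": s}
        for i, s in enumerate(snippets)
    ]

# With the typesense-python client, the re-index would then be roughly:
#   client.collections["books"].delete()
#   client.collections.create(schema)
#   client.collections["books"].documents.import_(records, {"action": "create"})
```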
j
(Debugging over DM)
Summary: With several parallel imports, there wasn't enough CPU to process all of them, so Typesense returned 503s, and each concurrent client then retried the request after 0.1s (the client's default retry interval), causing a thundering herd that maxed out disk space with all the queued write data. This happened even with `connectionTimeoutSeconds` set to `540`.
Once disk space was exhausted, the cluster went into an error state and had to be recovered manually.
We're now upgrading to 4 vCPUs and 8GB RAM, which comes with 40GB of disk space, and also setting `retryIntervalSeconds` to a random value between 10 and 60 seconds, so we avoid another thundering herd during retries.
This solved the issue
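For reference, the randomized retry interval can be computed once per client at construction time. A minimal sketch, assuming the typesense-python client's configuration keys (`connection_timeout_seconds`, `retry_interval_seconds`); the host and API key are placeholders:

```python
import random

def jittered_retry_interval(low: float = 10.0, high: float = 60.0) -> float:
    """Pick a random retry interval in seconds, so that concurrent
    importers don't all retry at the same instant (thundering herd)."""
    return random.uniform(low, high)

# Sketch of the client configuration (parameter names assumed):
config = {
    "nodes": [{"host": "xxx.a1.typesense.net", "port": 443, "protocol": "https"}],
    "api_key": "YOUR_API_KEY",
    "connection_timeout_seconds": 540,
    "retry_interval_seconds": jittered_retry_interval(),
}
```

Because each client draws its own interval, retries after a 503 spread out over the 10-60s window instead of all landing 0.1s later.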