# community-help
g
I am running into an issue with our production instance. We have a process to re-index search. Basically, it deletes the collection, recreates it, then writes all of the records to Typesense. The search records mostly consist of books that are chunked into snippets. Each book can have thousands of records. We currently have about 550 books. I ran the process to re-index the records, and it appears to have crashed our cluster. Can someone please help me?
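For context, the re-index flow described (delete the collection, recreate it, then import every chunked record) might look roughly like the sketch below. The collection name, snippet size, and record fields are assumptions for illustration, not taken from the actual setup; the Typesense calls are shown as comments since they depend on the real schema.

```python
import textwrap

def chunk_book(book_id: str, text: str, snippet_chars: int = 200) -> list[dict]:
    """Split one book's text into snippet records for import.

    Each book can yield thousands of these records, which is why a full
    re-index of ~550 books produces a very large import.
    """
    snippets = textwrap.wrap(text, snippet_chars)
    return [
        {"id": f"{book_id}-{i}", "book_id": book_id, "snippet": s}
        for i, s in enumerate(snippets)
    ]

# With the typesense-python client, the re-index would then be roughly:
#   client.collections["books"].delete()
#   client.collections.create(schema)
#   client.collections["books"].documents.import_(records, {"action": "create"})
```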
j
(Debugging over DM)
Summary: With several parallel imports, there wasn't enough CPU to process all of them, so Typesense returned 503s, and each concurrent client then retried the request after 0.1s (the client's default retry interval), causing a thundering herd that maxed out disk space with all the queued write data. This happened even with `connectionTimeoutSeconds` set to `540`.
Once disk space was exhausted, the cluster went into an error state and had to be recovered manually.
We're now upgrading to 4 vCPUs and 8GB RAM, which comes with 40GB of disk space, and also setting `retryIntervalSeconds` to a random value between 10 and 60 seconds, so we avoid another thundering herd during retries.
This solved the issue
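For reference, the randomized retry interval can be computed once per client at construction time. A minimal sketch, assuming the typesense-python client's configuration keys (`connection_timeout_seconds`, `retry_interval_seconds`); the host and API key are placeholders:

```python
import random

def jittered_retry_interval(low: float = 10.0, high: float = 60.0) -> float:
    """Pick a random retry interval in seconds, so that concurrent
    importers don't all retry at the same instant (thundering herd)."""
    return random.uniform(low, high)

# Sketch of the client configuration (parameter names assumed):
config = {
    "nodes": [{"host": "xxx.a1.typesense.net", "port": 443, "protocol": "https"}],
    "api_key": "YOUR_API_KEY",
    "connection_timeout_seconds": 540,
    "retry_interval_seconds": jittered_retry_interval(),
}
```

Because each client draws its own interval, retries after a 503 spread out over the 10-60s window instead of all landing 0.1s later.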