#community-help

Production Instance Crash During Re-indexing Process

TLDR Greg experienced a cluster crash during re-indexing. Jason identified the issue and resolved it by upgrading resources and adjusting the retryIntervalSeconds parameter.

Powered by Struct AI
6
6mo
Solved
Join the chat
May 31, 2023 (6 months ago)
Greg
Photo of md5-11c0f771a29e2aa8d72ae9544cc39017
Greg
07:37 PM
I am running into an issue with our production instance. We have a process to re-index search. Basically, it deletes the collection, recreates it, then writes all of the records to Typesense. The search records mostly consist of books that are chunked into snippets. Each book can have thousands of records. We currently have about 550 books. I ran the process to re-index the records, and it appears to have crashed our cluster. Can someone please help me?
Jason
Photo of md5-8813087cccc512313602b6d9f9ece19f
Jason
08:49 PM
(Debugging over DM)
09:10
Jason
09:10 PM
Summary: With several parallel imports, there wasn’t enough CPU to process all of them, so Typesense should have returned a 503, and then each concurrent client ended up retrying the request again with 0.1s (the default retry timeout in the client), essentially causing a thundering herd issue and maxing out disk space with all the write data.

This happened even with connectionTimeoutSeconds set to 540.
09:10
Jason
09:10 PM
Once disk space was exhausted, the cluster went into an error state and had to be manually recovered
09:14
Jason
09:14 PM
We’re now upgrading to 4vCPU and 8GB RAM, which should give 40GB of disk space, and also setting retryIntervalSeconds to a random number between 10 - 60s, so we avoid another thundering herd issue during retries
09:38
Jason
09:38 PM
This solved the issue