Our setup:
• Typesense cluster running in Kubernetes as a StatefulSet with 3 replicas
• Before routing traffic to a Typesense instance, Kubernetes checks its health by expecting a 200 from the healthcheck endpoint; the check runs every 10 seconds
• The write/read lag threshold is set to 500
• Indexing runs every hour, updating ~300k rows
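For reference, the healthcheck we rely on boils down to something like the sketch below: treat an HTTP 200 from the node's /health endpoint as healthy and anything else as unhealthy. The port and the standalone polling loop are just placeholders for however the probe is actually wired up.

```python
# Minimal sketch of the healthcheck described above, assuming the probe simply
# treats HTTP 200 from the node's /health endpoint as healthy. The port and
# the standalone loop are placeholders for the actual probe wiring.
import time
import requests

NODE_URL = "http://localhost:8108"   # local Typesense node (assumed port)
PROBE_PERIOD_S = 10                  # matches the 10-second Kubernetes check

def node_is_healthy() -> bool:
    """True iff the healthcheck endpoint answers with HTTP 200."""
    try:
        return requests.get(f"{NODE_URL}/health", timeout=2).status_code == 200
    except requests.RequestException:
        return False

if __name__ == "__main__":
    while True:
        print("healthy" if node_is_healthy() else "unhealthy")
        time.sleep(PROBE_PERIOD_S)
```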
When a node restarts, it (from my understanding) goes through these steps:
• Indexing from the snapshot
• Syncing with the cluster
• Marking itself as ready
If for some reason the node was restarted while indexing was in progress, it can end up in a flaky state, because it gradually pulls new data and its write/read lag keeps growing (see the toy simulation after this list):
• Indexing from the snapshot
• Starting synchronization with the cluster
• Marking itself as ready [Kubernetes starts routing traffic to it]
• Pulling new data from the cluster; the lag grows to 100, 200, 300, …
• Once the lag reaches 500, the node marks itself as unhealthy [Kubernetes stops routing traffic to it]
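To make the window concrete, here is a toy simulation (all numbers are made up, including the rate at which the lag grows): the probe only runs every 10 seconds, so a node whose lag crosses 500 right after a successful check keeps receiving traffic until the next one.

```python
# Toy simulation of the window described above (all numbers are made up).
LAG_THRESHOLD = 500        # matches the write/read lag threshold in our setup
PROBE_PERIOD_S = 10        # matches the 10-second Kubernetes healthcheck
LAG_GROWTH_PER_S = 30      # made-up rate at which lag grows while catching up

lag = 0
seconds_over_threshold_with_traffic = 0
for second in range(0, 120):
    lag += LAG_GROWTH_PER_S
    probe_ran = second % PROBE_PERIOD_S == 0
    if probe_ran and lag >= LAG_THRESHOLD:
        print(f"t={second:3d}s lag={lag:4d} -> probe fails, traffic stops")
        break
    if lag >= LAG_THRESHOLD:
        # Requests arriving during this second can still hit the node and get 503s.
        seconds_over_threshold_with_traffic += 1

print(f"node kept receiving traffic for ~{seconds_over_threshold_with_traffic}s while over the threshold")
```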
Because Kubernetes only checks the health status every 10s, a flaky node can keep serving traffic and returning 503 errors to clients until the next check. We worked around it on our side by hiding the instances behind a proxy server with `retry non_idempotent http_503`, but maybe you can suggest something else? It might also be a good idea to somehow estimate the lag on startup instead of waiting for it to gradually breach the threshold (a rough sketch of that idea follows).
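A rough sketch of what "estimate lag on start" could look like, e.g. as a startup probe or init step, using only Typesense's regular document upsert/retrieve endpoints: write a heartbeat document through the cluster endpoint and only let the pod become ready once that document is readable from the local node, i.e. the node has caught up with recent writes. The URLs, the `healthcheck_heartbeats` collection and the timings are assumptions, not anything Typesense provides out of the box.

```python
# Sketch of a startup gate that waits until the local node has caught up:
# write a heartbeat document via the cluster endpoint, then poll the local
# node until the document is visible there. All names/URLs are placeholders;
# the healthcheck_heartbeats collection is assumed to be created beforehand.
import os
import sys
import time
import uuid

import requests

CLUSTER_URL = "http://typesense.default.svc:8108"   # load-balanced cluster endpoint (assumed)
LOCAL_URL = "http://localhost:8108"                  # this pod's Typesense node (assumed)
HEADERS = {"X-TYPESENSE-API-KEY": os.environ["TYPESENSE_API_KEY"]}
COLLECTION = "healthcheck_heartbeats"                # tiny helper collection (assumed)

def wait_until_caught_up(timeout_s: float = 600, poll_s: float = 2) -> bool:
    """Return True once a freshly written heartbeat is visible on the local node."""
    doc_id = str(uuid.uuid4())
    # Write through the cluster so the document goes through the leader's log.
    requests.post(
        f"{CLUSTER_URL}/collections/{COLLECTION}/documents?action=upsert",
        json={"id": doc_id, "ts": int(time.time())},
        headers=HEADERS,
        timeout=5,
    ).raise_for_status()

    deadline = time.time() + timeout_s
    while time.time() < deadline:
        # Read directly from the local node; once it returns the document,
        # replication has caught up past the point where we wrote it.
        resp = requests.get(
            f"{LOCAL_URL}/collections/{COLLECTION}/documents/{doc_id}",
            headers=HEADERS,
            timeout=5,
        )
        if resp.status_code == 200:
            return True
        time.sleep(poll_s)
    return False

if __name__ == "__main__":
    sys.exit(0 if wait_until_caught_up() else 1)
```

Exiting non-zero keeps the pod out of rotation until the node has actually caught up, instead of relying on the lag to breach the threshold later.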