TS Cluster Issues on Kubernetes and High CPU Usage
TLDR Lane is experiencing an issue with Typesense clusters on Kubernetes: the cluster goes offline, and once a new leader is elected it dumps InstallSnapshot messages for hours while consuming high CPU. Kishore Nallan suggests it may be related to known clustering issues on Kubernetes.
Mar 14, 2023
Lane
03:06 PM
Every couple of days, sometimes every day, our cluster goes offline. I'm pretty sure the cluster becomes leaderless because of the HA issue with k8s that's been floating around for a while now.
The problem is that once a leader is assigned to the cluster, it will dump these messages for hours:
I20230314 15:02:03.030268 244 replicator.cpp:829] node default_group:10.212.140.222:8107:8108 send InstallSnapshotRequest to 10.212.139.219:8107:8108 term 244 last_included_term 242 last_included_index 145160 uri
I20230314 15:02:03.030870 226 replicator.cpp:896] received InstallSnapshotResponse from default_group:10.212.139.219:8107:8108 last_included_index 145160 last_included_term 242 success.
I20230314 15:02:03.031605 229 replicator.cpp:829] node default_group:10.212.140.222:8107:8108 send InstallSnapshotRequest to 10.212.139.219:8107:8108 term 244 last_included_term 242 last_included_index 145160 uri
I20230314 15:02:03.032146 244 replicator.cpp:896] received InstallSnapshotResponse from default_group:10.212.139.219:8107:8108 last_included_index 145160 last_included_term 242 success.
I20230314 15:02:03.032687 226 replicator.cpp:829] node default_group:10.212.140.222:8107:8108 send InstallSnapshotRequest to 10.212.139.219:8107:8108 term 244 last_included_term 242 last_included_index 145160 uri
I20230314 15:02:03.033185 229 replicator.cpp:896] received InstallSnapshotResponse from default_group:10.212.139.219:8107:8108 last_included_index 145160 last_included_term 242 success.
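The timestamps on those lines are only about a millisecond apart, so the leader is firing InstallSnapshotRequests hundreds of times per second, which lines up with the CPU burn. As a rough sketch, the loop rate can be measured from a saved log like this (the glog timestamp format matches the lines above; the log file path is hypothetical):

import re
from datetime import datetime

# Match glog-style lines such as:
# I20230314 15:02:03.030268 244 replicator.cpp:829] ... send InstallSnapshotRequest ...
LINE_RE = re.compile(r"^I(\d{8}) (\d{2}:\d{2}:\d{2}\.\d{6}).*send InstallSnapshotRequest")

timestamps = []
with open("typesense.log") as f:  # hypothetical path to the leader's log
    for line in f:
        m = LINE_RE.match(line)
        if m:
            timestamps.append(datetime.strptime(f"{m.group(1)} {m.group(2)}",
                                                "%Y%m%d %H:%M:%S.%f"))

if len(timestamps) > 1:
    span = (timestamps[-1] - timestamps[0]).total_seconds()
    print(f"{len(timestamps)} snapshot requests over {span:.1f}s "
          f"(~{len(timestamps) / span:.0f}/s)")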
Eventually they stop, but they eat up the CPU the whole time.
The messages are basically identical; the only thing that changes is the number at the end of the uri, which seems to be counting down. Here are three values I grabbed from the logs, roughly a minute apart:
-10046901230479
-10046900976735
-10046900924916
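A back-of-the-envelope check on those samples (a sketch, assuming they really were taken about a minute apart and the counter keeps shrinking at the observed pace):

# Magnitudes of the three logged values, sampled roughly one minute apart
values = [10046901230479, 10046900976735, 10046900924916]
deltas = [a - b for a, b in zip(values, values[1:])]
print(deltas)  # decrements of ~253,744 and ~51,819 per minute

# Even at the faster observed rate, counting down from ~1e13 takes decades
minutes_left = values[-1] / max(deltas)
print(f"~{minutes_left / (60 * 24 * 365):.0f} years to reach zero at this rate")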
There is no way we are producing that many deltas. We do maybe one crawl a day in this environment, and we're only indexing a couple thousand documents. This cluster is tiny compared to what Typesense can handle, so it shouldn't be working this hard to catch the followers up.
Any idea on what's going on?
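A quick way to see which node, if any, currently considers itself the leader is to poll each pod's /debug endpoint; it returns a state field where, per the Typesense docs, 1 means leader and 4 means follower. A minimal sketch, with hypothetical pod IPs and a placeholder API key:

import json
import urllib.request

NODES = ["10.212.140.222", "10.212.139.219", "10.212.139.220"]  # hypothetical pod IPs
API_KEY = "xyz"  # placeholder admin API key

for host in NODES:
    req = urllib.request.Request(
        f"http://{host}:8108/debug",  # 8108 is the API port; 8107 is peering
        headers={"X-TYPESENSE-API-KEY": API_KEY},
    )
    try:
        with urllib.request.urlopen(req, timeout=2) as resp:
            info = json.load(resp)
        print(host, "state:", info.get("state"), "version:", info.get("version"))
    except Exception as exc:
        print(host, "unreachable:", exc)

If no node reports state 1, the cluster is leaderless and quorum needs to be restored before the snapshot traffic can settle.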
Kishore Nallan
03:15 PM
Kishore Nallan
03:16 PM
Similar Threads
Debugging and Recovery of a Stuck Typesense Cluster
Charlie had a wedged staging cluster. Jason provided debugging and recovery steps, and Adrian helped with more insights. The issue turned out to be insufficient disk space; once Adrian increased the disk size, the cluster healed itself.
Typesense Node Stuck in Segfault Loop After Stress Test
Adrian encountered a segfault loop when stress testing a Typesense cluster. Kishore Nallan recommended trying a newer RC build and suggested potential issues with hostname resolution.
Issues with Typesense and k8s Snapshot Restoration
Arnob experienced data loss and errors with Typesense in k8s. Kishore Nallan explained that the corruption could come from premature pod termination and suggested deleting the data directory on the malfunctioning pod so it restores automatically from the leader.