#community-help

TS Cluster Issues on Kubernetes and High CPU Usage

TLDR Lane is experiencing an issue with a TS cluster on Kubernetes: the cluster goes offline, and once a leader is elected it dumps InstallSnapshot messages for hours with high CPU usage. Kishore Nallan suggests it may be related to clustering issues on Kubernetes.

Mar 14, 2023
Lane
03:06 PM
We've had a problem with one of our TS clusters for a while now, and I'm not sure what's happening.

Every couple of days, sometimes every day, our cluster will go offline. I'm pretty sure the reason the cluster becomes leaderless is related to the HA issue with k8s that's been floating around for a while now.

The problem is that once a leader is assigned to the cluster, it will dump these messages for hours:
I20230314 15:02:03.030268   244 replicator.cpp:829] node default_group:10.212.140.222:8107:8108 send InstallSnapshotRequest to 10.212.139.219:8107:8108 term 244 last_included_term 242 last_included_index 145160 uri 
I20230314 15:02:03.030870   226 replicator.cpp:896] received InstallSnapshotResponse from default_group:10.212.139.219:8107:8108 last_included_index 145160 last_included_term 242 success.
I20230314 15:02:03.031605   229 replicator.cpp:829] node default_group:10.212.140.222:8107:8108 send InstallSnapshotRequest to 10.212.139.219:8107:8108 term 244 last_included_term 242 last_included_index 145160 uri 
I20230314 15:02:03.032146   244 replicator.cpp:896] received InstallSnapshotResponse from default_group:10.212.139.219:8107:8108 last_included_index 145160 last_included_term 242 success.
I20230314 15:02:03.032687   226 replicator.cpp:829] node default_group:10.212.140.222:8107:8108 send InstallSnapshotRequest to 10.212.139.219:8107:8108 term 244 last_included_term 242 last_included_index 145160 uri 
I20230314 15:02:03.033185   229 replicator.cpp:896] received InstallSnapshotResponse from default_group:10.212.139.219:8107:8108 last_included_index 145160 last_included_term 242 success.

Eventually they stop, but they eat up the CPU the whole time.

The messages are basically the same. The only thing that seems to change is the number at the end of the URI, which appears to be counting down. Here are three numbers I grabbed from the logs, roughly a minute apart:

-10046901230479
-10046900976735
-10046900924916

There is no way we are producing this many deltas. We do maybe one crawl a day in this environment, and we're only indexing a couple of thousand documents. This cluster is tiny compared to what TS can handle, so it shouldn't be working this hard to catch up the followers.

Any idea what's going on?
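
One way to pin down exactly when the cluster goes leaderless is to poll each node's /health or /debug endpoint and log whenever no node reports itself as the leader. The sketch below is only a sketch: the pod hostnames and API key are placeholders, and the /debug state mapping (1 = leader, 4 = follower) is how it is commonly described, so verify it against your Typesense version.

# Rough monitoring sketch; the hostnames, API key and state mapping are
# assumptions to adapt, not values taken from this thread.
import json
import time
import urllib.request

NODES = ["ts-0.ts.svc:8108", "ts-1.ts.svc:8108", "ts-2.ts.svc:8108"]  # placeholder DNS names
API_KEY = "xyz"  # placeholder admin API key

def node_state(node: str) -> str:
    req = urllib.request.Request(
        f"http://{node}/debug", headers={"X-TYPESENSE-API-KEY": API_KEY}
    )
    try:
        with urllib.request.urlopen(req, timeout=2) as resp:
            state = json.load(resp).get("state")
            return {1: "leader", 4: "follower"}.get(state, f"state={state}")
    except Exception as exc:  # node unreachable or not ready
        return f"error: {exc}"

while True:
    states = {node: node_state(node) for node in NODES}
    prefix = "no leader:" if "leader" not in states.values() else "ok:"
    print(time.strftime("%H:%M:%S"), prefix, states)
    time.sleep(10)
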
Kishore Nallan
03:15 PM
This issue is probably a reflection of something odd going wrong with clustering on Kubernetes. This is the reason we can't yet recommend running HA on Kubernetes.
Kishore Nallan
03:16 PM
I'm fairly sure it's not a generic issue, because we have thousands of live HA clusters, both on Typesense Cloud and off it (self-hosted environments), and we have never seen this type of issue (with the logs you have posted) before.
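
For readers unfamiliar with the host:port:port triples in the logs above: they are the peer entries from the nodes list each Typesense process is started with (the --nodes file, format host:peering_port:api_port, which matches the 8107:8108 pairs in the log lines). Below is a minimal sketch of building that value for a three-node StatefulSet on Kubernetes; the StatefulSet, headless service and namespace names are placeholders, not values from this thread.

# Hedged sketch: build the comma-separated --nodes value
# (host:peering_port:api_port) from stable StatefulSet DNS names.
# All names and the replica count are placeholders.
REPLICAS = 3
STATEFULSET = "typesense"       # placeholder
HEADLESS_SERVICE = "ts"         # placeholder
NAMESPACE = "search"            # placeholder
PEERING_PORT, API_PORT = 8107, 8108  # same ports as in the logs above

def nodes_value() -> str:
    hosts = (
        f"{STATEFULSET}-{i}.{HEADLESS_SERVICE}.{NAMESPACE}.svc.cluster.local"
        for i in range(REPLICAS)
    )
    return ",".join(f"{h}:{PEERING_PORT}:{API_PORT}" for h in hosts)

if __name__ == "__main__":
    # Write the output to the file passed via --nodes.
    print(nodes_value())

Stable DNS names from a headless Service are often preferred over raw pod IPs here, since pod IPs change across restarts; check the Typesense documentation for the recommendation that applies to your version.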