# community-help
l
We've had a problem with one of our TS clusters that's been going on for a while, and I'm not sure what's causing it. Every couple of days, sometimes every day, the cluster goes offline. I'm fairly sure the cluster becomes leaderless because of the HA-on-Kubernetes issue that's been floating around for a while now. The problem is that once a leader gets assigned to the cluster again, the leader dumps these messages for hours:
```
I20230314 15:02:03.030268   244 replicator.cpp:829] node default_group:10.212.140.222:8107:8108 send InstallSnapshotRequest to 10.212.139.219:8107:8108 term 244 last_included_term 242 last_included_index 145160 uri <remote://10.212.140.222:8107/-10046900924925>
I20230314 15:02:03.030870   226 replicator.cpp:896] received InstallSnapshotResponse from default_group:10.212.139.219:8107:8108 last_included_index 145160 last_included_term 242 success.
I20230314 15:02:03.031605   229 replicator.cpp:829] node default_group:10.212.140.222:8107:8108 send InstallSnapshotRequest to 10.212.139.219:8107:8108 term 244 last_included_term 242 last_included_index 145160 uri <remote://10.212.140.222:8107/-10046900924924>
I20230314 15:02:03.032146   244 replicator.cpp:896] received InstallSnapshotResponse from default_group:10.212.139.219:8107:8108 last_included_index 145160 last_included_term 242 success.
I20230314 15:02:03.032687   226 replicator.cpp:829] node default_group:10.212.140.222:8107:8108 send InstallSnapshotRequest to 10.212.139.219:8107:8108 term 244 last_included_term 242 last_included_index 145160 uri <remote://10.212.140.222:8107/-10046900924923>
I20230314 15:02:03.033185   229 replicator.cpp:896] received InstallSnapshotResponse from default_group:10.212.139.219:8107:8108 last_included_index 145160 last_included_term 242 success.
```
Eventually the messages stop, but they eat up the CPU the whole time. They're basically identical; the only thing that changes is the number at the end of the URI, which seems to be counting down. Here are three values I grabbed from the logs roughly a minute apart: -10046901230479, -10046900976735, -10046900924916. There's no way we're producing that many deltas. We do maybe one crawl a day in this environment and only index a couple thousand documents. This cluster is tiny compared to what TS can handle, so it shouldn't be working this hard to catch the followers up. Any idea what's going on?
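For reference, this is roughly how I'm pulling those trailing numbers out of the leader's log to see how fast they're moving. The log path is a placeholder and the regex only assumes the exact line format shown above, so treat it as a rough sketch:

```python
import re
from datetime import datetime

# Matches the InstallSnapshotRequest lines (replicator.cpp:829) and captures
# the timestamp plus the trailing counter inside the remote:// URI.
PATTERN = re.compile(
    r"^I(\d{8} \d{2}:\d{2}:\d{2}\.\d+).*replicator\.cpp:829\].*"
    r"uri <remote://[^/]+/(-?\d+)>"
)

def extract(log_path):
    """Yield (timestamp, counter) pairs for every snapshot-request log line."""
    with open(log_path) as f:
        for line in f:
            m = PATTERN.search(line)
            if m:
                ts = datetime.strptime(m.group(1), "%Y%m%d %H:%M:%S.%f")
                yield ts, int(m.group(2))

if __name__ == "__main__":
    samples = list(extract("typesense.log"))  # placeholder path
    if len(samples) >= 2:
        (t0, c0), (t1, c1) = samples[0], samples[-1]
        elapsed = (t1 - t0).total_seconds()
        print(f"{len(samples)} snapshot requests over {elapsed:.0f}s, "
              f"counter moved {c1 - c0:+d} ({(c1 - c0) / elapsed:.1f}/s)")
```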
k
This issue probably reflects something odd going wrong with clustering on Kubernetes, which is why we can't yet recommend running HA on Kubernetes.
I'm fairly sure it's not a generic issue: we have thousands of live HA clusters, both on Typesense Cloud and in self-hosted environments, and we've never seen this type of issue (with the logs you've posted) before.
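In the meantime, while you're digging, it can help to ask each node directly what state it thinks it's in by hitting the health and debug endpoints on the API port. Here's a rough sketch; the IPs are the two that appear in your log lines, the API key is a placeholder, and the numeric state codes (1 = leader, 4 = follower) come from the underlying Raft library, so double-check them against your version:

```python
import json
import urllib.request

# Node API endpoints, taken from the IPs in the log snippet above (API port 8108).
NODES = ["10.212.140.222:8108", "10.212.139.219:8108"]
API_KEY = "xyz"  # placeholder: your admin API key

def get(url, headers=None):
    req = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(req, timeout=3) as resp:
        return json.loads(resp.read())

for node in NODES:
    try:
        health = get(f"http://{node}/health")
        # /debug reports a numeric Raft state; 1 = leader, 4 = follower (double-check for your version).
        debug = get(f"http://{node}/debug", {"x-typesense-api-key": API_KEY})
        print(f"{node}: health={health} state={debug.get('state')} version={debug.get('version')}")
    except Exception as e:
        print(f"{node}: unreachable ({e})")
```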