# community-help
l
@Jason Bosco @Kishore Nallan I think I have come up with a solution for fixing the HA issues when running TS in k8s. https://github.com/typesense/typesense/issues/465 I started off with the sidecar idea but took a slightly different tack. Instead of each node having its own sidecar that writes to a mapped volume, I have an app that monitors the TS namespace and updates the config map when a pod recycles. This approach solves pretty much every scenario except when all pods get recycled at once, but I think we could fix that edge case as well with a minor tweak to the TS codebase. When a node comes up leaderless it doesn't appear to check the config map again. Adding some logic to recheck the config map when leaderless should solve that. If there's interest, I have tentative approval from my company to open source it.
🔥 1
👀 1
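(For readers following along: a minimal sketch of the watcher approach Lane describes, written here in Python with the official Kubernetes client purely for illustration. The namespace, label selector, ConfigMap name, `nodes` key, and ports are all assumptions; Lane's actual implementation is separate code, later described as using the .NET client.)

```python
# Illustrative sketch only: watch for pod churn in the Typesense namespace and
# rewrite the ConfigMap that holds the nodes list. All names/ports are assumed.
from kubernetes import client, config, watch

NAMESPACE = "typesense"
LABEL_SELECTOR = "app=typesense"
CONFIGMAP = "typesense-nodes"
PEERING_PORT, API_PORT = 8107, 8108


def current_nodes(v1):
    """Build the Typesense nodes string (ip:peering_port:api_port,...) from pod IPs."""
    pods = v1.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR)
    ips = sorted(p.status.pod_ip for p in pods.items if p.status.pod_ip)
    return ",".join(f"{ip}:{PEERING_PORT}:{API_PORT}" for ip in ips)


def main():
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    w = watch.Watch()
    # Each time a pod is added, deleted, or modified, publish the fresh IP list.
    for event in w.stream(v1.list_namespaced_pod, NAMESPACE, label_selector=LABEL_SELECTOR):
        if event["type"] in ("ADDED", "MODIFIED", "DELETED"):
            v1.patch_namespaced_config_map(
                CONFIGMAP, NAMESPACE, {"data": {"nodes": current_nodes(v1)}}
            )


if __name__ == "__main__":
    main()
```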
j
I’ll let Kishore speak to this
k
@Lane Goolsby
When a node comes up leaderless it doesn't appear to check the config map again. Adding some logic to recheck the config map when leaderless should solve that.
We do this because resetting of peers could lead to data loss, if, for example, the current leader has buffered the write but has not sent it to the followers yet. In this scenario, if the peers are force reset, then that buffered write could be lost. Let me see if I can find a safe way to handle this.
s
except when all pods get recycled at once
If it's configured as a StatefulSet and has a readiness probe, this should never happen. Unless there's a massive node failure, but that should be solved by having nodes across multiple AZs.
l
If it's configured as a StatefulSet and has a readiness probe, this should never happen. Unless there's a massive node failure, but that should be solved by having nodes across multiple AZs.
Yeah, that's how we have our cluster setup.
After playing around with the first attempt I mentioned, I came up with what I believe to be a better solution. Instead of a Watcher, I moved the logic into an InitContainer. The InitContainer does what the code did before, but does it before the TS pod starts up. This appears to handle both when all the pods are restarted at once and when individual pods are recycled. When a new pod comes up it takes about 30 seconds for the other TS pods to pick up its new values from the config map. Once that happens the new pod gets brought into the cluster cleanly.
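(Again a hedged sketch, not Lane's code: the same logic as a one-shot init step, run before the Typesense container starts. Names, ports, and the fixed sleep are assumptions for illustration.)

```python
# One-shot init-container sketch: give restarted peers a moment to get IPs,
# then publish the current IP list to the ConfigMap before Typesense starts.
# All names and ports here are assumptions, not the actual implementation.
import time

from kubernetes import client, config

NAMESPACE, LABEL_SELECTOR = "typesense", "app=typesense"
CONFIGMAP, PEERING_PORT, API_PORT = "typesense-nodes", 8107, 8108

config.load_incluster_config()
v1 = client.CoreV1Api()

time.sleep(10)  # crude: wait for the scheduler to assign IPs to peer pods

pods = v1.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR)
nodes = ",".join(
    f"{p.status.pod_ip}:{PEERING_PORT}:{API_PORT}"
    for p in pods.items
    if p.status.pod_ip
)
v1.patch_namespaced_config_map(CONFIGMAP, NAMESPACE, {"data": {"nodes": nodes}})
```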
k
Are you sure it's able to handle 3 pods getting rotated out at the same time (i.e. all of their IPs changing)? One of the problems that people have faced is that the old pod IPs are persisted in an internal state and it's not possible to recover without a force reset because the cluster still expects the pods to join back with the old IPs.
s
Also, I have a question about how that implementation handles pods rotating. The new pod will get all the updated IPs when it starts, but how does an old pod discover the new pod's IP?
k
Ok, I've figured out a way to make the cluster auto reset the peer list on error during startup (in the case when all pods are rotated and they come up with new IPs). Check the `typesense/typesense:0.25.0.rc18` Docker build. To enable this feature, you have to set the `--reset-peers-on-error` flag or set the `TYPESENSE_RESET_PEERS_ON_ERROR=TRUE` environment variable. Try it out and let me know if this works @Sergio Behrends @Lane Goolsby -- hopefully we can put this whole peer change issue behind us with this fix.
❤️ 1
l
Are you sure it's able to handle 3 pods getting rotated out at the same time
Yes, on my local machine using Rancher Desktop with no disruption budget I can delete one-to-n nodes at a time and the cluster seems to handle it gracefully for the most part. Things can get out of whack if you try hard enough, but you have to be a little intentional. Currently the code just sleeps for a few seconds to make sure the cluster has allocated all the pending pods long enough for each to get an IP. I could easily extend it to wait for some sort of minimum pod count instead, but for my purposes right now it's Good Enough™. I'm still letting things marinate in my environment. There's still an edge case I'm tracking down. It's not related to pods recycling; my current suspicion is that it's network related. Our production environment has had the leaderless cluster issue 2, maybe 3, times in a year. However, our lower environment has it on a nearly daily basis. I'm 99.5% confident it's not because the pods are recycling, it's something more fundamental than that (or else one of our sys admins has been playing a practical joke on me 😅).
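(A possible shape for the "wait for a minimum pod count" tweak Lane mentions, replacing the fixed sleep; a hypothetical helper using the same assumed names as the sketches above.)

```python
# Hypothetical replacement for the fixed sleep: poll until at least `minimum`
# pods matching the selector have been assigned an IP, or time out.
import time


def wait_for_pod_ips(v1, namespace, label_selector, minimum, timeout=120):
    deadline = time.time() + timeout
    ips = []
    while time.time() < deadline:
        pods = v1.list_namespaced_pod(namespace, label_selector=label_selector)
        ips = [p.status.pod_ip for p in pods.items if p.status.pod_ip]
        if len(ips) >= minimum:
            return ips
        time.sleep(2)
    raise TimeoutError(f"only {len(ips)} of {minimum} pods got IPs within {timeout}s")
```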
The new pod will get all the updated IPs when it starts, but how does an old pod discover the new pod's IP?
The config map is updated instantly across all pods. So it's purely a matter of how long it takes for the old pods to realize a config map update was done. In my testing that's somewhere between 30-60 seconds (I haven't actually timed it, so that's a guesstimate). If there were some sort of listener for when the config map gets updated, that would make this whole thing nearly instantaneous.
I should have prefaced that^ with the fact that I'm not testing writes in great depth. We may run +/-5 content crawls in a day. If we lose a little data because of a blip I don't really care. We're just indexing documentation, so if we're slightly out of date on a couple pages for a bit it's not a big deal.
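(And a sketch of the config map "listener" Lane wishes for: watching the ConfigMap object directly and rewriting a local nodes file as soon as it changes, rather than waiting for the kubelet to sync the mounted volume. This assumes the nodes file lives on a writable volume such as an emptyDir rather than the read-only ConfigMap mount; all names and paths are assumptions.)

```python
# Sketch: react to ConfigMap updates immediately via the watch API and rewrite
# the nodes file Typesense reads, instead of waiting for the kubelet sync.
from kubernetes import client, config, watch

NAMESPACE = "typesense"
CONFIGMAP = "typesense-nodes"
NODES_FILE = "/etc/typesense/nodes"  # assumed to be on a writable volume

config.load_incluster_config()
v1 = client.CoreV1Api()
w = watch.Watch()

for event in w.stream(
    v1.list_namespaced_config_map,
    NAMESPACE,
    field_selector=f"metadata.name={CONFIGMAP}",
):
    cm = event["object"]
    if cm.data and "nodes" in cm.data:
        with open(NODES_FILE, "w") as f:
            f.write(cm.data["nodes"])
```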
s
The config map is updated instantly across all pods.
True that! Sounds like a more on-demand solution which should work properly too! I'm curious, do you have that script to share? It uses the k8s API to update the config map, right?
l
I'm curious, do you have that script to share? It uses the k8s API to update the config map, right?
Correct, the code uses the .NET library for k8s. I am working with my company to open source the code. If there's interest I can press a bit harder.
👍 1
a
I would be interested as well! I'm evaluating Typesense for a search use case at my company, and k8s deployment is one of the open issues we need to figure out.
@Kishore Nallan
We do this because resetting of peers could lead to data loss, if, for example, the current leader has buffered the write but has not sent it to the followers yet. In this scenario, if the peers are force reset, then that buffered write could be lost.
When you say "current leader has buffered the write but has not sent it to the followers yet" - do you mean the write is not committed yet? In this case I think that's acceptable, since we should only expect writes to persist once committed, and any write API calls should not have returned success yet. Is that thinking correct?
auto reset the peer list on error during startup
Just want to make sure I know what you mean by this. Does this mean the node will pull the latest IP addresses from the node file (which it previously would only do if the cluster has quorum)? If so, I know you said here this was a potentially dangerous action that could lead to data loss. My thinking was that it should not be dangerous, since you could just start a new election and it should not be possible for committed data to get lost. Does that thinking make sense? Or if not, how did you get around the risk of data loss? Also, disclaimer: I'm still getting context on this problem and the existing workarounds, so apologies if these questions are at all naive or repetitive :)
k
There are a lot of nuances with Raft, some of which are also down to the specific implementation details of a given raft library. The warning about reset_peers is from the raft library we use (braft), but I think in a state where all nodes are restarting from scratch, calling this API should be safe because there are no ongoing writes in that state.
Does this mean the node will pull the latest IP addresses from the node file (which it previously would only do if the cluster has quorum)?
Correct.
👍 1
a
@Kishore Nallan Curious for opinions on another deployment approach my team is considering. We are considering creating one Service per Typesense node. Services have a stable IP address, so this way we won't have to deal with changing IP addresses. Does this approach sound feasible, or are we missing a possible drawback? We realize it would make autoscaling difficult, but I think we're fine with that tradeoff.
l
If your IPs are stable then there's no worry. You'll never have to deal with any of this. These problems are only because of us trying to get TS to work in k8s, where dynamic IPs are causing all sorts of merry chaos.
👍 1
k
I agree with the above 👍
👍 1
d
Hi @Kishore Nallan! Could you please share the tar.gz for 0.25.0.rc18?
k
tar.gz for the latest RC build:
https://dl.typesense.org/releases/0.25.0.rc20/typesense-server-0.25.0.rc20-linux-amd64.tar.gz
🫰 1