Lane Goolsby
03/29/2023, 1:44 AM
"Are you sure it's able to handle 3 pods getting rotated out at the same time?"
Yes. On my local machine using Rancher Desktop with no disruption budget, I can delete one-to-n nodes at a time and the cluster seems to handle it gracefully for the most part. Things can get out of whack if you try hard enough, but you have to be a little intentional.

Currently the code just sleeps for a few seconds to give the cluster time to allocate all the pending pods and for each one to get an IP. I could easily extend it to wait for some sort of minimum pod count instead (see the sketch below), but for my purposes right now it's Good Enough™.

I'm still letting things marinate in my environment. There's still an edge case I'm tracking down; it's not related to pods recycling. My current suspicion is that it's network related. Our production environment has had the leaderless cluster issue 2, maybe 3, times in a year, but our lower environment has it on a nearly daily basis. I'm 99.5% confident it's not because the pods are recycling; it's something more fundamental than that (or else one of our sys admins has been playing a practical joke on me 😅).
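For reference, here's a minimal sketch of what the "wait for a minimum pod count" version could look like, using the Python kubernetes client. The namespace, label selector, and minimum count are placeholders for illustration, not the actual values from my setup:

```python
import time
from kubernetes import client, config


def wait_for_pod_ips(namespace="default", selector="app=my-cluster",
                     min_ready=3, timeout=120, interval=2):
    """Poll until at least `min_ready` pods matching the selector have an IP,
    instead of a fixed sleep. Assumes the code runs inside the cluster."""
    config.load_incluster_config()  # use config.load_kube_config() when running locally
    v1 = client.CoreV1Api()
    deadline = time.time() + timeout
    while time.time() < deadline:
        pods = v1.list_namespaced_pod(namespace, label_selector=selector)
        with_ip = [p for p in pods.items if p.status.pod_ip]
        if len(with_ip) >= min_ready:
            return [p.status.pod_ip for p in with_ip]
        time.sleep(interval)
    raise TimeoutError(f"fewer than {min_ready} pods got an IP within {timeout}s")
```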
"The new pod will get all IPs updated when it starts, but how does an old pod discover the new pod's IP?"
The config map itself is updated instantly, so it's purely a matter of how long it takes for the old pods to notice that the config map was updated. In my testing that's somewhere between 30-60 seconds (I haven't actually timed it, so that's a guesstimate). If there were some sort of listener for when the config map gets updated, that would make this whole thing nearly instantaneous (a rough sketch of one approach follows).
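Here's roughly what such a listener could look like, again with the Python kubernetes client, watching the config map through the API instead of waiting for the kubelet to re-sync the mounted volume. The config map name and data key are made up for illustration:

```python
from kubernetes import client, config, watch


def watch_peer_list(namespace="default", name="cluster-peers"):
    """Stream config map events and react to updates as soon as they happen,
    rather than waiting for the mounted file to refresh."""
    config.load_incluster_config()  # assumes in-cluster credentials
    v1 = client.CoreV1Api()
    w = watch.Watch()
    for event in w.stream(v1.list_namespaced_config_map, namespace,
                          field_selector=f"metadata.name={name}"):
        if event["type"] in ("ADDED", "MODIFIED"):
            cm = event["object"]                     # V1ConfigMap
            peers = (cm.data or {}).get("peers", "")  # hypothetical key holding pod IPs
            print("peer list updated:", peers)        # re-point the cluster client here
```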