We're having trouble with our HA nodes getting out of sync again. I get different results for the same query depending on which of our nodes I hit, even though the documents look the same. This was happening before, and it is still happening after the rolling restart I mentioned in an old thread.
I think my next step would be to do a nightly refresh that pauses our incremental updates and runs batched imports with upsert. But I'm concerned about what went wrong to put us in this state, how to monitor for it next time (so a customer doesn't have to report it), and how to reliably fix it if it occurs.
Looking at our cluster metrics for the last 7 days, we have no concerning resource spikes.
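
For the monitoring question, one option would be a periodic consistency check that runs the same query against every node and compares the results. Below is a rough Python sketch of that idea, not a tested implementation. It assumes a hypothetical `run_query(node)` callable that returns the ordered list of document IDs a node gives back for a fixed canary query. The names `fingerprint`, `find_divergent_nodes`, and `nodes_in_sync` are illustrative only and not part of any search engine's API.

```python
import hashlib
import json
from typing import Callable, Dict, List


def fingerprint(doc_ids: List[str]) -> str:
    """Hash an ordered list of document IDs so results can be compared cheaply."""
    payload = json.dumps(doc_ids).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def find_divergent_nodes(
    nodes: List[str],
    run_query: Callable[[str], List[str]],  # hypothetical: wraps your client's search call
) -> Dict[str, List[str]]:
    """Run the same query on every node and group nodes by result fingerprint.

    More than one key in the returned mapping means the nodes disagree.
    """
    groups: Dict[str, List[str]] = {}
    for node in nodes:
        groups.setdefault(fingerprint(run_query(node)), []).append(node)
    return groups


def nodes_in_sync(groups: Dict[str, List[str]]) -> bool:
    """True when every node returned the same ordered result set."""
    return len(groups) <= 1
```

Run on a schedule with a handful of canary queries, a check like this could raise an alert whenever `nodes_in_sync` returns False, so the divergence is caught before a customer reports it. Because the fingerprint is order-sensitive, it also flags ranking differences between nodes even when the same documents are returned.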