Hi team,
We're on v29rc14 and have encountered the data corruption issue again. This was previously raised in the 3 threads listed below, but we are starting a new thread since it was expected to be resolved in v29.
Could you please advise on the notes below and, more importantly, share any insight on your side as to the cause and why v29 isn't helping?
We have found a reliable way to detect when the issue is occurring, and something of a workaround for when it does. Here are the details:
1. This is happening today on our "ktgh072..." cluster for many documents (including documentId 4598733).
a. You can retrieve the document by its ID, but it is not returned when searching with other fields as filter_by parameters.
b. We have ~65k of these on that cluster.
c. To assist in troubleshooting, we have left that cluster as is. We spun up a new cluster to use for production, where we will follow the mitigation steps noted below. But for as long as it would be helpful to your team, we can leave this static cluster in place.
2. To detect these:
a. We do a collection export (of just the ID, to keep it smaller and faster) and count the results.
b. We query the collection and check the "out_of" property.
c. The difference between the two counts seems to indicate the documents that have become corrupted and are not searchable via normal filter_by.
3. To resolve this issue, we found 2 approaches:
a. We can drop the collection and rebuild it. This is a little more cumbersome because we have to recreate the collection and update our aliases.
b. We can truncate the collection, and reload it. This is our current approach.
4. We've resorted to checking the counts every hour, so we know when the problem occurs and can decide whether to trigger our mitigation plan.
a. Our current load process doesn't support loading a new collection and swapping aliases. We are considering taking on the work to mitigate the issue without downtime.
5. Now that we know how to check for the issue and are monitoring, we'll let you know if we see any pattern in the records that are affected or how quickly it builds up.
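For reference, the count comparison in step 2 could be sketched roughly as follows in Python. This is illustrative only, not the script running in production. The endpoint paths, the include_fields and q=* parameters, and the X-TYPESENSE-API-KEY header are assumptions about the HTTP API and may need adjusting. base_url, collection and api_key are placeholders, and the helper names are hypothetical.

```python
import json
import urllib.request


def count_export_lines(jsonl: str) -> int:
    """Count documents in a JSONL export (one document per non-empty line)."""
    return sum(1 for line in jsonl.splitlines() if line.strip())


def out_of_from_search(body: str) -> int:
    """Read the "out_of" property from a search response body."""
    return int(json.loads(body)["out_of"])


def http_get(url: str, api_key: str) -> str:
    """Plain GET with the API key header (header name is an assumption)."""
    req = urllib.request.Request(url, headers={"X-TYPESENSE-API-KEY": api_key})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


def count_gap(base_url: str, collection: str, api_key: str, get=http_get) -> int:
    """Return (exported document count) minus (search "out_of").

    A non-zero result is the signal described in step 2c. The URL shapes
    below are assumptions, not confirmed against v29.
    """
    docs = f"{base_url}/collections/{collection}/documents"
    exported = count_export_lines(get(f"{docs}/export?include_fields=id", api_key))
    out_of = out_of_from_search(get(f"{docs}/search?q=*&per_page=1", api_key))
    return exported - out_of
```

An hourly job could call count_gap(...) and raise an alert whenever the result is non-zero. The `get` parameter is there so the comparison logic can be exercised without a live cluster.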
Thread 1
Thread 2
Thread 3