# community-help
s
Hi, we're having trouble with the upgrade of our HA cluster. We're currently on the free support tier, but the issues with our HA prod cluster hopefully warrant investigation.

Our configuration change has been processing for 2.5 hours. The last node to upgrade seemed stuck (the pending write queue was not dropping), and eventually, to try and speed things up, we deleted the collections we have on the cluster, hoping that would unstick the last node and get the change to finish. That did cause the pending write monitoring to drop to zero, but the cluster still says a configuration change is in progress. I'm nervous about recreating the collections and loading them up again while this change is in progress.

Other important context:

1. During our nightly reload of the collections (via the `import` endpoint and POST method), we received this error:
   - `TypeError: Converting circular structure to JSON --> starting at object with constructor 'TLSSocket' | property '_httpMessage' -> object with constructor 'ClientRequest' --- property 'socket' closes the circle at JSON.stringify ...`
   - After that, the system hit another error on an unhandled exception, which caused our calling API to crash: `Unhandled exception: write ECONNRESET Error: write ECONNRESET at WriteWrap.onWriteComplete [as oncomplete] (internal/stream_base_commons.js:94:16)`
2. We believed this error was partly related to inadequate CPU resources on the cluster, so we initiated an upgrade. On the monitoring dashboard we could see the first two nodes being dropped, recreated, and backfilled with data. But the last node never began processing the pending write queue, as that chart stayed pegged for that node.
   - As noted above, we eventually decided to drop the collections on this cluster. That cleared the backlog of pending writes to the last node, but the admin panel still says a change is in progress.
3. After we had initiated the configuration change (but before dropping the collections), we noticed strange behavior when searching our collection. Since this morning's failure, the collection seemed corrupted and no nested fields were working in `filter_by`.
   - 99% of our documents contain a `suppliers` array with a `suppliers.practiceId` property set to 0. We found that we could only get documents to return if that `filter_by` was omitted, or if it was set to `suppliers.practiceId:!=[0]`, which is the opposite of the behavior expected from our data. We then started trying to filter by other nested fields and found that none of them worked properly.
4. This HA cluster is our prod environment. We just finished development work and released this week. We decided to spend a week smoke testing before switching it on for users in our app.
   - While we aren't letting users search using Typesense yet, we are now pushing near-real-time updates throughout the day and doing a nightly full refresh. It seems our near-real-time updates through the day used up our 7-hour CPU burst and left the nightly refresh starved for resources. That was our working theory in deciding to upgrade.

I can provide the cluster identifier if that would help.
j
The amount of time it takes for a configuration change to run depends on the volume of data that is indexed in the cluster and the amount of writes that are coming in as the configuration change is in progress. As each node is rotated, we rebuild the in-memory indices from the last snapshot, and then replay the writes that came into the other nodes while the new node was rebuilding indices from the snapshot. In your case, the rebuilding of indices from the snapshot took less than 2 minutes, but your cluster was receiving a ton of single-document updates while the config change was happening. That is what the last node spent over 1.5 hours catching up on. So writes weren't actually stuck during that stage; the node was just trying to catch up while more writes were piling on. For your volume of writes, I would highly recommend using the bulk import endpoint instead of the single-document write endpoints, which will speed up writes by over an order of magnitude and prevent this issue from happening.
This section of the docs describes how to handle high-volume writes: https://typesense.org/docs/guide/syncing-data-into-typesense.html#high-volume-writes
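For reference, a minimal sketch of what a batched call to the bulk import endpoint could look like from Node (18+ for the global `fetch`). The `products` collection name, document shape, and environment variable names are placeholders rather than anything from this thread:

```typescript
// Sketch: send accumulated updates as one bulk import call instead of
// per-document writes. Collection name, doc shape, and env vars are placeholders.
interface ProductDoc {
  id: string;
  [field: string]: unknown;
}

const TS_URL = process.env.TYPESENSE_URL!; // e.g. https://xyz.a1.typesense.net
const TS_KEY = process.env.TYPESENSE_API_KEY!;

async function bulkImport(docs: ProductDoc[]): Promise<void> {
  if (docs.length === 0) return;

  // The import endpoint takes newline-delimited JSON: one document per line.
  // action can be create, upsert, update, or emplace.
  const body = docs.map((d) => JSON.stringify(d)).join("\n");
  const res = await fetch(
    `${TS_URL}/collections/products/documents/import?action=upsert`,
    {
      method: "POST",
      headers: { "X-TYPESENSE-API-KEY": TS_KEY, "Content-Type": "text/plain" },
      body,
    }
  );

  // The response is also JSONL: one {"success": true/false, ...} object per input line.
  const failures = (await res.text())
    .split("\n")
    .map((line) => JSON.parse(line))
    .filter((r) => !r.success);
  if (failures.length > 0) {
    console.error(`${failures.length} documents failed to import`, failures.slice(0, 5));
  }
}
```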
s
We do use the bulk import option for our nightly full load. We didn't implement it for our near-real-time data because it doesn't yet support a filter_by argument. We want to ensure that the update being processed isn't older than a "lastUpdated" timestamp we maintain on documents. We would prefer to do an upsert if possible. By processing them individually, we can attempt an update with filter_by on documentId and our lastUpdated timestamp. If that returns that no documents were updated, then we know to proceed with a POST to create the document. This is inefficient (because we make multiple calls instead of a single upsert, and because we make calls for each update, rather than in a batch). But is there another approach that uses bulk import and protects against race conditions in writing old data to a newer document?
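For illustration, roughly what that per-document flow looks like against the Typesense HTTP API. The `products` collection and the `documentId`/`lastUpdated` fields are this app's own naming, not Typesense built-ins, and the update-by-filter call assumes a Typesense version that supports `filter_by` on the documents PATCH endpoint:

```typescript
// Sketch of the per-document flow described above. Assumes Node 18+ for fetch.
interface AppDoc {
  id: string;
  documentId: string;
  lastUpdated: number; // epoch millis maintained by the app
  [field: string]: unknown;
}

const TS_URL = process.env.TYPESENSE_URL!;
const TS_KEY = process.env.TYPESENSE_API_KEY!;
const headers = { "X-TYPESENSE-API-KEY": TS_KEY, "Content-Type": "application/json" };

async function writeIfNewer(doc: AppDoc): Promise<void> {
  // 1. Conditional update-by-filter: only a stored document whose lastUpdated
  //    is older than the incoming one should match.
  const filter = encodeURIComponent(
    `documentId:=${doc.documentId} && lastUpdated:<${doc.lastUpdated}`
  );
  const { id, ...fields } = doc; // the update body carries fields only, not the id
  const patchRes = await fetch(
    `${TS_URL}/collections/products/documents?filter_by=${filter}`,
    { method: "PATCH", headers, body: JSON.stringify(fields) }
  );
  const { num_updated } = (await patchRes.json()) as { num_updated: number };

  // 2. Nothing matched: either the document doesn't exist yet, or a newer copy
  //    is already stored. Try to create it; a 409 means the newer copy won.
  if (num_updated === 0) {
    const createRes = await fetch(`${TS_URL}/collections/products/documents`, {
      method: "POST",
      headers,
      body: JSON.stringify(doc),
    });
    if (!createRes.ok && createRes.status !== 409) {
      throw new Error(`create failed: ${await createRes.text()}`);
    }
  }
}
```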
Separately from the loading strategy, do you have any insight into the initial issue we saw, or the bad state it left the collection in? Without access to any logs we're kind of stumped on what happened, but fairly concerned about it happening again. Our best guess is that lack of CPU caused our update to fail. But how that led to a corrupt collection is a mystery.
And thank you for explaining the sequence of events during the config change. That is very helpful to understand.
j
> We want to ensure that the update being processed isn't older than a "lastUpdated" timestamp we maintain on documents.
I would recommend doing this versioning logic on your side in a buffer table, so the final version is resolved in that buffer table, and then use the bulk import endpoint to import the finally resolved version of the docs into Typesense.
> We would prefer to do an upsert if possible.
The bulk import endpoint does support `action=upsert`.
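A sketch of that buffer-and-flush pattern, using an in-memory Map as a stand-in for the buffer table (a real setup would more likely be a database table or queue); the `products` collection and field names are placeholders:

```typescript
// Resolve versions in a buffer first, then ship the winners to Typesense in
// one bulk upsert. The Map stands in for a buffer table; names are placeholders.
interface VersionedDoc {
  id: string;
  lastUpdated: number; // the app's own version field
  [field: string]: unknown;
}

const TS_URL = process.env.TYPESENSE_URL!;
const TS_KEY = process.env.TYPESENSE_API_KEY!;
const buffer = new Map<string, VersionedDoc>();

// Incoming pub/sub updates land here; only the newest version per id survives,
// so stale updates are resolved before anything reaches Typesense.
function bufferUpdate(doc: VersionedDoc): void {
  const existing = buffer.get(doc.id);
  if (!existing || doc.lastUpdated > existing.lastUpdated) {
    buffer.set(doc.id, doc);
  }
}

// Flush on a schedule as a single bulk call with action=upsert.
async function flushBuffer(): Promise<void> {
  if (buffer.size === 0) return;
  const docs = [...buffer.values()];
  buffer.clear();

  const body = docs.map((d) => JSON.stringify(d)).join("\n");
  const res = await fetch(
    `${TS_URL}/collections/products/documents/import?action=upsert`,
    {
      method: "POST",
      headers: { "X-TYPESENSE-API-KEY": TS_KEY, "Content-Type": "text/plain" },
      body,
    }
  );
  if (!res.ok) {
    // On failure, put the docs back so the next flush retries them.
    docs.forEach(bufferUpdate);
  }
}

setInterval(() => flushBuffer().catch(console.error), 5_000);
```

Each flush is then one import call instead of many single-document writes, and old versions never overwrite newer ones because the conflict is resolved before the write leaves the buffer.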
re: the other inconsistent state you observed, once CPU / RAM resources are exhausted, behavior becomes very inconsistent and hard to explain. So it's best to make sure you have enough capacity to avoid this. What I suspect could have happened in your case is that, with CPU capacity exhausted, maybe the circuit breaker that limits query execution time kicked in and terminated queries before they completed. But again, once hardware resources are exhausted, even logs start becoming unreliable with no processing capacity available.
fyi - your cluster has been struggling to keep up with the volume of single document updates coming in for the same reasons I mentioned above. Switching to bulk updates using the method above is the only way to keep your cluster stable and within resource usage limits.
s
Thank you for the help @Jason Bosco. We found a bug in our pub/sub system for pushing incremental updates that was generating enormous artificial load. Our write volume is down now, and I think our cluster is stable. We were even able to scale back down to our previous size. I started a new thread, because we are still having trouble with loading the collections. It doesn't appear performance related though, so I didn't carry on here. But linking it anyway for reference. https://typesense-community.slack.com/archives/C01P749MET0/p1741794940982869