# community-help
Hi, we're having trouble with the upgrade of our HA cluster. We're currently on the free support tier, but we hope the issues with our HA prod cluster warrant investigation.

Our configuration change has been processing for 2.5 hours. The last node to upgrade seemed stuck (its pending write queue was not dropping), so eventually, to try to speed things up, we deleted the collections we have on the cluster, hoping that would unstick the last node and let the change finish. That did cause the pending write monitoring to drop to zero, but the cluster still says a configuration change is in progress. I'm nervous about recreating the collections and loading them up again while this change is in progress.

Other important context:

1. During our nightly reload of the collections (via the 'import' endpoint and POST method), we received this error:
   a. Error:
      i. Converting circular structure to JSON --> starting at object with constructor 'TLSSocket' property '_httpMessage' -> object with constructor 'ClientRequest' property 'socket' closes the circle
         TypeError: Converting circular structure to JSON
             --> starting at object with constructor 'TLSSocket'
             |     property '_httpMessage' -> object with constructor 'ClientRequest'
             --- property 'socket' closes the circle
             at JSON.stringify ....
   b. After that, the system hit a second error, an unhandled exception, which caused our calling API to crash:
      i. Unhandled exception: write ECONNRESET
         Error: write ECONNRESET
             at WriteWrap.onWriteComplete [as oncomplete] (internal/stream_base_commons.js:94:16)
2. We believed this error was partly related to inadequate CPU resources on the cluster, so we initiated an upgrade. On the monitoring dashboard we could see the first 2 nodes being dropped, recreated, and backfilled with data. But the last node never began processing its pending write queue; that chart stayed pegged for that node.
   a. As noted above, we eventually decided to drop the collections on this cluster. That cleared the backlog of pending writes to the last node, but the admin panel still says a change is in progress.
3. After we had initiated the configuration change (but before dropping the collections), we noticed strange behavior when searching our collection. Since this morning's failure, the collection seemed corrupted, and filter_by did not work on any nested field.
   a. 99% of our documents contain a suppliers array with a suppliers.practiceID property set to 0. We found that we could only get documents to return if that filter_by was omitted, or if it was set to suppliers.practiceId:!=[0], which is the opposite of the behavior we expect from our data. Then we tried filtering by other nested fields and found that none of them worked properly.
4. This HA cluster is our prod environment. We just finished development work and released this week. We decided to spend a week smoke testing before switching it on for users in our app.
   a. While we aren't letting users search using Typesense yet, we are now pushing near-real-time updates throughout the day and doing a nightly full refresh. It seems our near-real-time updates during the day used up our 7-hour CPU burst and left the nightly refresh starved for resources. That was our working theory in deciding to upgrade.

I can provide the cluster identifier if that would help.
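For reference on the error in 1.a: a "Converting circular structure to JSON" message that names 'TLSSocket' and 'ClientRequest' is what Node.js reports when an object holding the underlying socket and request is passed to JSON.stringify, which is easy to do when logging an HTTP client error. Below is a minimal sketch of how a caller might reduce such an error to plain fields before logging it. The helper name `summarizeError` and the field names (`code`, `response.status`) are hypothetical and assume an axios-style error shape; this is not our actual code and not part of any Typesense client.

```typescript
// Hypothetical helper: reduce an unknown thrown value to plain, JSON-safe fields.
// The field names (code, response.status) assume an axios-style error shape.
interface ErrorSummary {
  name: string;
  message: string;
  code?: string;
  status?: number;
}

function summarizeError(err: unknown): ErrorSummary {
  if (typeof err !== "object" || err === null) {
    return { name: "NonError", message: String(err) };
  }
  const e = err as {
    name?: unknown;
    message?: unknown;
    code?: unknown;
    response?: { status?: unknown };
  };
  const summary: ErrorSummary = {
    name: typeof e.name === "string" ? e.name : "Error",
    message: typeof e.message === "string" ? e.message : "",
  };
  // Copy only primitive fields; never the request or socket objects.
  if (typeof e.code === "string") summary.code = e.code;
  if (typeof e.response?.status === "number") summary.status = e.response.status;
  return summary;
}

// Simulated failure: an error that references a socket <-> request cycle,
// similar in shape to the TLSSocket / ClientRequest cycle in the report above.
const fakeSocket: Record<string, unknown> = {};
const fakeRequest = { socket: fakeSocket };
fakeSocket["_httpMessage"] = fakeRequest;
const failure = Object.assign(new Error("write ECONNRESET"), {
  code: "ECONNRESET",
  request: fakeRequest,
});

// JSON.stringify(failure) would throw here because of the cycle;
// the summary contains only strings and numbers, so it serializes safely.
const logLine = JSON.stringify(summarizeError(failure));
console.log(logLine);
```

Logging only a summary like this would not fix the underlying connection reset, but it should keep the original ECONNRESET visible instead of masking it with a second TypeError from the logger.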