# community-help
s
We're having trouble with our HA nodes getting out of sync again. I get different results for the same query when I hit 1 of our nodes, even though the documents look the same. This was happening before, and it's still happening after the rolling restart I mentioned in an old thread. I think my next step would be a nightly refresh that pauses our incremental updates and does batched imports with upsert. But I'm concerned about what went wrong to put us in this state, how to monitor for it next time (so a customer doesn't have to report it), and how to reliably fix it if it occurs. Looking at our cluster metrics for the last 7 days, we have no concerning resource spikes.
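A minimal sketch of that batched import-with-upsert refresh, assuming the documents are exported as JSONL; the hostname, export file name, batch size, and API key are placeholders:
export TYPESENSE_API_KEY='xyz'   # placeholder key
# Re-import in batches with upsert: existing documents are overwritten, new ones created.
curl -X POST \
  "https://hostname.a1.typesense.net/collections/products_prod/documents/import?action=upsert&batch_size=200" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  --data-binary @products_export.jsonl   # hypothetical export file, one JSON document per line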
j
It's the same issue as last time. A JOIN-related change in v28.0 is putting one of the nodes in your cluster into a state where the write queue is stuck on just that node due to a deadlock. We're very close to releasing a patch for this, and then it should no longer be an issue. In the meantime, you could periodically query the GET /collections endpoint on each of the individual hostnames and look at the num_documents field in the response to verify the state of the nodes.
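A minimal sketch of that periodic check, assuming jq is available; the per-node hostnames and API key are placeholders:
# Query GET /collections on each individual node and print num_documents per collection.
for node in hostname-1.a1.typesense.net hostname-2.a1.typesense.net hostname-3.a1.typesense.net; do
  echo "== ${node}"
  curl -s "https://${node}/collections" \
    -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
    | jq -r '.[] | "\(.name): \(.num_documents)"'
done
# If the counts diverge between nodes, one of them has fallen behind.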
s
How do we get it undeadlocked? Is a rolling restart the only option? That doesn't seem to have resolved it today.
j
Hmm, if a restart didn't fix it, then this might be something else. Could you share a curl request (minus your API key and hostname) that returns different results?
s
I was wrong, our num_documents matches now and we get the same results from the 3 nodes. The issue we still see is that some documents seem to be corrupted. These 2 cURLs should return the same document, but the one that includes the catalogIds filter returns zero hits. When looking at the document, the catalogId is present. And if we write a new version of the same document with some additional data in the catalogIds array, then both versions of the cURL return the document.
curl --location 'https://hostname.a1.typesense.net/collections/products_prod/documents/search?q=*&filter_by=productId%3A%3D319952' \
curl --location 'https://hostname.a1.typesense.net/collections/products_prod/documents/search?q=*&filter_by=catalogIds%3A37480&filter_by=productId%3A%3D319952' \
A redacted version of the document is attached. You can see the catalogId is in the array. None of the items in the array work with filter_by until we write a new document that adds some new value to the array; then it works. Updating other parts of the document doesn't help. We have to add some new value to the catalogIds array to get Typesense to acknowledge any of the existing values in it.
The fact that importing the document again, or even updating it with new data not related to the catalogs, isn't fixing the problem makes me wonder if there is some internal index related to that catalog array which is corrupt, and only if we trigger an update to that structure do we resolve the issue. If that's the case, I was thinking of dropping and rebuilding the whole collection. Is there any investigation you could do against the collection in its current state that I would undermine if I dropped and re-created it? We are using an alias, so I could make a new clean collection with a new name and leave this defective one in place.
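A rough sketch of that alias swap, assuming products_prod is the alias name and using a placeholder name for the rebuilt collection; Typesense's aliases endpoint re-points an alias in a single call:
# After creating and loading the new collection, re-point the alias at it.
curl -X PUT "https://hostname.a1.typesense.net/aliases/products_prod" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"collection_name": "products_prod_rebuilt"}'
# Searches against the alias now hit the new collection; the defective one can be kept for investigation or dropped later.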
j
The second curl command you shared above has two filter_by parameters, which is invalid syntax.
👀 1
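A sketch of combining both conditions in a single filter_by, using curl's --data-urlencode to avoid hand-encoding; hostname and API key are placeholders, and the IDs are the ones from the earlier cURLs:
curl -G "https://hostname.a1.typesense.net/collections/products_prod/documents/search" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  --data-urlencode "q=*" \
  --data-urlencode "filter_by=productId:=319952 && catalogIds:=37480"
# -G sends the URL-encoded parameters as a GET query string.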
s
curl --location 'https://hostname.a1.typesense.net/collections/products_prod/documents/search?q=&per_page=32&page=1&query_by=barcodes%2Cskus%2Cmanufacturer.mfrNumbers%2Cdescription%2CsupplierDescriptions&filter_by=productId%3A%3D319083%20%26%26%20catalogIds%3A%3D37480' \
curl --location 'https://hostname.a1.typesense.net/collections/products_prod/documents/search?q=&per_page=32&page=1&query_by=barcodes%2Cskus%2Cmanufacturer.mfrNumbers%2Cdescription%2CsupplierDescriptions&filter_by=productId%3A%3D319083' \
Same result if I use a single filter_by. On the syntax of filter_by: I've never had a problem applying several filter_by clauses when testing in Postman, so I didn't realize that could be an issue. In our app, we generate a single filter_by.
We created 2 new collections (primary and joined data) and started loading them with data. Our load job finished, but 1 node was showing a large pending write count, and our RAM consumption indicated none of the data from the new collections was making it there. The admin portal wasn't even showing collections; it was like the whole cluster was stuck. I initiated a rolling restart, which is still in progress. The admin portal is responsive again, but nodes aren't consistently responding (probably because the rolling restart is ongoing). Do you have details on the patch that is in the works for deadlocking nodes with joins? Is there anything we can do to mitigate the situation?
j
We have the fix ready, we're now working on adding finishing touches to the test suite to make sure this issue doesn't resurface.
I have a feeling the issue you mentioned earlier might be related as well, so let's triage it after this patch is ready to go
> Is there anything we can do to mitigate the situation?
We're manually trying to resync your cluster right now
All nodes are back up now
We were also able to replicate the issue above - somehow only the leader node returns results; the follower nodes don't return results for the same query. Once we apply the patch, we'll check this issue again
👍 1
s
Thanks, I’ll check it out tonight and clean up the extra collections we don’t need.
Another odd behavior: I deleted the duplicate collections last night, but our RAM usage didn't decrease. This morning we were still close to max, even though we only had our 2 collections (primary and joined). So I initiated a rolling restart and thought that was freeing up the memory, but the usage continues to climb. Under normal circumstances our collections are around 2GB.
The first node to restart eventually ended up slightly lower than it was before, but still higher than normal.
Here is the 3-day graph. Our 2 collections are usually steady at 2GB. Yesterday I spun up new collections and then tried to drop the old ones last night. But even after dropping the old collections and performing a rolling restart, we're still very high on RAM. The rolling restart seems to have brought RAM down from 3.6GB to about 3.0GB, so that gives us a little more headroom, but not enough.
@Jason Bosco Now with no collections on the cluster, we still see ~700MB RAM usage.
j
> with no collections on the cluster, we still see ~700MB RAM usage
Typesense typically reserves some memory for anticipated future use and doesn't release all memory back to the OS. This is by design for performance reasons, so when you create a new collection and add some data, it will re-use previously reserved RAM. Your other observation about RAM growth is most likely a result of the JOIN issue which has been plaguing us, unfortunately. I wouldn't be surprised if the seemingly stuck write queue is just repeatedly trying to allocate a small amount of memory on each iteration. The fix should be out in the next week 🤞
s
Is the JOIN issue specific to HA clusters? If we switch to a single node should it be stable?
j
From the symptoms we've observed on a few clusters, it seems to only affect followers, yes. So yeah, it might be worth trying a single-node cluster in parallel
👍 1
s
The nodes on the admin page are periodically flashing unhealthy, then back to healthy. So yes, there is still something stuck.
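A small polling sketch to capture those health flaps outside the admin page, hitting each node's /health endpoint; the hostnames and interval are placeholders:
# Log each node's health response every 30 seconds to spot flapping.
while true; do
  for node in hostname-1.a1.typesense.net hostname-2.a1.typesense.net hostname-3.a1.typesense.net; do
    resp=$(curl -s --max-time 5 "https://${node}/health")
    echo "$(date -u '+%FT%TZ') ${node} ${resp:-no-response}"
  done
  sleep 30
done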
j
@Scott Nei We just published v29.0.rc10 which has a potential fix. Could you upgrade your cluster to this version and let me know if you still observe the other issues you reported subsequently?
g
Hi @Jason Bosco. My name is Guillermo, and I am @Scott Nei 's colleague at Method USA. Scott has taken some days off and will return next week. Meanwhile, as you know, we moved to a new cluster with only one instance in our production environment, hoping to avoid the inconsistency issues Scott reported on March 27. Unfortunately, yesterday we confirmed that we are experiencing the same behavior again: documents seem to be "corrupted". Not all of them, but probably thousands. Among all of them, by coincidence or not, the same document (ID) is failing again on this new collection. You can check it using the same cURL commands:
curl --location 'https://hostname.a1.typesense.net/collections/products_prod/documents/search?q=&per_page=32&page=1&query_by=barcodes%2Cskus%2Cmanufacturer.mfrNumbers%2Cdescription%2CsupplierDescriptions&filter_by=productId%3A%3D319083%20%26%26%20catalogIds%3A%3D37480' \
curl --location 'https://hostname.a1.typesense.net/collections/products_prod/documents/search?q=&per_page=32&page=1&query_by=barcodes%2Cskus%2Cmanufacturer.mfrNumbers%2Cdescription%2CsupplierDescriptions&filter_by=productId%3A%3D319083' \
I am pointing this out just because, if you were thinking the problem was limited to multi-instance clusters (only affecting the followers), it is not, which makes it harder for us to mitigate in our production environment (even our current daily full sync is not fixing this situation). We are thinking of creating new collections and loading all the documents; unfortunately, this is the only way we have found to solve this. So, we would like to know: 1) Is it worth keeping these "corrupted" collections to let you investigate this issue? If we do this, we'll consume more RAM in our production server, and we do not want to do this unnecessarily. 2) Could you share this situation internally to confirm you will be finding and fixing the root cause? We are extremely concerned since we are experiencing these issues in our prod env (we are not able to downgrade since we are using features included in v28), and they are affecting us seriously. Thanks for your help. Any suggestions will be welcome.
j
Hey Guillermo, both your single node clusters are still running v28.0. Could you upgrade to v29.0.rc12 (which is what has the fix) and then try again, and let me know if you still see the same issue?
g
Hi Jason! I will discuss this with the team. We are very restricted since this behavior has only happened in our PROD env. In any case, we'll keep an eye out for any new occurrence and share anything we find useful to identify the cause of this issue.
Hi Jason, we upgraded our prod cluster today to v29.0.rc12. We'll let you know if the reported issue appears again.
(Note: I do not know if this is normal or not, but I would like to mention that the node went from "Healthy" to "Unhealthy" and back to "Healthy" about a dozen times after the config was applied. Please check the logs for details if you do not find this normal.)
s
@Jason Bosco We haven't seen any data corruption on v29.0.rc12. We'll keep checking daily to see if it gets triggered this week. Also, I see there is an rc13 now. Is it safe or advisable to upgrade to that version? I'm assuming we should keep up with the latest release candidates as they drop?
j
Great to hear that the issue hasn't resurfaced. Most RC builds are safe to upgrade to as they come out, since we've addressed more issues on the latest RC than on previous GA / RC releases, especially if you're using new features
👍 1
j
Hey, chiming in also from v29.0.rc12. We run 4x3 servers (that's four Typesense servers, each replicated across 3 nodes) and have been having issues with syncing, as was noted here. We decided to run v29.0.rc12 and are also seeing the servers go from healthy to unhealthy quite a number of times, so we wanted to chime in. I can provide logs and whatever's needed to troubleshoot. FWIW, the servers were upgraded from v28 to v29.
I20250416 02:54:31.262137  7119 raft_server.cpp:1144] Timed snapshot succeeded!
I20250416 02:54:31.266991  7121 log.cpp:1150] log save_meta /searchdata/typesense-parker/data/state/log/log_meta first_log_index: 1757608 time: 4861
I20250416 02:54:31.457677  7052 raft_server.cpp:921] Dummy write to http://10.142.0.6:6108/health, status = 200, response = {"ok":true}
I20250416 02:54:31.457782  7052 raft_server.cpp:527] save_snapshot done
I20250416 02:54:41.727140  7005 raft_server.cpp:692] Term: 11, pending_queue: 0, last_index: 1757616, committed: 1757616, known_applied: 1757616, applying: 0, pending_writes: 0, queued_writes: 0, local_sequence: 44464916
I20250416 02:54:52.148834  7005 raft_server.cpp:692] Term: 11, pending_queue: 0, last_index: 1757616, committed: 1757616, known_applied: 1757616, applying: 0, pending_writes: 0, queued_writes: 0, local_sequence: 44464916
E20250416 02:54:59.571058  7005 http_client.cpp:231] CURL failed. Code: 56, strerror: Failure when receiving data from the peer, method: GET, url: http://10.142.0.6:6108/status
E20250416 02:54:59.571285  7005 raft_server.cpp:828] Error, /status end-point returned bad status code 500
Log file created at: 2025/04/16 02:55:00
Running on machine: xavier
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
I20250416 02:55:00.787528 29312 typesense_server_utils.cpp:353] Starting Typesense nightly
I20250416 02:55:00.787658 29312 typesense_server_utils.cpp:356] Typesense is using jemalloc.
Those logs caught my eye; the other 2 servers that also failed at the same time don't show any specific error.
Our hiccups might've been due to lack of disk space. For some reason the Typesense server just kept writing to the data folder. I ended up deleting the data folder and letting it sync "naturally", and there's currently over 300GB left
k
You can compact the DB by calling the db compaction API: https://typesense.org/docs/28.0/api/cluster-operations.html#compacting-the-on-disk-database After calling this API, disk space will decrease in a few hours.
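A minimal sketch of that compaction call against a single node, with placeholder hostname and API key; the endpoint is POST /operations/db/compact per the linked docs:
# Trigger on-disk database compaction; space is reclaimed gradually afterwards.
curl -X POST "https://hostname.a1.typesense.net/operations/db/compact" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"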
j
Thanks! I forgot that was an option. I'll do it on the other servers rather than brute-forcing them into compliance
This time, disk space definitely isn't an issue and one of the servers just restarted itself
I do see a ton of "Wrong reference stored for facet 0.14655 with facet_id 1041633584" log lines,
but this is during startup, not before it died