#community-help

Troubleshooting Stalled Writes in a Typesense Instance

TLDR: Robert's Typesense instances kept getting stuck after attempting to import documents. Kishore Nallan provided suggestions and shipped a build with additional logging to diagnose the issue. Robert identified two queries causing the stalls, but the issues had not been fully resolved yet.

Oct 24, 2022 (12 months ago)
Robert
11:03 PM
I have a Typesense instance, with no search traffic, that I attempted to import some documents into. The import operation froze up, and the instance is now stuck in a state with 'queued_writes: 1'.
Robert
11:03 PM
I can perform searches from the instance, but the write is stuck and I cannot write anything else to it. I've left it in this state for about 8 hours, no change. /health reports ok:true
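For reference, a minimal sketch of the calls involved here, assuming a collection named docs, the default port 8108, action=create, and an API key in $TYPESENSE_API_KEY (all placeholders, not Robert's actual setup):

# Health check: still returns {"ok":true} even while writes are stalled
curl -s http://127.0.0.1:8108/health

# The kind of bulk import that hangs: JSONL documents posted to the import endpoint
curl -s -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" -X POST \
     "http://127.0.0.1:8108/collections/docs/documents/import?action=create" \
     --data-binary @documents.jsonl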
Robert
11:03 PM
Log snippet:
I20221024 14:01:28.460819 36413 batched_indexer.cpp:250] Running GC for aborted requests, req map size: 0
I20221024 14:01:28.501175 36412 raft_server.cpp:534] Term: 2, last_index index: 1014724, committed_index: 1014724, known_applied_index: 1014724, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 303202476
I20221024 14:01:28.501262 36432 raft_server.h:60] Peer refresh succeeded!
I20221024 14:01:38.501965 36412 raft_server.cpp:534] Term: 2, last_index index: 1014724, committed_index: 1014724, known_applied_index: 1014724, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 303202476
I20221024 14:01:38.502049 36592 raft_server.h:60] Peer refresh succeeded!
Robert
11:03 PM
To troubleshoot, I brought up a brand new typesense instance with an empty data dir, set the same schema and I was able to import the documents just fine.
Robert
11:03 PM
Next, I made a copy of the data directory of the stuck instance (rsync -avH to handle the hard links properly) and then started up a new typesense process against this data copy. It didn't have any queued writes on startup, so I attempted the document import again. It worked fine.
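A rough sketch of that procedure, with a placeholder copy path, ports, and key rather than the actual ones:

# Copy the data dir, preserving hard links (-H), then point a second process at the copy
rsync -avH /mnt/data/typesense0/ /mnt/data/typesense0-copy/

typesense-server --data-dir=/mnt/data/typesense0-copy \
    --api-key=$TYPESENSE_API_KEY \
    --api-address=127.0.0.1 --api-port=14201 \
    --peering-address=127.0.0.1 --peering-port=14200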
Robert
11:03 PM
So I attempted to stop the original stuck typesense instance and upon Ctrl-C it just outputs:
Robert
11:03 PM
I20221024 18:56:27.481899 36412 raft_server.cpp:778] Waiting for in-flight writes to finish...
I20221024 18:56:27.481905 36412 raft_server.cpp:780] pending_writes: 1
I20221024 18:56:28.482007 36412 raft_server.cpp:780] pending_writes: 1
I20221024 18:56:29.482134 36412 raft_server.cpp:780] pending_writes: 1
Robert
11:04 PM
I had to use kill -9 to terminate it.
Robert
11:04 PM
What exactly does 'in-flight writes' mean?
Robert
11:04 PM
What could cause one to get stuck this way?
Robert
11:04 PM
How might I avoid it getting stuck in the future?
Oct 25, 2022 (12 months ago)
Kishore Nallan
01:51 AM
I think there might be an issue that caused the write thread to crash. We have identified and fixed a few edge cases in 0.24 RC which guarantees that this type of bad data issue cannot stall the write thread.
Robert
12:39 PM
Ahh, great news. Thank you! I’ll hold off on updating for now and see if I can muddle through until 0.24 releases. Any rough ETA on a 0.24 release? Also, when I do update, is it preferred that I start with a fresh DB and re-import the documents, or will it be safe to use existing data files from 0.23?
Kishore Nallan
12:40 PM
It will be safe to use the 0.23 data: we strive very hard to offer a seamless upgrade.

We are entering a period of code freeze for the 0.24 branch. So 2-3 weeks, barring any new major bugs from existing features.

Robert
08:17 PM
Ugh, it got stuck again, this time with 2 queued writes. Think I'm gonna have to switch over to a 0.24 RC to see if that resolves the problem. I'm looking at https://hub.docker.com/r/typesense/typesense/tags to find the latest RC, should I be using 0.24.0.rcn25 or 0.24.0.rc29 or a different one? Want to make sure I get the one that has the fixes you mentioned.
Robert
08:28 PM
Ahh, did some searching, looks like I should use the latest rcn.
Oct 26, 2022 (12 months ago)
Kishore Nallan
12:54 AM
Latest stable is 0.24.0.rcn25
Oct 27, 2022 (12 months ago)
Robert
07:29 PM
Sad news, this 'stuck' state is happening in 0.24.0.rcn25 as well 😞
Robert
07:30 PM
Order of events:
1. The instance was serving search traffic
2. I stopped sending search traffic to it entirely
3. I attempted to import documents
4. The server gets stuck with 1 queued_writes and the documents don't import; it just hangs
Robert
07:30 PM
Here is a log snippet:
I20221027 15:19:08.324472 39561 batched_indexer.cpp:264] Running GC for aborted requests, req map size: 0
I20221027 15:19:16.795174 39560 raft_server.cpp:540] Term: 5, last_index index: 1028357, committed_index: 1028357, known_applied_index: 1028357, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 309744913
I20221027 15:19:16.795248 39850 raft_server.h:60] Peer refresh succeeded!
I20221027 15:19:26.795794 39560 raft_server.cpp:540] Term: 5, last_index index: 1028357, committed_index: 1028357, known_applied_index: 1028357, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 309744913
I20221027 15:19:26.795842 39811 raft_server.h:60] Peer refresh succeeded!
I20221027 15:19:36.796442 39560 raft_server.cpp:540] Term: 5, last_index index: 1028357, committed_index: 1028357, known_applied_index: 1028357, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 309744913
I20221027 15:19:36.796504 39831 raft_server.h:60] Peer refresh succeeded!
I20221027 15:19:46.797086 39560 raft_server.cpp:540] Term: 5, last_index index: 1028357, committed_index: 1028357, known_applied_index: 1028357, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 309744913
I20221027 15:19:46.797152 39740 raft_server.h:60] Peer refresh succeeded!
I20221027 15:19:56.797785 39560 raft_server.cpp:540] Term: 5, last_index index: 1028357, committed_index: 1028357, known_applied_index: 1028357, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 309744913
I20221027 15:19:56.797845 39846 raft_server.h:60] Peer refresh succeeded!
I20221027 15:20:06.798342 39560 raft_server.cpp:540] Term: 5, last_index index: 1028357, committed_index: 1028357, known_applied_index: 1028357, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 309744913
I20221027 15:20:06.798398 39804 raft_server.h:60] Peer refresh succeeded!
I20221027 15:20:09.327406 39561 batched_indexer.cpp:264] Running GC for aborted requests, req map size: 0
I20221027 15:20:16.798827 39560 raft_server.cpp:540] Term: 5, last_index index: 1028357, committed_index: 1028357, known_applied_index: 1028357, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 309744913
I20221027 15:20:16.798883 39757 raft_server.h:60] Peer refresh succeeded!
I20221027 15:20:26.799475 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:20:26.799559 39788 raft_server.h:60] Peer refresh succeeded!
I20221027 15:20:36.800021 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:20:36.800071 39653 raft_server.h:60] Peer refresh succeeded!
I20221027 15:20:46.800534 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:20:46.800592 39588 raft_server.h:60] Peer refresh succeeded!
I20221027 15:20:56.801030 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:20:56.801095 39781 raft_server.h:60] Peer refresh succeeded!
I20221027 15:21:06.801538 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:21:06.801602 39657 raft_server.h:60] Peer refresh succeeded!
I20221027 15:21:10.330410 39561 batched_indexer.cpp:264] Running GC for aborted requests, req map size: 1
I20221027 15:21:16.802002 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:21:16.802057 39735 raft_server.h:60] Peer refresh succeeded!
I20221027 15:21:26.802635 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:21:26.802727 39691 raft_server.h:60] Peer refresh succeeded!
I20221027 15:21:36.803251 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:21:36.803301 39681 raft_server.h:60] Peer refresh succeeded!
I20221027 15:21:46.803797 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:21:46.803884 39851 raft_server.h:60] Peer refresh succeeded!
I20221027 15:21:56.804364 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:21:56.804412 39655 raft_server.h:60] Peer refresh succeeded!
I20221027 15:22:06.804944 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:22:06.804996 39659 raft_server.h:60] Peer refresh succeeded!
I20221027 15:22:11.333583 39561 batched_indexer.cpp:264] Running GC for aborted requests, req map size: 1
Robert
07:30 PM
This queued_writes: 1 repeats forever; it's been about an hour so far. It's as if, after Typesense serves searches for hours and hours and then stops getting search traffic, it sometimes just doesn't want to accept imports?

Server launched with: typesense-server --data-dir=/mnt/data/typesense0 --api-port=14101 --peering-port=14100 --api-key=REDACTED --api-address=127.0.0.1 --peering-address=127.0.0.1 --snapshot-interval-seconds=86400 --num-documents-parallel-load=10000 --thread-pool-size=400

I imagine, like previous times, if I bring down the database and bring it back up, I'll be able to import the documents just fine. Sadly this is like a 3-hour process, so it's something I'd like to avoid as much as possible. Also, since this happens sporadically, I can't provide a test case.

Any idea how we can troubleshoot this further and figure out the cause/solution?
Is there additional logging that can be 'enabled' that will further explain what write is queued or why it's not progressing?
Or any other way to get additional information about its current 'stuck' state?
Oct 28, 2022 (12 months ago)
Kishore Nallan
12:53 AM
Ok let me work on getting you a build that has an endpoint that spits out the currently queued document. We can use that to see if the issue can be reproduced.
Kishore Nallan
12:54 AM
Do you know if the hang happens on the very first document? Or does it fail after some documents have been imported?
Kishore Nallan
12:56 AM
The other explanation could be that an ongoing search operation somehow has stolen the lock and is not releasing it. Can you post the output of the /stats endpoint?
Robert
02:28 AM
Wasn't able to test on the instance that was stuck; I had already restarted the Typesense process, and as expected all documents then imported just fine.
Funny enough though, my secondary typesense instance just froze up a few moments ago upon trying to import documents.
It locked up on the very first document it was attempting to import, so it didn't even get one in.
Calling curl http://127.0.0.1:8108/stats.json yields
{
  "delete_latency_ms": 0,
  "delete_requests_per_second": 0,
  "import_latency_ms": 0,
  "import_requests_per_second": 0,
  "latency_ms": {},
  "pending_write_batches": 1,
  "requests_per_second": {},
  "search_latency_ms": 0,
  "search_requests_per_second": 0,
  "total_requests_per_second": 0.0,
  "write_latency_ms": 0,
  "write_requests_per_second": 0
}

Again, this instance was just doing searches. I diverted search traffic away, waited 2 minutes for searches to finish up and then attempted the import.
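One way to watch for this state from the outside, sketched with a placeholder host/port/key and assuming jq is available; pending_write_batches staying above 0 indefinitely is the symptom described here:

# Poll /stats.json and print the number of pending write batches every 10s
while true; do
  curl -s -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
       http://127.0.0.1:8108/stats.json | jq '.pending_write_batches'
  sleep 10
done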
Robert
02:38 AM
Perhaps if /stats would also show what is holding any locks, preventing writes from taking place, that could be useful. Or any other additional logging as to why writes are not moving forward.
Robert
02:55 AM
I will also spend some time tomorrow to see if I can somehow get a reproducible test case.
Robert
02:35 PM
So I tried a little test case: just a couple hundred megs of data, re-played 42,000 searches captured from production, then attempted to write to the DB. Worked fine. Now I'm gonna test against the full dataset. This will take a day or two, as each time I load up the server it takes about 3 hours.
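A sketch of that replay loop, assuming the captured searches are stored one URL path plus query string per line in searches.txt (a hypothetical capture format, not Robert's actual tooling):

# Replay captured search requests against the test instance...
while read -r search; do
  curl -s -o /dev/null -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
       "http://127.0.0.1:8108${search}"
done < searches.txt
# ...then attempt an import (as in the earlier import call) and watch queued_writes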
Oct 29, 2022 (12 months ago)
Kishore Nallan
01:47 AM
Ok, I got tied up with an issue yesterday so I could not respond. I think our best bet is being able to reproduce the issue, which I think your replay should be able to do. Then I can send a build with more detailed logs which can highlight where it gets stuck. Must be an edge case because we have never had this type of issue show up on any of the thousands of nodes we run on Typesense cloud and this includes some large datasets as well.
Robert
02:08 AM
No worries about the delay, I really appreciate the help you’ve been giving! So I tested today against the full dataset. Replaying 1,000 then 5,000 then 20,000 searches, then attempting to do a write. It worked fine. So now I am running overnight against the full 43,000 searches from production. I will update in the morning.

Robert
02:17 PM
So the full search/insert worked fine overnight. So I can't reproduce it "on demand". However the live instances (0.24.0.rcn25) are still suffering from this problem every day or so. So if you can get me a build that has more detailed logs about why it's stuck, I can run the production instances with the builds in order to get the logs needed to track down the issue.
Kishore Nallan
02:19 PM
Got it. I'm thinking of adding a log at the start and end of each request with the ID, and also the actual request at the start. This way we can know if a particular request starts but doesn't complete and hence blocks the writes.

Robert
03:00 PM
That sounds good. I'm not sure if TypeSense has any other operations that can prevent writes, a maintenance thread or something, but if so you may want to also log any time a write lock is acquired and once again when released. I don't really know about the internals at all 🙂
Oct 30, 2022 (11 months ago)
Kishore Nallan
11:22 AM
I've added logging around both reads and writes in this build: typesense/typesense:0.24.0.rcn25-log1

1. For reads, I log the request_id, query/body at the start and the request_id at the end again to ensure that the request is not blocked anywhere.
2. For writes, I log just before the write lock is acquired and after it is released.
This should give us an insight into where it's getting stuck.
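A sketch of bringing an instance up on that image via Docker, with placeholder path, port, and key rather than Robert's actual deployment:

docker run -d --name typesense-log1 \
    -p 8108:8108 \
    -v /mnt/data/typesense0:/data \
    typesense/typesense:0.24.0.rcn25-log1 \
    --data-dir /data --api-key=$TYPESENSE_API_KEY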
Robert
11:55 AM
Awesome! The next time the DB locks up and I need to restart it, I will bring it up with your new log version and let you know how things go.

Nov 03, 2022 (11 months ago)
Robert
12:00 PM
Just to give you an update. On Oct 31, one of the 2 DB instances locked up, that was still running the old 0.24.0 non-log version. I brought that one down and brought it back up with the log1 version. Haven't had any lock ups since then yet. I'll update again once I have actual lock up logs to share.
Kishore Nallan
12:01 PM
👍 Thanks for the update.
Nov 12, 2022 (11 months ago)
Robert
03:16 AM
The DB locked up. I wrote a script to parse the log and determine if any searches are outstanding. It found a search that didn't finish, and testing the search myself, yes, it doesn't ever return results and Typesense is spinning a single core at 100%. I'm working on narrowing down a smaller dataset with reproduction steps so that I can submit an issue on GitHub.
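A sketch of that pairing logic; the actual start/end log markers and request-id format emitted by the -log1 build will differ from the placeholders used here:

# Collect request ids that started vs. finished, then diff them
grep 'request start' typesense.log | grep -o 'id=[0-9]*' | sort -u > started.txt
grep 'request end'   typesense.log | grep -o 'id=[0-9]*' | sort -u > finished.txt
comm -23 started.txt finished.txt   # ids that started but never finished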
Kishore Nallan
03:28 AM
Finally! Thank you.
Nov 13, 2022 (11 months ago)
Robert
02:51 AM
So I identified 2 separate queries that were running forever.
Query 1 was due to setting split_join_tokens : "always"; changing that to "fallback" fixed the issue. I'm not sure if you'd consider that a bug or just a 'risk' with certain datasets. If you do consider that a bug and want me to submit an issue with a test case, let me know. If you consider it just a risk with this setting, I'd suggest noting in the documentation for split_join_tokens that using "always" can sometimes lead to runaway searches.
Query 2 is a much more normal query that I'm still working on bisecting down a smaller dataset for a github issue.
Kishore Nallan
02:54 AM
The thing with split join tokens is that if a relatively rare word gets split into two words, one of which is a really frequently occurring word like a stop word, then that can cause a heavy query. What's the query that caused this issue?
Robert
03:13 AM
The word is q : "amanda", with the full query being:
{"query_by":"content,ext,filename,title","facet_by":"itemid,catid,genreid,family,formatid","max_facet_values":300,"page":1,"per_page":100,"include_fields":"filename,title,size,duration,ts,family,formatid,href,contentURL,subCount,width,height,ext,extra,label,itemid","highlight_fields":"none","highlight_affix_num_tokens":0,"search_cutoff_ms":40000,"infix":"off","prioritize_exact_match":true,"use_cache":true,"cache_ttl":3600,"sort_by":"_text_match:desc","prefix":true,"num_typos":2,"typo_tokens_threshold":1,"drop_tokens_threshold":1,"split_join_tokens":"always","q":"amanda","filter_by":"family:[image]"}

Setting num_typos to 0 and both tokens_thresholds to 0 didn't do anything, but as soon as I changed split_join_tokens to fallback, near instant result.
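A sketch of the same search with the workaround applied, assuming a collection named docs and a placeholder host/port/key:

curl -s -G -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
     "http://127.0.0.1:8108/collections/docs/documents/search" \
     --data-urlencode "q=amanda" \
     --data-urlencode "query_by=content,ext,filename,title" \
     --data-urlencode "filter_by=family:[image]" \
     --data-urlencode "split_join_tokens=fallback"   # "always" is what hung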
Robert
03:18 AM
I see a ton of common words within that one: "a", "man", "and", "da", etc.
It would be cool if split_join_tokens was split into two options, split_tokens and join_tokens, as I'd probably want join_tokens on whenever doing fuzzy matching, whereas I'd rarely want split_tokens on, especially since it can lead to runaway situations such as this one.
03:19
Robert
03:19 AM
It would also be pretty cool if search_cutoff_ms would interrupt the split_join_tokens operation, instead of it potentially running forever.
Kishore Nallan
03:21 AM
Great suggestions. I think the issue is not that the operation runs forever but I think there is a particular split word which is really heavy. I might have to get you another build with more logging to see where it gets stuck. Maybe we are not interrupting it at some place.
Robert
03:23 AM
I can submit a Github issue with a dataset and query that reproduces the runaway behavior, so you can debug/troubleshoot on your side (might be more efficient at narrowing down the problem). Would that be useful?
Kishore Nallan
03:25 AM
Yup that would be perfect!
Robert
03:26 AM
Will do.
Robert
03:31 AM
By the way, THANK YOU again for being so responsive and for your work on TypeSense. It's really a great piece of software 🙂

Robert
04:37 PM
I just submitted: https://github.com/typesense/typesense/issues/788
It covers the issue with split_join_tokens : "always" causing the search to hang forever. It's a lower-priority bug since there's a good workaround: setting split_join_tokens : "fallback".
Robert
04:37 PM
I'm working on a bug report for the more serious bug, where just a regular query causes the search to hang forever.
Robert
08:41 PM
I have submitted the second issue: https://github.com/typesense/typesense/issues/789
It covers a forever-running search that never seems to return. Full test data and reproduction steps included, of course.
Nov 14, 2022 (11 months ago)
Kishore Nallan
12:54 PM
Thank you, I will be taking a look over this week. Given that these data sets are large, the cycle time will probably be a bit longer. I will keep you posted. Appreciate all the help here in narrowing down.

Nov 28, 2022 (10 months ago)
Kishore Nallan
02:25 PM
I will be looking into the search_cutoff not kicking in this week. Sorry about the delay!

Robert
10:00 PM
No worries, I appreciate the update :)
Nov 30, 2022 (10 months ago)
Kishore Nallan
11:56 AM
Fixed in typesense/typesense:0.24.0.rcn37 -- this was again because of the large offsets which froze up the search in an unexpected manner. The other issue with phrase search still exists. I will be looking into that as well soon.
Dec 01, 2022 (10 months ago)
Robert
05:08 AM
Excellent! Great to hear, thanks!