# community-help
r
I have a Typesense instance, no search traffic, that I attempted to import some documents into. The import operation froze up, and the instance is stuck in a state with `queued_writes: 1`.
I can perform searches from the instance, but the write is stuck and I cannot write anything else to it. I've left it in this state for about 8 hours with no change. `/health` reports `ok: true`.
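For reference, a minimal sketch of how such a stuck node can be inspected from the shell, assuming the default API port 8108 and an admin key in `$TYPESENSE_API_KEY` (neither is confirmed in this thread):
```bash
# Health check -- still reports ok even while writes are wedged
curl -s http://127.0.0.1:8108/health

# Stats show pending write batches and per-endpoint request rates
curl -s -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  http://127.0.0.1:8108/stats.json

# Metrics give a quick look at CPU/memory while the node is stuck
curl -s -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  http://127.0.0.1:8108/metrics.json
```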
Log snippet:
```
I20221024 14:01:28.460819 36413 batched_indexer.cpp:250] Running GC for aborted requests, req map size: 0
I20221024 14:01:28.501175 36412 raft_server.cpp:534] Term: 2, last_index index: 1014724, committed_index: 1014724, known_applied_index: 1014724, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 303202476
I20221024 14:01:28.501262 36432 raft_server.h:60] Peer refresh succeeded!
I20221024 14:01:38.501965 36412 raft_server.cpp:534] Term: 2, last_index index: 1014724, committed_index: 1014724, known_applied_index: 1014724, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 303202476
I20221024 14:01:38.502049 36592 raft_server.h:60] Peer refresh succeeded!
```
To troubleshoot, I brought up a brand new typesense instance with an empty data dir, set the same schema and I was able to import the documents just fine.
Next, I made a copy of the data directory of the stuck instance (`rsync -avH` to handle the hard links properly) and then started up a new Typesense process against this data copy. It didn't have any queued writes on startup, so I attempted the document import again. It worked fine.
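A rough sketch of that procedure, reusing the original data dir path quoted later in the thread; the copy path, ports, and collection/file names here are hypothetical:
```bash
# Copy the data dir; -H preserves hard links, as noted above
rsync -avH /mnt/data/typesense0/ /mnt/data/typesense0-copy/

# Start a throwaway server against the copy on different ports
typesense-server \
  --data-dir=/mnt/data/typesense0-copy \
  --api-key="$TYPESENSE_API_KEY" \
  --api-address=127.0.0.1 --api-port=14201 \
  --peering-address=127.0.0.1 --peering-port=14200 &

# Retry the import against the copy (JSONL, one document per line)
curl -s -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -X POST --data-binary @documents.jsonl \
  "http://127.0.0.1:14201/collections/my_collection/documents/import?action=upsert" \
  | tail -n 5
```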
So I attempted to stop the original stuck typesense instance and upon Ctrl-C it just outputs:
```
I20221024 18:56:27.481899 36412 raft_server.cpp:778] Waiting for in-flight writes to finish...
I20221024 18:56:27.481905 36412 raft_server.cpp:780] pending_writes: 1
I20221024 18:56:28.482007 36412 raft_server.cpp:780] pending_writes: 1
I20221024 18:56:29.482134 36412 raft_server.cpp:780] pending_writes: 1
```
I had to use `kill -9` to terminate it.
What exactly does 'in-flight writes' mean?
What could cause one to get stuck this way?
How might I avoid it getting stuck in the future?
k
I think there might be an issue that caused the write thread to crash. We have identified and fixed a few edge cases in the 0.24 RC builds which guarantee that this type of bad-data issue cannot stall the write thread.
r
Ahh, great news. Thank you! I'll hold off on updating for now and see if I can muddle through until 0.24 releases. Any rough ETA on a 0.24 release? Also, when I do update, is it preferred that I start with a fresh DB and re-import the documents, or will it be safe to use existing data files from 0.23?
k
It will be safe to use the 0.23 data: we strive very hard to offer a seamless upgrade. We are entering a period of code freeze for 0.24 branch. So 2-3 weeks, barring any new major bugs from existing features.
👍 1
r
Ugh, it got stuck again, this time with 2 queued writes. I think I'm gonna have to switch over to a 0.24 RC to see if that resolves the problem. I'm looking at https://hub.docker.com/r/typesense/typesense/tags to find the latest RC; should I be using 0.24.0.rcn25 or 0.24.0.rc29 or a different one? Want to make sure I get the one that has the fixes you mentioned.
Ahh, did some searching; looks like I should use the latest rcn.
k
Latest stable is 0.24.0.rcn25
r
Sad news, this 'stuck' state is happening in 0.24.0.rcn25 as well 😞
Order of events:
1. The instance was serving search traffic
2. I stopped sending search traffic to it entirely
3. I attempted to import documents
4. The server gets stuck with 1 queued_writes and the documents don't import; it just hangs
Here is a log snippet:
```
I20221027 15:19:08.324472 39561 batched_indexer.cpp:264] Running GC for aborted requests, req map size: 0
I20221027 15:19:16.795174 39560 raft_server.cpp:540] Term: 5, last_index index: 1028357, committed_index: 1028357, known_applied_index: 1028357, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 309744913
I20221027 15:19:16.795248 39850 raft_server.h:60] Peer refresh succeeded!
I20221027 15:19:26.795794 39560 raft_server.cpp:540] Term: 5, last_index index: 1028357, committed_index: 1028357, known_applied_index: 1028357, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 309744913
I20221027 15:19:26.795842 39811 raft_server.h:60] Peer refresh succeeded!
I20221027 15:19:36.796442 39560 raft_server.cpp:540] Term: 5, last_index index: 1028357, committed_index: 1028357, known_applied_index: 1028357, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 309744913
I20221027 15:19:36.796504 39831 raft_server.h:60] Peer refresh succeeded!
I20221027 15:19:46.797086 39560 raft_server.cpp:540] Term: 5, last_index index: 1028357, committed_index: 1028357, known_applied_index: 1028357, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 309744913
I20221027 15:19:46.797152 39740 raft_server.h:60] Peer refresh succeeded!
I20221027 15:19:56.797785 39560 raft_server.cpp:540] Term: 5, last_index index: 1028357, committed_index: 1028357, known_applied_index: 1028357, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 309744913
I20221027 15:19:56.797845 39846 raft_server.h:60] Peer refresh succeeded!
I20221027 15:20:06.798342 39560 raft_server.cpp:540] Term: 5, last_index index: 1028357, committed_index: 1028357, known_applied_index: 1028357, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 309744913
I20221027 15:20:06.798398 39804 raft_server.h:60] Peer refresh succeeded!
I20221027 15:20:09.327406 39561 batched_indexer.cpp:264] Running GC for aborted requests, req map size: 0
I20221027 15:20:16.798827 39560 raft_server.cpp:540] Term: 5, last_index index: 1028357, committed_index: 1028357, known_applied_index: 1028357, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 309744913
I20221027 15:20:16.798883 39757 raft_server.h:60] Peer refresh succeeded!
I20221027 15:20:26.799475 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:20:26.799559 39788 raft_server.h:60] Peer refresh succeeded!
I20221027 15:20:36.800021 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:20:36.800071 39653 raft_server.h:60] Peer refresh succeeded!
I20221027 15:20:46.800534 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:20:46.800592 39588 raft_server.h:60] Peer refresh succeeded!
I20221027 15:20:56.801030 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:20:56.801095 39781 raft_server.h:60] Peer refresh succeeded!
I20221027 15:21:06.801538 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:21:06.801602 39657 raft_server.h:60] Peer refresh succeeded!
I20221027 15:21:10.330410 39561 batched_indexer.cpp:264] Running GC for aborted requests, req map size: 1
I20221027 15:21:16.802002 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:21:16.802057 39735 raft_server.h:60] Peer refresh succeeded!
I20221027 15:21:26.802635 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:21:26.802727 39691 raft_server.h:60] Peer refresh succeeded!
I20221027 15:21:36.803251 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:21:36.803301 39681 raft_server.h:60] Peer refresh succeeded!
I20221027 15:21:46.803797 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:21:46.803884 39851 raft_server.h:60] Peer refresh succeeded!
I20221027 15:21:56.804364 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:21:56.804412 39655 raft_server.h:60] Peer refresh succeeded!
I20221027 15:22:06.804944 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:22:06.804996 39659 raft_server.h:60] Peer refresh succeeded!
I20221027 15:22:11.333583 39561 batched_indexer.cpp:264] Running GC for aborted requests, req map size: 1
```
This `queued_writes: 1` repeats forever; it's been about an hour so far. It's as if, when Typesense has just been serving searches for hours and hours and then stops getting search traffic, it sometimes doesn't want to accept imports. Server launched with:
```
typesense-server --data-dir=/mnt/data/typesense0 --api-port=14101 --peering-port=14100 --api-key=REDACTED --api-address=127.0.0.1 --peering-address=127.0.0.1 --snapshot-interval-seconds=86400 --num-documents-parallel-load=10000 --thread-pool-size=400
```
I imagine, like previous times, if I bring down the database and bring it back up, I'll be able to import the documents just fine. Sadly this is like a 3-hour process, so it's something I'd like to avoid as much as possible. Also, since this happens sporadically, I can't provide a test case. Any idea how we can troubleshoot this further and figure out the cause/solution? Is there additional logging that can be enabled that will further explain what write is queued or why it's not progressing? Or any other way to get additional information about its current 'stuck' state?
k
Ok let me work on getting you a build that has an endpoint that spits out the currently queued document. We can use that to see if the issue can be reproduced.
Do you know if the hang happens on the very first document? Or does it fail after some documents have been imported?
The other explanation could be that an ongoing search operation has somehow stolen the lock and is not releasing it. Can you post the output of the /stats endpoint?
r
Wasn't able to test on the instance that was stuck; I had already restarted the Typesense process and, as expected, all documents then imported just fine. Funny enough though, my secondary Typesense instance just froze up a few moments ago upon trying to import documents. It locked up on the very first document it was attempting to import, so it didn't even get one in. Calling `curl http://127.0.0.1:8108/stats.json` yields:
```
{
  "delete_latency_ms": 0,
  "delete_requests_per_second": 0,
  "import_latency_ms": 0,
  "import_requests_per_second": 0,
  "latency_ms": {},
  "pending_write_batches": 1,
  "requests_per_second": {},
  "search_latency_ms": 0,
  "search_requests_per_second": 0,
  "total_requests_per_second": 0,
  "write_latency_ms": 0,
  "write_requests_per_second": 0
}
```
Again, this instance was just doing searches. I diverted search traffic away, waited 2 minutes for searches to finish up and then attempted the import.
Perhaps it would be useful if /stats also showed what is holding any locks and preventing writes from taking place, or if there were any other additional logging as to why writes are not moving forward.
I will also spend some time tomorrow to see if I can somehow get a reproducible test case.
So I tried a little test case: just a couple hundred megs of data, replayed 42,000 searches captured from production, then attempted to write to the DB. Worked fine. Now I'm gonna test against the full dataset. This will take a day or two, as each time I load up the server it takes about 3 hours.
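A sketch of that kind of replay harness; the file names and collection name below are assumptions (the port matches the launch command quoted earlier), with one captured search body per line in `searches.jsonl`, each assumed to include a `"collection"` field:
```bash
# Replay captured production searches via the multi_search endpoint
while IFS= read -r body; do
  curl -s -o /dev/null \
    -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
    -H "Content-Type: application/json" \
    -X POST -d "{\"searches\":[${body}]}" \
    "http://127.0.0.1:14101/multi_search"
done < searches.jsonl

# Then attempt the write that previously hung
curl -s -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -X POST --data-binary @documents.jsonl \
  "http://127.0.0.1:14101/collections/my_collection/documents/import?action=upsert" \
  | tail -n 5
```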
k
Ok, I got tied up with an issue yesterday so I could not respond. I think our best bet is being able to reproduce the issue, which your replay should be able to do. Then I can send a build with more detailed logs which can highlight where it gets stuck. It must be an edge case, because we have never had this type of issue show up on any of the thousands of nodes we run on Typesense Cloud, and this includes some large datasets as well.
r
No worries about the delay, I really appreciate the help you’ve been giving! So I tested today against the full dataset. Replaying 1,000 then 5,000 then 20,000 searches, then attempting to do a write. It worked fine. So now I am running overnight against the full 43,000 searches from production. I will update in the morning.
👍 1
So the full search/insert worked fine overnight, which means I can't reproduce it "on demand". However, the live instances (0.24.0.rcn25) are still suffering from this problem every day or so. If you can get me a build that has more detailed logs about why it's stuck, I can run the production instances with that build in order to get the logs needed to track down the issue.
k
Got it. I'm thinking of adding a log line at the start and end of each request with the ID, and also the actual request at the start. This way we can know if a particular request starts but doesn't complete and hence blocks the writes.
👍 1
r
That sounds good. I'm not sure if Typesense has any other operations that can prevent writes, a maintenance thread or something, but if so you may want to also log any time a write lock is acquired and once again when it's released. I don't really know about the internals at all 🙂
k
I've added logging around both reads and writes in this build: `typesense/typesense:0.24.0.rcn25-log1`
1. For reads, I log the request_id and query/body at the start, and the request_id again at the end to ensure that the request is not blocked anywhere.
2. For writes, I log just before the write lock is acquired and after it is released.
This should give us insight into where it's getting stuck.
r
Awesome! The next time the DB locks up and I need to restart it, I will bring it up with your new log version and let you know how things go.
👍 1
Just to give you an update: on Oct 31, one of the 2 DB instances locked up; that one was still running the old 0.24.0 non-log version. I brought it down and brought it back up with the `log1` version. Haven't had any lock-ups since then yet. I'll update again once I have actual lock-up logs to share.
k
👍 Thanks for the update.
r
The DB locked up. I wrote a script to parse the log and determine if any searches are outstanding. It found a search that didn't finish, and when I tested that search myself, sure enough, it never returns results and Typesense spins a single core at 100%. I'm working on narrowing down a smaller dataset with reproduction steps so I can submit an issue on GitHub.
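The actual format of the `log1` log lines isn't shown in this thread, so the patterns below are hypothetical placeholders, but the pairing logic is roughly: collect request IDs from the "start" lines, drop the ones that also appear on an "end" line, and whatever remains is a request that never finished.
```bash
#!/usr/bin/env bash
# Adjust the grep patterns and the awk field to match the real log1 output.
LOG=typesense.log

grep 'request start' "$LOG" | awk '{print $NF}' | sort -u > /tmp/started.txt
grep 'request end'   "$LOG" | awk '{print $NF}' | sort -u > /tmp/finished.txt

# Request IDs that started but never finished
comm -23 /tmp/started.txt /tmp/finished.txt
```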
k
Finally! Thank you.
r
So I identified 2 separate queries that were running forever. Query 1 was due to setting `split_join_tokens: "always"`, and changing that to `"fallback"` fixed the issue. I'm not sure if you'd consider that a bug or just a 'risk' with certain datasets. If you do consider it a bug and want me to submit an issue with a test case, let me know. If you consider it just a risk with this setting, I'd suggest noting in the documentation for split_join_tokens that using "always" can sometimes lead to runaway searches. Query 2 is a much more normal query that I'm still working on bisecting down to a smaller dataset for a GitHub issue.
k
The thing with split join tokens is that if a relatively rare word gets split into two words, one of which is a really frequently occurring word like a stop word, then that can cause a heavy query. What's the query that caused this issue?
r
The word is `q: "amanda"`, with the full query being:
```
{
  "query_by": "content,ext,filename,title",
  "facet_by": "itemid,catid,genreid,family,formatid",
  "max_facet_values": 300,
  "page": 1,
  "per_page": 100,
  "include_fields": "filename,title,size,duration,ts,family,formatid,href,contentURL,subCount,width,height,ext,extra,label,itemid",
  "highlight_fields": "none",
  "highlight_affix_num_tokens": 0,
  "search_cutoff_ms": 40000,
  "infix": "off",
  "prioritize_exact_match": true,
  "use_cache": true,
  "cache_ttl": 3600,
  "sort_by": "_text_match:desc",
  "prefix": true,
  "num_typos": 2,
  "typo_tokens_threshold": 1,
  "drop_tokens_threshold": 1,
  "split_join_tokens": "always",
  "q": "amanda",
  "filter_by": "family:[image]"
}
```
Setting num_typos to 0 and both tokens_thresholds to 0 didn't do anything, but as soon as I changed `split_join_tokens` to `fallback`, I got a near-instant result.
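For reference, a sketch of the workaround as a direct API call; the collection name is hypothetical and only a subset of the parameters from the query above is shown, with `split_join_tokens` switched to `fallback`:
```bash
curl -s -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -H "Content-Type: application/json" \
  -X POST "http://127.0.0.1:14101/multi_search" \
  -d '{
    "searches": [{
      "collection": "my_collection",
      "q": "amanda",
      "query_by": "content,ext,filename,title",
      "filter_by": "family:[image]",
      "num_typos": 2,
      "prefix": true,
      "split_join_tokens": "fallback"
    }]
  }'
```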
I see a ton of common words within that one: `a`, `man`, `and`, `da`, etc. It would be cool if `split_join_tokens` was split into two options, `split_tokens` and `join_tokens`, as I'd probably want `join_tokens` on whenever doing fuzzy matching, whereas I'd rarely want `split_tokens` on, especially since it can lead to runaway situations such as this one.
It would also be pretty cool if `search_cutoff_ms` would interrupt the `split_join_tokens` operation, instead of it potentially running forever.
k
Great suggestions. I think the issue is not that the operation runs forever, but that there is a particular split word which is really heavy. I might have to get you another build with more logging to see where it gets stuck. Maybe we are not interrupting it at some place.
r
I can submit a Github issue with a dataset and query that reproduces the runaway behavior, so you can debug/troubleshoot on your side (might be more efficient at narrowing down the problem). Would that be useful?
k
Yup that would be perfect!
r
Will do.
By the way, THANK YOU again for being so responsive and for your work on TypeSense. It's really a great piece of software 🙂
🙌 1
I just submitted https://github.com/typesense/typesense/issues/788. It covers the issue with `split_join_tokens: "always"` causing the search to hang forever. It's a bit lower priority of a bug due to a good workaround: setting `split_join_tokens: "fallback"`. I'm working on a bug report for the more serious bug, where just a regular query causes the search to hang forever.
I have submitted the second issue: https://github.com/typesense/typesense/issues/789 It covers a forever-running search that never seems to return. Full test data and reproduction steps included, of course.
k
Thank you, I will be taking a look over this week. Given that these datasets are large, the cycle time will probably be a bit longer. I will keep you posted. Appreciate all the help here in narrowing this down.
👍 1
I will be looking into the search_cutoff not kicking in this week. Sorry about the delay!
👍 1
r
No worries, I appreciate the update :)
k
Fixed in `typesense/typesense:0.24.0.rcn37` -- this was again because of the large offsets, which froze up the search in an unexpected manner. The other issue with phrase search still exists; I will be looking into that as well soon.
r
Excellent! Great to hear, thanks!