# community-help
r
I have a Typesense instance, no search traffic, that I attempted to import some documents into. The import operation froze up, and the instance is stuck in a state with `queued_writes: 1`.
I can perform searches from the instance, but the write is stuck and I cannot write anything else to it. I've left it in this state for about 8 hours with no change. `/health` reports `ok: true`.
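For reference, a minimal sketch of how such a stuck node can be inspected from the shell, assuming the default API port 8108 and an admin key in `$TYPESENSE_API_KEY` (neither is confirmed in this thread):
```bash
# Health check -- still reports ok even while writes are wedged
curl -s http://127.0.0.1:8108/health

# Stats show pending write batches and per-endpoint request rates
curl -s -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  http://127.0.0.1:8108/stats.json

# Metrics give a quick look at CPU/memory while the node is stuck
curl -s -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  http://127.0.0.1:8108/metrics.json
```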
Log snippet:
```
I20221024 14:01:28.460819 36413 batched_indexer.cpp:250] Running GC for aborted requests, req map size: 0
I20221024 14:01:28.501175 36412 raft_server.cpp:534] Term: 2, last_index index: 1014724, committed_index: 1014724, known_applied_index: 1014724, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 303202476
I20221024 14:01:28.501262 36432 raft_server.h:60] Peer refresh succeeded!
I20221024 14:01:38.501965 36412 raft_server.cpp:534] Term: 2, last_index index: 1014724, committed_index: 1014724, known_applied_index: 1014724, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 303202476
I20221024 14:01:38.502049 36592 raft_server.h:60] Peer refresh succeeded!
```
To troubleshoot, I brought up a brand new typesense instance with an empty data dir, set the same schema and I was able to import the documents just fine.
Next, I made a copy of the data directory of the stuck instance (`rsync -avH` to handle the hard links properly) and then started up a new Typesense process against this data copy. It didn't have any queued writes on startup, so I attempted the document import again. It worked fine.
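A rough sketch of that procedure, reusing the original data dir path quoted later in the thread; the copy path, ports, and collection/file names here are hypothetical:
```bash
# Copy the data dir; -H preserves hard links, as noted above
rsync -avH /mnt/data/typesense0/ /mnt/data/typesense0-copy/

# Start a throwaway server against the copy on different ports
typesense-server \
  --data-dir=/mnt/data/typesense0-copy \
  --api-key="$TYPESENSE_API_KEY" \
  --api-address=127.0.0.1 --api-port=14201 \
  --peering-address=127.0.0.1 --peering-port=14200 &

# Retry the import against the copy (JSONL, one document per line)
curl -s -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -X POST --data-binary @documents.jsonl \
  "http://127.0.0.1:14201/collections/my_collection/documents/import?action=upsert" \
  | tail -n 5
```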
So I attempted to stop the original stuck typesense instance and upon Ctrl-C it just outputs:
```
I20221024 18:56:27.481899 36412 raft_server.cpp:778] Waiting for in-flight writes to finish...
I20221024 18:56:27.481905 36412 raft_server.cpp:780] pending_writes: 1
I20221024 18:56:28.482007 36412 raft_server.cpp:780] pending_writes: 1
I20221024 18:56:29.482134 36412 raft_server.cpp:780] pending_writes: 1
```
I had to use `kill -9` to terminate it.
What exactly does 'in-flight writes' mean?
What could cause one to get stuck this way?
How might I avoid it getting stuck in the future?
k
I think there might be an issue that caused the write thread to crash. We have identified and fixed a few edge cases in the 0.24 RC builds which guarantee that this type of bad-data issue cannot stall the write thread.
r
Ahh, great news. Thank you! I'll hold off on updating for now and see if I can muddle through until 0.24 releases. Any rough ETA on a 0.24 release? Also, when I do update, is it preferred that I start with a fresh DB and re-import the documents, or will it be safe to use existing data files from 0.23?
k
It will be safe to use the 0.23 data: we strive very hard to offer a seamless upgrade. We are entering a period of code freeze for 0.24 branch. So 2-3 weeks, barring any new major bugs from existing features.
👍 1
r
Ugh, it got stuck again, this time with 2 queued writes. I think I'm gonna have to switch over to a 0.24 RC to see if that resolves the problem. I'm looking at https://hub.docker.com/r/typesense/typesense/tags to find the latest RC; should I be using 0.24.0.rcn25 or 0.24.0.rc29 or a different one? Want to make sure I get the one that has the fixes you mentioned.
Ahh, did some searching; looks like I should use the latest rcn.
k
Latest stable is 0.24.0.rcn25
r
Sad news, this 'stuck' state is happening in 0.24.0.rcn25 as well 😞
Order of events:
1. The instance was serving search traffic
2. I stopped sending search traffic to it entirely
3. I attempted to import documents
4. The server gets stuck with 1 queued_writes and the documents don't import; it just hangs
Here is a log snippet:
```
I20221027 15:19:08.324472 39561 batched_indexer.cpp:264] Running GC for aborted requests, req map size: 0
I20221027 15:19:16.795174 39560 raft_server.cpp:540] Term: 5, last_index index: 1028357, committed_index: 1028357, known_applied_index: 1028357, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 309744913
I20221027 15:19:16.795248 39850 raft_server.h:60] Peer refresh succeeded!
I20221027 15:19:26.795794 39560 raft_server.cpp:540] Term: 5, last_index index: 1028357, committed_index: 1028357, known_applied_index: 1028357, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 309744913
I20221027 15:19:26.795842 39811 raft_server.h:60] Peer refresh succeeded!
I20221027 15:19:36.796442 39560 raft_server.cpp:540] Term: 5, last_index index: 1028357, committed_index: 1028357, known_applied_index: 1028357, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 309744913
I20221027 15:19:36.796504 39831 raft_server.h:60] Peer refresh succeeded!
I20221027 15:19:46.797086 39560 raft_server.cpp:540] Term: 5, last_index index: 1028357, committed_index: 1028357, known_applied_index: 1028357, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 309744913
I20221027 15:19:46.797152 39740 raft_server.h:60] Peer refresh succeeded!
I20221027 15:19:56.797785 39560 raft_server.cpp:540] Term: 5, last_index index: 1028357, committed_index: 1028357, known_applied_index: 1028357, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 309744913
I20221027 15:19:56.797845 39846 raft_server.h:60] Peer refresh succeeded!
I20221027 15:20:06.798342 39560 raft_server.cpp:540] Term: 5, last_index index: 1028357, committed_index: 1028357, known_applied_index: 1028357, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 309744913
I20221027 15:20:06.798398 39804 raft_server.h:60] Peer refresh succeeded!
I20221027 15:20:09.327406 39561 batched_indexer.cpp:264] Running GC for aborted requests, req map size: 0
I20221027 15:20:16.798827 39560 raft_server.cpp:540] Term: 5, last_index index: 1028357, committed_index: 1028357, known_applied_index: 1028357, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 309744913
I20221027 15:20:16.798883 39757 raft_server.h:60] Peer refresh succeeded!
I20221027 15:20:26.799475 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:20:26.799559 39788 raft_server.h:60] Peer refresh succeeded!
I20221027 15:20:36.800021 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:20:36.800071 39653 raft_server.h:60] Peer refresh succeeded!
I20221027 15:20:46.800534 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:20:46.800592 39588 raft_server.h:60] Peer refresh succeeded!
I20221027 15:20:56.801030 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:20:56.801095 39781 raft_server.h:60] Peer refresh succeeded!
I20221027 15:21:06.801538 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:21:06.801602 39657 raft_server.h:60] Peer refresh succeeded!
I20221027 15:21:10.330410 39561 batched_indexer.cpp:264] Running GC for aborted requests, req map size: 1
I20221027 15:21:16.802002 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:21:16.802057 39735 raft_server.h:60] Peer refresh succeeded!
I20221027 15:21:26.802635 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:21:26.802727 39691 raft_server.h:60] Peer refresh succeeded!
I20221027 15:21:36.803251 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:21:36.803301 39681 raft_server.h:60] Peer refresh succeeded!
I20221027 15:21:46.803797 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:21:46.803884 39851 raft_server.h:60] Peer refresh succeeded!
I20221027 15:21:56.804364 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:21:56.804412 39655 raft_server.h:60] Peer refresh succeeded!
I20221027 15:22:06.804944 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:22:06.804996 39659 raft_server.h:60] Peer refresh succeeded!
I20221027 15:22:11.333583 39561 batched_indexer.cpp:264] Running GC for aborted requests, req map size: 1
```
This `queued_writes: 1` repeats forever; it's been about an hour so far. It's as if, when Typesense has just been serving searches for hours and hours and then stops getting search traffic, it sometimes doesn't want to accept imports. Server launched with:
```
typesense-server --data-dir=/mnt/data/typesense0 --api-port=14101 --peering-port=14100 --api-key=REDACTED --api-address=127.0.0.1 --peering-address=127.0.0.1 --snapshot-interval-seconds=86400 --num-documents-parallel-load=10000 --thread-pool-size=400
```
I imagine, like previous times, if I bring down the database and bring it back up, I'll be able to import the documents just fine. Sadly this is like a 3-hour process, so it's something I'd like to avoid as much as possible. Also, since this happens sporadically, I can't provide a test case. Any idea how we can troubleshoot this further and figure out the cause/solution? Is there additional logging that can be enabled that will further explain what write is queued or why it's not progressing? Or any other way to get additional information about its current 'stuck' state?
k
Ok let me work on getting you a build that has an endpoint that spits out the currently queued document. We can use that to see if the issue can be reproduced.
Do you know if the hang happens on the very first document? Or does it fail after some documents have been imported?
The other explanation could be that an ongoing search operation has somehow stolen the lock and is not releasing it. Can you post the output of the /stats endpoint?
r
Wasn't able to test on the instance that was stuck; I had already restarted the Typesense process and, as expected, all documents then imported just fine. Funny enough though, my secondary Typesense instance just froze up a few moments ago upon trying to import documents. It locked up on the very first document it was attempting to import, so it didn't even get one in. Calling `curl http://127.0.0.1:8108/stats.json` yields:
```
{
  "delete_latency_ms": 0,
  "delete_requests_per_second": 0,
  "import_latency_ms": 0,
  "import_requests_per_second": 0,
  "latency_ms": {},
  "pending_write_batches": 1,
  "requests_per_second": {},
  "search_latency_ms": 0,
  "search_requests_per_second": 0,
  "total_requests_per_second": 0,
  "write_latency_ms": 0,
  "write_requests_per_second": 0
}
```
Again, this instance was just doing searches. I diverted search traffic away, waited 2 minutes for searches to finish up and then attempted the import.
Perhaps it would be useful if /stats also showed what is holding any locks and preventing writes from taking place, or if there were any other additional logging as to why writes are not moving forward.
I will also spend some time tomorrow to see if I can somehow get a reproducible test case.
So I tried a little test case: just a couple hundred megs of data, replayed 42,000 searches captured from production, then attempted to write to the DB. Worked fine. Now I'm gonna test against the full dataset. This will take a day or two, as each time I load up the server it takes about 3 hours.
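A sketch of that kind of replay harness; the file names and collection name below are assumptions (the port matches the launch command quoted earlier), with one captured search body per line in `searches.jsonl`, each assumed to include a `"collection"` field:
```bash
# Replay captured production searches via the multi_search endpoint
while IFS= read -r body; do
  curl -s -o /dev/null \
    -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
    -H "Content-Type: application/json" \
    -X POST -d "{\"searches\":[${body}]}" \
    "http://127.0.0.1:14101/multi_search"
done < searches.jsonl

# Then attempt the write that previously hung
curl -s -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -X POST --data-binary @documents.jsonl \
  "http://127.0.0.1:14101/collections/my_collection/documents/import?action=upsert" \
  | tail -n 5
```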
k
Ok, I got tied up with an issue yesterday so I could not respond. I think our best bet is being able to reproduce the issue, which your replay should be able to do. Then I can send a build with more detailed logs which can highlight where it gets stuck. It must be an edge case, because we have never had this type of issue show up on any of the thousands of nodes we run on Typesense Cloud, and this includes some large datasets as well.
r
No worries about the delay, I really appreciate the help you’ve been giving! So I tested today against the full dataset. Replaying 1,000 then 5,000 then 20,000 searches, then attempting to do a write. It worked fine. So now I am running overnight against the full 43,000 searches from production. I will update in the morning.
👍 1
So the full search/insert worked fine overnight, which means I can't reproduce it "on demand". However, the live instances (0.24.0.rcn25) are still suffering from this problem every day or so. If you can get me a build that has more detailed logs about why it's stuck, I can run the production instances with that build in order to get the logs needed to track down the issue.
k
Got it. I'm thinking of adding a log line at the start and end of each request with the ID, and also the actual request at the start. This way we can know if a particular request starts but doesn't complete and hence blocks the writes.
👍 1
r
That sounds good. I'm not sure if Typesense has any other operations that can prevent writes, a maintenance thread or something, but if so you may want to also log any time a write lock is acquired and once again when it's released. I don't really know about the internals at all 🙂
k
I've added logging around both reads and writes in this build: `typesense/typesense:0.24.0.rcn25-log1`
1. For reads, I log the request_id and query/body at the start, and the request_id again at the end to ensure that the request is not blocked anywhere.
2. For writes, I log just before the write lock is acquired and after it is released.
This should give us insight into where it's getting stuck.
r
Awesome! The next time the DB locks up and I need to restart it, I will bring it up with your new log version and let you know how things go.
👍 1
Just to give you an update: on Oct 31, one of the 2 DB instances locked up; that one was still running the old 0.24.0 non-log version. I brought it down and brought it back up with the `log1` version. Haven't had any lock-ups since then yet. I'll update again once I have actual lock-up logs to share.
k
👍 Thanks for the update.
r
The DB locked up. I wrote a script to parse the log and determine if any searches are outstanding. It found a search that didn't finish, and when I tested that search myself, sure enough, it never returns results and Typesense spins a single core at 100%. I'm working on narrowing down a smaller dataset with reproduction steps so I can submit an issue on GitHub.
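The actual format of the `log1` log lines isn't shown in this thread, so the patterns below are hypothetical placeholders, but the pairing logic is roughly: collect request IDs from the "start" lines, drop the ones that also appear on an "end" line, and whatever remains is a request that never finished.
```bash
#!/usr/bin/env bash
# Adjust the grep patterns and the awk field to match the real log1 output.
LOG=typesense.log

grep 'request start' "$LOG" | awk '{print $NF}' | sort -u > /tmp/started.txt
grep 'request end'   "$LOG" | awk '{print $NF}' | sort -u > /tmp/finished.txt

# Request IDs that started but never finished
comm -23 /tmp/started.txt /tmp/finished.txt
```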
k
Finally! Thank you.
r
So I identified 2 separate queries that were running forever. Query 1 was due to setting `split_join_tokens: "always"`, and changing that to `"fallback"` fixed the issue. I'm not sure if you'd consider that a bug or just a 'risk' with certain datasets. If you do consider it a bug and want me to submit an issue with a test case, let me know. If you consider it just a risk with this setting, I'd suggest noting in the documentation for split_join_tokens that using "always" can sometimes lead to runaway searches. Query 2 is a much more normal query that I'm still working on bisecting down to a smaller dataset for a GitHub issue.
k
The thing with split join tokens is that if a relatively rare word gets split into two words, one of which is a really frequently occurring word like a stop word, then that can cause a heavy query. What's the query that caused this issue?
r
The word is `q: "amanda"`, with the full query being:
```
{
  "query_by": "content,ext,filename,title",
  "facet_by": "itemid,catid,genreid,family,formatid",
  "max_facet_values": 300,
  "page": 1,
  "per_page": 100,
  "include_fields": "filename,title,size,duration,ts,family,formatid,href,contentURL,subCount,width,height,ext,extra,label,itemid",
  "highlight_fields": "none",
  "highlight_affix_num_tokens": 0,
  "search_cutoff_ms": 40000,
  "infix": "off",
  "prioritize_exact_match": true,
  "use_cache": true,
  "cache_ttl": 3600,
  "sort_by": "_text_match:desc",
  "prefix": true,
  "num_typos": 2,
  "typo_tokens_threshold": 1,
  "drop_tokens_threshold": 1,
  "split_join_tokens": "always",
  "q": "amanda",
  "filter_by": "family:[image]"
}
```
Setting num_typos to 0 and both tokens_thresholds to 0 didn't do anything, but as soon as I changed `split_join_tokens` to `fallback`, I got a near-instant result.
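For reference, a sketch of the workaround as a direct API call; the collection name is hypothetical and only a subset of the parameters from the query above is shown, with `split_join_tokens` switched to `fallback`:
```bash
curl -s -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -H "Content-Type: application/json" \
  -X POST "http://127.0.0.1:14101/multi_search" \
  -d '{
    "searches": [{
      "collection": "my_collection",
      "q": "amanda",
      "query_by": "content,ext,filename,title",
      "filter_by": "family:[image]",
      "num_typos": 2,
      "prefix": true,
      "split_join_tokens": "fallback"
    }]
  }'
```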
I see a ton of common words within that one: `a`, `man`, `and`, `da`, etc. It would be cool if `split_join_tokens` was split into two options, `split_tokens` and `join_tokens`, as I'd probably want `join_tokens` on whenever doing fuzzy matching, whereas I'd rarely want `split_tokens` on, especially since it can lead to runaway situations such as this one.
It would also be pretty cool if `search_cutoff_ms` would interrupt the `split_join_tokens` operation, instead of it potentially running forever.
k
Great suggestions. I think the issue is not that the operation runs forever, but that there is a particular split word which is really heavy. I might have to get you another build with more logging to see where it gets stuck. Maybe we are not interrupting it at some place.
r
I can submit a Github issue with a dataset and query that reproduces the runaway behavior, so you can debug/troubleshoot on your side (might be more efficient at narrowing down the problem). Would that be useful?
k
Yup that would be perfect!
r
Will do.
By the way, THANK YOU again for being so responsive and for your work on TypeSense. It's really a great piece of software 🙂
🙌 1
I just submitted https://github.com/typesense/typesense/issues/788. It covers the issue with `split_join_tokens: "always"` causing the search to hang forever. It's a bit lower priority of a bug due to a good workaround: setting `split_join_tokens: "fallback"`. I'm working on a bug report for the more serious bug, where just a regular query causes the search to hang forever.
I have submitted the second issue: https://github.com/typesense/typesense/issues/789 It covers a forever-running search that never seems to return. Full test data and reproduction steps included, of course.
k
Thank you, I will be taking a look over this week. Given that these datasets are large, the cycle time will probably be a bit longer. I will keep you posted. Appreciate all the help here in narrowing this down.
👍 1
I will be looking into the search_cutoff not kicking in this week. Sorry about the delay!
👍 1
r
No worries, I appreciate the update :)
k
Fixed in `typesense/typesense:0.24.0.rcn37` -- this was again because of the large offsets, which froze up the search in an unexpected manner. The other issue with phrase search still exists; I will be looking into that as well soon.
r
Excellent! Great to hear, thanks!