Troubleshooting Stalled Writes in a Typesense Instance
TLDR: Robert was experiencing Typesense instances getting stuck after trying to import documents. Kishore Nallan provided suggestions and added specific logging to diagnose the issue. The two of them identified the queries causing the trouble, but the issues had not been fully resolved yet.
Oct 24, 2022 (12 months ago)
Robert
11:03 PM
I20221024 14:01:28.460819 36413 batched_indexer.cpp:250] Running GC for aborted requests, req map size: 0
I20221024 14:01:28.501175 36412 raft_server.cpp:534] Term: 2, last_index index: 1014724, committed_index: 1014724, known_applied_index: 1014724, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 303202476
I20221024 14:01:28.501262 36432 raft_server.h:60] Peer refresh succeeded!
I20221024 14:01:38.501965 36412 raft_server.cpp:534] Term: 2, last_index index: 1014724, committed_index: 1014724, known_applied_index: 1014724, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 303202476
I20221024 14:01:38.502049 36592 raft_server.h:60] Peer refresh succeeded!
Robert
11:03 PM
I20221024 18:56:27.481899 36412 raft_server.cpp:778] Waiting for in-flight writes to finish...
I20221024 18:56:27.481905 36412 raft_server.cpp:780] pending_writes: 1
I20221024 18:56:28.482007 36412 raft_server.cpp:780] pending_writes: 1
I20221024 18:56:29.482134 36412 raft_server.cpp:780] pending_writes: 1
Robert
11:04 PM
Had to kill -9 to terminate it.
Oct 25, 2022 (12 months ago)
Kishore Nallan
01:51 AM
Robert
12:39 PM
Kishore Nallan
12:40 PM
We are entering a period of code freeze for the 0.24 branch. So 2-3 weeks, barring any new major bugs from existing features.
Robert
08:17 PM
Robert
08:28 PM
Oct 26, 2022 (12 months ago)
Kishore Nallan
12:54 AM
Oct 27, 2022 (12 months ago)
Robert
07:29 PM
Robert
07:30 PM
1. The instance was serving search traffic.
2. I stopped sending search traffic to it entirely.
3. I attempted to import documents (see the import sketch after this list).
4. The server gets stuck with 1 queued_writes and the documents don't import; it just hangs.
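A minimal sketch of the kind of bulk import call being attempted in step 3, assuming the API port and key from the launch command later in the thread and a hypothetical collection name and JSONL file; the endpoint and action parameter are the standard Typesense bulk-import API:

# Hypothetical collection "items" and file documents.jsonl; port/key follow the launch command below.
curl -s -X POST \
  -H "X-TYPESENSE-API-KEY: REDACTED" \
  "http://127.0.0.1:14101/collections/items/documents/import?action=upsert" \
  --data-binary @documents.jsonl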
Robert
07:30 PM
I20221027 15:19:08.324472 39561 batched_indexer.cpp:264] Running GC for aborted requests, req map size: 0
I20221027 15:19:16.795174 39560 raft_server.cpp:540] Term: 5, last_index index: 1028357, committed_index: 1028357, known_applied_index: 1028357, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 309744913
I20221027 15:19:16.795248 39850 raft_server.h:60] Peer refresh succeeded!
I20221027 15:19:26.795794 39560 raft_server.cpp:540] Term: 5, last_index index: 1028357, committed_index: 1028357, known_applied_index: 1028357, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 309744913
I20221027 15:19:26.795842 39811 raft_server.h:60] Peer refresh succeeded!
I20221027 15:19:36.796442 39560 raft_server.cpp:540] Term: 5, last_index index: 1028357, committed_index: 1028357, known_applied_index: 1028357, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 309744913
I20221027 15:19:36.796504 39831 raft_server.h:60] Peer refresh succeeded!
I20221027 15:19:46.797086 39560 raft_server.cpp:540] Term: 5, last_index index: 1028357, committed_index: 1028357, known_applied_index: 1028357, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 309744913
I20221027 15:19:46.797152 39740 raft_server.h:60] Peer refresh succeeded!
I20221027 15:19:56.797785 39560 raft_server.cpp:540] Term: 5, last_index index: 1028357, committed_index: 1028357, known_applied_index: 1028357, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 309744913
I20221027 15:19:56.797845 39846 raft_server.h:60] Peer refresh succeeded!
I20221027 15:20:06.798342 39560 raft_server.cpp:540] Term: 5, last_index index: 1028357, committed_index: 1028357, known_applied_index: 1028357, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 309744913
I20221027 15:20:06.798398 39804 raft_server.h:60] Peer refresh succeeded!
I20221027 15:20:09.327406 39561 batched_indexer.cpp:264] Running GC for aborted requests, req map size: 0
I20221027 15:20:16.798827 39560 raft_server.cpp:540] Term: 5, last_index index: 1028357, committed_index: 1028357, known_applied_index: 1028357, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 309744913
I20221027 15:20:16.798883 39757 raft_server.h:60] Peer refresh succeeded!
I20221027 15:20:26.799475 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:20:26.799559 39788 raft_server.h:60] Peer refresh succeeded!
I20221027 15:20:36.800021 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:20:36.800071 39653 raft_server.h:60] Peer refresh succeeded!
I20221027 15:20:46.800534 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:20:46.800592 39588 raft_server.h:60] Peer refresh succeeded!
I20221027 15:20:56.801030 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:20:56.801095 39781 raft_server.h:60] Peer refresh succeeded!
I20221027 15:21:06.801538 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:21:06.801602 39657 raft_server.h:60] Peer refresh succeeded!
I20221027 15:21:10.330410 39561 batched_indexer.cpp:264] Running GC for aborted requests, req map size: 1
I20221027 15:21:16.802002 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:21:16.802057 39735 raft_server.h:60] Peer refresh succeeded!
I20221027 15:21:26.802635 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:21:26.802727 39691 raft_server.h:60] Peer refresh succeeded!
I20221027 15:21:36.803251 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:21:36.803301 39681 raft_server.h:60] Peer refresh succeeded!
I20221027 15:21:46.803797 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:21:46.803884 39851 raft_server.h:60] Peer refresh succeeded!
I20221027 15:21:56.804364 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:21:56.804412 39655 raft_server.h:60] Peer refresh succeeded!
I20221027 15:22:06.804944 39560 raft_server.cpp:540] Term: 5, last_index index: 1028358, committed_index: 1028358, known_applied_index: 1028358, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 309745014
I20221027 15:22:06.804996 39659 raft_server.h:60] Peer refresh succeeded!
I20221027 15:22:11.333583 39561 batched_indexer.cpp:264] Running GC for aborted requests, req map size: 1
Robert
07:30 PM
queued_writes: 1 repeats forever; it's been about an hour so far. It's as if, after Typesense just serves searches for hours and hours and then stops getting search traffic, it sometimes doesn't want to accept imports?
Server launched with:
typesense-server --data-dir=/mnt/data/typesense0 --api-port=14101 --peering-port=14100 --api-key=REDACTED --api-address=127.0.0.1 --peering-address=127.0.0.1 --snapshot-interval-seconds=86400 --num-documents-parallel-load=10000 --thread-pool-size=400
I imagine, like previous times, if I bring down the database and bring it back up, I'll be able to import the documents just fine. Sadly that is roughly a 3-hour process, so it's something I'd like to avoid as much as possible. Also, since this happens sporadically, I can't provide a test case.
Any idea how we can troubleshoot this further and figure out the cause/solution?
Is there additional logging that can be enabled that will further explain what write is queued or why it's not progressing?
Or any other way to get additional information about its current 'stuck' state?
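One way to capture the 'stuck' state without restarting is to poll the node's monitoring endpoints while it hangs. A minimal sketch, assuming the API port and key from the launch command above; /health, /stats.json, and /metrics.json are the standard Typesense status endpoints:

# Append a snapshot of the node's status every 60 seconds while it is stuck.
while true; do
  date
  curl -s "http://127.0.0.1:14101/health"; echo
  curl -s -H "X-TYPESENSE-API-KEY: REDACTED" "http://127.0.0.1:14101/stats.json"; echo
  curl -s -H "X-TYPESENSE-API-KEY: REDACTED" "http://127.0.0.1:14101/metrics.json"; echo
  sleep 60
done >> typesense-stuck-state.log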
Oct 28, 2022 (12 months ago)
Kishore Nallan
12:53 AM
Kishore Nallan
12:54 AM
Kishore Nallan
12:56 AM
Robert
02:28 AM
Funny enough though, my secondary Typesense instance just froze up a few moments ago upon trying to import documents.
It locked up on the very first document it was attempting to import, so it didn't even get one in.
Calling
curl http://127.0.0.1:8108/stats.json
yields:
{
"delete_latency_ms": 0,
"delete_requests_per_second": 0,
"import_latency_ms": 0,
"import_requests_per_second": 0,
"latency_ms": {},
"pending_write_batches": 1,
"requests_per_second": {},
"search_latency_ms": 0,
"search_requests_per_second": 0,
"total_requests_per_second": 0.0,
"write_latency_ms": 0,
"write_requests_per_second": 0
}
Again, this instance was just doing searches. I diverted search traffic away, waited 2 minutes for searches to finish up and then attempted the import.
Robert
02:38 AM
Robert
02:55 AM
Robert
02:35 PM
Oct 29, 2022 (12 months ago)
Kishore Nallan
01:47 AM
Robert
02:08 AM
Robert
02:17 PM
Kishore Nallan
02:19 PM
Robert
03:00 PM
Oct 30, 2022 (11 months ago)
Kishore Nallan
11:22 AM
typesense/typesense:0.24.0.rcn25-log1
1. For reads, I log the request_id, query/body at the start and the request_id at the end again to ensure that the request is not blocked anywhere.
2. For writes, I log just before the write lock is acquired and after it is released.
This should give us an insight into where it's getting stuck.
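A minimal sketch of how that debug build could be run, assuming Docker and the same data directory, port, and key as the launch command earlier in the thread; only the image tag is taken from this message, everything else is illustrative:

# Run the rcn25-log1 debug build against the existing data directory (illustrative flags).
docker run -d --name typesense-log1 \
  -p 14101:14101 \
  -v /mnt/data/typesense0:/data \
  typesense/typesense:0.24.0.rcn25-log1 \
  --data-dir=/data --api-port=14101 --api-key=REDACTED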
Robert
11:55 AM
Nov 03, 2022 (11 months ago)
Robert
12:00 PM
On the log1 version. Haven't had any lockups since then yet. I'll update again once I have actual lockup logs to share.
Kishore Nallan
12:01 PM
Nov 12, 2022 (11 months ago)
Robert
03:16 AM
Kishore Nallan
03:28 AM
Nov 13, 2022 (11 months ago)
Robert
02:51 AM
Query 1 was due to setting split_join_tokens: "always"; changing that to "fallback" fixed the issue. I'm not sure if you'd consider that a bug or just a 'risk' with certain datasets. If you do consider that a bug and want me to submit an issue with a test case, let me know. If you consider it just a risk with this setting, I'd suggest noting in the documentation for split_join_tokens that using "always" can sometimes lead to runaway searches.
Query 2 is a much more normal query that I'm still working on bisecting down to a smaller dataset for a GitHub issue.
Kishore Nallan
02:54 AM
Robert
03:13 AM
q: "amanda", with the full query being:
{"query_by":"content,ext,filename,title","facet_by":"itemid,catid,genreid,family,formatid","max_facet_values":300,"page":1,"per_page":100,"include_fields":"filename,title,size,duration,ts,family,formatid,href,contentURL,subCount,width,height,ext,extra,label,itemid","highlight_fields":"none","highlight_affix_num_tokens":0,"search_cutoff_ms":40000,"infix":"off","prioritize_exact_match":true,"use_cache":true,"cache_ttl":3600,"sort_by":"_text_match:desc","prefix":true,"num_typos":2,"typo_tokens_threshold":1,"drop_tokens_threshold":1,"split_join_tokens":"always","q":"amanda","filter_by":"family:[image]"}
Setting num_typos to 0 and both token thresholds to 0 didn't do anything, but as soon as I changed split_join_tokens to fallback, I got a near-instant result.
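The workaround as a standalone search call, sketched with a hypothetical collection name ("items") and the standard single-search endpoint; the parameters are taken from the query above with only split_join_tokens changed to fallback:

# Hypothetical collection "items"; only split_join_tokens differs from the original query.
curl -s -G \
  -H "X-TYPESENSE-API-KEY: REDACTED" \
  "http://127.0.0.1:14101/collections/items/documents/search" \
  --data-urlencode "q=amanda" \
  --data-urlencode "query_by=content,ext,filename,title" \
  --data-urlencode "filter_by=family:[image]" \
  --data-urlencode "num_typos=2" \
  --data-urlencode "search_cutoff_ms=40000" \
  --data-urlencode "split_join_tokens=fallback"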
Robert
03:18 AM
a, man, and, da, etc. It would be cool if split_join_tokens was split into two options, split_tokens and join_tokens, as I'd probably want join_tokens on whenever doing fuzzy matching, whereas I'd rarely want split_tokens on, especially since it can lead to runaway situations such as this one.
Robert
03:19 AM
Ideally search_cutoff_ms would interrupt the split_join_tokens operation, instead of it potentially running forever.
Kishore Nallan
03:21 AM
Robert
03:23 AM
Kishore Nallan
03:25 AM
Robert
03:26 AM
Robert
03:31 AM
Robert
04:37 PM
It covers the issue with split_join_tokens: "always" causing the search to hang forever. It's a lower-priority bug because there's a good workaround: setting split_join_tokens: "fallback".
Robert
04:37 PM
Robert
08:41 PM
It covers a forever-running search that doesn't seem to ever return. Full test data and reproduction steps are included, of course.
Nov 14, 2022 (11 months ago)
Kishore Nallan
12:54 PM
Nov 28, 2022 (10 months ago)
Kishore Nallan
02:25 PM
Robert
10:00 PM
Nov 30, 2022 (10 months ago)
Kishore Nallan
11:56 AM
typesense/typesense:0.24.0.rcn37 -- this was again because of the large offsets, which froze up the search in an unexpected manner. The other issue with phrase search still exists. I will be looking into that as well soon.
Dec 01, 2022 (10 months ago)
Robert
05:08 AM
Similar Threads
Addressing High CPU Usage in Typesense
Robert reported high CPU usage on Typesense, even after halting all incoming searches. Kishore Nallan suggested logging heavy queries and increasing thread count. The issue was resolved after Robert found and truncated unusually large documents in the database.
Large JSONL Documents Import Issue & Resolution
Suraj was having trouble loading large JSONL documents into Typesense server. After several discussions and attempts, it was discovered that the issue was due to data quality. Once the team extracted the data again, the upload process worked smoothly.
Troubleshooting Typesense 503 Errors and Usage Queries
Kevin encountered 503 errors using Typesense. Jason asked for logs and explained why 503s occur. They made recommendations to remedy the issue and resolved Kevin's import parameter confusion. Kevin was asked to open a GitHub issue about accepting booleans.
Resolving Server Stoppage Issues in Typesense Multi VM Cluster
gaurav faced issues with the Typesense server in a multi VM cluster, including automatic stoppage and errors. Kishore Nallan identified the lack of a quorum and suggested using three nodes. When the issue persisted, they advised running Typesense via `nohup` or `systemd` to prevent session closure from stopping the process.
Typesense Node Stuck in Segfault Loop After Stress Test
Adrian encountered a segfault loop when stress testing a Typesense cluster. Kishore Nallan recommended trying a newer RC build and suggested potential issues with hostname resolution.