# community-help
r
So we are having some issues with production Typesense: a 255GB dataset. It was running OK for a few hours under a good amount of traffic, but then Typesense started pegging all CPU cores at 100%. We brought search down and stopped all searches from coming in, so there are zero requests going into Typesense, but even after 5 minutes it's still pegged at max CPU. All searches are performed with a search_cutoff_ms of 2 minutes, yet after 5 minutes of no searches the DB is still hung at max CPU usage on all threads assigned to it (96 threads on a 120-core system in this case).
The Typesense log shows nothing unusual that I can see:
```
I20221020 10:50:15.058467 120348 raft_server.h:60] Peer refresh succeeded!
I20221020 10:50:17.698705 120214 batched_indexer.cpp:250] Running GC for aborted requests, req map size: 0
I20221020 10:50:25.059042 120213 raft_server.cpp:534] Term: 6, last_index index: 976479, committed_index: 976479, known_applied_index: 976479, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 280739818
I20221020 10:50:25.059087 120292 raft_server.h:60] Peer refresh succeeded!
I20221020 10:50:35.059688 120213 raft_server.cpp:534] Term: 6, last_index index: 976479, committed_index: 976479, known_applied_index: 976479, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 280739818
I20221020 10:50:35.059742 120355 raft_server.h:60] Peer refresh succeeded!
I20221020 10:50:45.060204 120213 raft_server.cpp:534] Term: 6, last_index index: 976479, committed_index: 976479, known_applied_index: 976479, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 280739818
I20221020 10:50:45.060258 120290 raft_server.h:60] Peer refresh succeeded!
I20221020 10:50:55.060740 120213 raft_server.cpp:534] Term: 6, last_index index: 976479, committed_index: 976479, known_applied_index: 976479, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 280739818
I20221020 10:50:55.060806 120390 raft_server.h:60] Peer refresh succeeded!
I20221020 10:51:05.061237 120213 raft_server.cpp:534] Term: 6, last_index index: 976479, committed_index: 976479, known_applied_index: 976479, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 280739818
I20221020 10:51:05.061286 120313 raft_server.h:60] Peer refresh succeeded!
I20221020 10:51:15.061736 120213 raft_server.cpp:534] Term: 6, last_index index: 976479, committed_index: 976479, known_applied_index: 976479, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 280739818
I20221020 10:51:15.061787 120357 raft_server.h:60] Peer refresh succeeded!
I20221020 10:51:18.702039 120214 batched_indexer.cpp:250] Running GC for aborted requests, req map size: 0
```
k
There are probably one or more heavy queries locking up the CPU. I'd recommend starting with a much lower search_cutoff_ms value and then enabling the slow request log to identify the heavy queries.
What is the average number of search requests per second you are sending here?
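A minimal sketch of what that could look like over the HTTP API in Python (the collection, field, and API key below are placeholders, and the /config call assumes a Typesense version that supports log-slow-requests-time-ms):

```python
import requests

TYPESENSE_HOST = "http://localhost:8108"
HEADERS = {"X-TYPESENSE-API-KEY": "xyz"}  # placeholder key

# Log any request slower than 2s to the server log so heavy queries show up
# (assumes the log-slow-requests-time-ms config option is available in this version).
requests.post(f"{TYPESENSE_HOST}/config", headers=HEADERS,
              json={"log-slow-requests-time-ms": 2000})

# Search with a much lower cutoff: give up after ~10s instead of 2 minutes.
# "products" and "title" are illustrative collection/field names.
resp = requests.get(
    f"{TYPESENSE_HOST}/collections/products/documents/search",
    headers=HEADERS,
    params={"q": "shoes", "query_by": "title", "search_cutoff_ms": 10000},
)
body = resp.json()
print(body.get("search_time_ms"), body.get("found"))
```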
r
Good question. Unless Typesense already tracks that, I should start tracking it myself. I'd say maybe 5 to 20 per second.
If I do identify heavy queries, there really isn't any way to avoid them, right?
I'm thinking I may need to set up a cluster of machines as search traffic may just be too darned high 🙂
k
Some heavy queries are a function of data shape. For example, a stop word will pull a lot of results.
r
What worries me the most is that I stopped all incoming searches a while ago and Typesense is still grinding away at max CPU. I figured that once it finished the slow searches, CPU usage would come back down.
I'll bring it back up on a fresh Typesense instance with a much lower search cutoff, maybe 10 seconds?
k
There are some operations that can't be controlled by the cutoff. For example, if a large sort is triggered somewhere, that's outside the purview of the cutoff logic.
This is very rarely an issue, but n log n grows rapidly at the larger end, so it might be a factor.
r
Is there any internal logging that can be enabled to determine what the Typesense server is doing? Some data on why it's stuck?
Maybe the search traffic is just too high for 96 threads and I just need a ton more cores and systems...
k
No, we don't have a way to instrument what's happening, unless you can use strace or something to capture a trace.
The first step is to log the queries and then replay them yourself to get a feel for what's happening.
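As a rough sketch of such a replay (assuming, purely for illustration, a queries.log file with one recorded query string per line and a collection named products):

```python
import time
import requests

TYPESENSE_HOST = "http://localhost:8108"
HEADERS = {"X-TYPESENSE-API-KEY": "xyz"}  # placeholder key

# queries.log is assumed to hold one recorded query string per line,
# e.g. "q=red shoes&query_by=title" -- adjust the parsing to your own log format.
timings = []
with open("queries.log") as log:
    for line in log:
        line = line.strip()
        if not line:
            continue
        params = dict(p.split("=", 1) for p in line.split("&") if "=" in p)
        start = time.monotonic()
        requests.get(f"{TYPESENSE_HOST}/collections/products/documents/search",
                     headers=HEADERS, params=params)
        timings.append(((time.monotonic() - start) * 1000, line))

# Show the ten heaviest queries from the replay.
for ms, query in sorted(timings, reverse=True)[:10]:
    print(f"{ms:8.1f} ms  {query}")
```

Replaying serially like this also gives a rough requests-per-second baseline to compare against production traffic.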
r
OK. I noticed that the default thread-pool-size is NUM_CORES * 8, but I've been setting it to just 96 threads (on a 120-core system). Wouldn't the default have caused the entire system to lock up in this particular scenario?
I'll add some instrumentation to better log the queries and get an idea of queries per second. I'll also drop the search cutoff way down.
k
The thread pool can be much larger than the total cores available, because a lot of threads can block on I/O, etc.
Unless we know the baseline performance, it's difficult to know whether it's a bug, a heavy query, or just not enough hardware.
r
So on a 120-core system, would you recommend setting something like a 600+ thread count?
k
Yeah, just leaving it at the default should be fine, though I don't think that's the root cause here. It's either heavy queries or high concurrency.
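For reference, a hypothetical launch sketch with the pool size spelled out explicitly (the path, key, and binary location are illustrative; leaving --thread-pool-size unset keeps the NUM_CORES * 8 default):

```python
import os
import subprocess

cores = os.cpu_count() or 1

# Start the server with an explicit thread pool of cores * 8 -- the same value
# you would get by simply omitting --thread-pool-size.
subprocess.run([
    "typesense-server",                  # assumes the binary is on PATH
    "--data-dir", "/var/lib/typesense",  # illustrative path
    "--api-key", "xyz",                  # placeholder key
    "--thread-pool-size", str(cores * 8),
])
```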
r
By the way, THANK YOU for being so responsive to queries. We just launched this website a few days ago, so we are running around like crazy trying to get search nice and stable.
OK, I'll bump the thread count up a lot, reduce the search cutoff dramatically, get a proper log of all searches being performed, and get a queries-per-second measurement. Then we'll see where things stand. Thanks again!
k
Welcome! Once you have a query log, you can kinda run your own benchmark behind the scenes as well.
r
One more point of info: doing a Ctrl-C to close the server yields:
```
^CI20221020 11:11:19.171216 120015 typesense_server_utils.cpp:47] Stopping Typesense server...
I20221020 11:11:20.134568 120213 typesense_server_utils.cpp:335] Typesense peering service is going to quit.
I20221020 11:11:20.134603 120213 raft_server.cpp:774] Set shutting_down = true
I20221020 11:11:20.134610 120213 raft_server.cpp:778] Waiting for in-flight writes to finish...
I20221020 11:11:20.134615 120213 raft_server.cpp:784] Replication state shutdown, store sequence: 280739818
I20221020 11:11:20.134624 120213 raft_server.cpp:788] node->shutdown
I20221020 11:11:20.134630 120213 node.cpp:929] node default_group:127.0.0.1:21807:21808 shutdown, current_term 6 state LEADER
I20221020 11:11:20.134748 120213 replicator.cpp:1475] Group default_group Fail to find the next candidate
I20221020 11:11:20.134775 120213 raft_server.cpp:792] node->join
I20221020 11:11:20.134771 120290 raft_server.h:287] Node stepped down : Raft node is going to quit.
I20221020 11:11:20.134802 120290 raft_server.h:291] This node is down
I20221020 11:11:20.134822 120213 node.cpp:929] node default_group:127.0.0.1:21807:21808 shutdown, current_term 6 state SHUTDOWN
I20221020 11:11:20.176208 120213 typesense_server_utils.cpp:340] raft_server.stop()
I20221020 11:11:20.176235 120213 server.cpp:1100] Server[braft::RaftStatImpl+braft::FileServiceImpl+braft::RaftServiceImpl+braft::CliServiceImpl] is going to quit
I20221020 11:11:20.176502 120213 typesense_server_utils.cpp:343] raft_server.join()
I20221020 11:11:20.176612 120213 typesense_server_utils.cpp:346] Typesense peering service has quit.
I20221020 11:11:20.177171 120213 typesense_server_utils.cpp:469] Shutting down batch indexer...
I20221020 11:11:20.177182 120213 typesense_server_utils.cpp:472] Waiting for batch indexing thread to be done...
I20221020 11:11:20.763222 120214 batched_indexer.cpp:282] Notifying batch indexer threads about shutdown...
I20221020 11:11:20.763541 120214 batched_indexer.cpp:290] Batched indexer threadpool shutdown...
I20221020 11:11:20.766474 120213 typesense_server_utils.cpp:475] Shutting down server_thread_pool
```
But it's just stuck now... has been for a while, still at the same 100% max CPU usage. Is it basically stuck? Does it not force-close its thread pool? Should I kill -9 it?
k
Yes, kill -9 it.
If some thread is in a non-interruptible loop, then yes, it can't be interrupted to shut down cleanly. If we could interrupt it, the cutoff would have kicked in as well.
r
So even with a 20s max search time, I was still seeing Typesense lock up some CPUs at 100% for days on end. So I double-checked the data we index into Typesense, to make sure we didn't have any sort of HUGE document that Typesense might be choking on... and sure enough, I found a handful of documents that were 400MB+ in size. My hunch is that every so often a particular search would match these huge docs and Typesense would basically get stuck processing them. So I truncated this data to a MAX of 5MB and re-indexed a fresh database. It's been running now for an hour or so and it's looking pretty good.
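A minimal sketch of that kind of pre-index truncation (the field handling, file name, and collection name are illustrative; it assumes the documents are re-imported as JSONL through the import endpoint):

```python
import json
import requests

TYPESENSE_HOST = "http://localhost:8108"
HEADERS = {"X-TYPESENSE-API-KEY": "xyz"}   # placeholder key
MAX_FIELD_BYTES = 5 * 1024 * 1024          # cap any single text field at ~5MB

def truncate_large_fields(doc):
    # Apply the cap to whichever string fields can blow up in size.
    for key, value in doc.items():
        if isinstance(value, str) and len(value.encode("utf-8")) > MAX_FIELD_BYTES:
            doc[key] = value.encode("utf-8")[:MAX_FIELD_BYTES].decode("utf-8", "ignore")
    return doc

# Re-import the capped documents in JSONL form.
with open("docs.jsonl") as src:
    payload = "\n".join(
        json.dumps(truncate_large_fields(json.loads(line))) for line in src if line.strip()
    )

resp = requests.post(
    f"{TYPESENSE_HOST}/collections/products/documents/import",
    headers=HEADERS, params={"action": "upsert"}, data=payload.encode("utf-8"),
)
print(resp.status_code)
```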
I now have a detailed search log, with timestamps, and I recorded when I started this instance. My plan is to monitor this new fresh instance, and if I see any threads lock up to 100% and never release, record when that happened. I'll then work on the side to reproduce it and narrow down the search entries until I find the offending search. So if this happens again, I should be able to provide you with the data and the search that causes it via a GitHub issue. Hopefully it won't even come to that 🙂
Thanks AGAIN for creating Typesense and being so awesome here on Slack!
k
Ah, large documents like that can certainly cause issues. Each word's offset in a document is stored in memory, and this is used during both search and highlighting. For such large docs, things can go unbounded. We haven't actually tried docs that large, so it's hard to say exactly what went wrong, but N log N algorithms get "large" very fast.
r
Just wanted to give you an update. Typesense has been ROCK SOLID for the past day. It's working perfectly, with no runaway CPU usage. So it almost certainly was the huge documents that caused the problem, and no bug report will be incoming. Thanks again!
k
Glad to hear and thanks for the update, Robert.