# community-help
a
After this issue https://typesense-community.slack.com/archives/C01P749MET0/p1745807284729069 we seem to have some data that is out of sync and I’m unsure why (or how to sort the issue beyond what I’ve tried). Steps I’ve done:
1. Removed the TS server from our target group to prevent traffic hitting it
2. Stopped the TS process
3. Deleted all data in the /var/lib/typesense directory
4. Restarted the server
After this, the TS server in question has more documents than the other 2 nodes we have, and I’m not sure how that happens. We’re also having an issue with a collection returning blank data (we’re debugging this more now). How can the restored data have more records than the nodes it restored from? When we export the collection from each node, the node1 data file does have 3 extra JSON documents in it which aren’t in the other 2 nodes. Have I missed something in the data delete process? Thanks!
These servers are on v28 (release version)
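For reference, the reset steps above boil down to roughly this (a sketch assuming the stock DEB/systemd install with the data dir at /var/lib/typesense; paths may differ on other setups):
Copy code
# node already removed from the LB target group so it receives no traffic
sudo systemctl stop typesense-server      # stop the TS process
sudo rm -rf /var/lib/typesense/*          # delete all local data
sudo systemctl start typesense-server     # restart; the node re-syncs from the cluster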
m
To add more context - this simplified “filter by” query returns no results on the recovered node1, but does return results on the other two:
Copy code
curl "${TYPESENSE_BASE_URL}/collections/products/documents/search?q=*&filter_by=categoryids:=11" \
-H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"
Node1:
Copy code
{
    "facet_counts": [],
    "found": 0,
    "hits": [],
    "out_of": 2190,
...
Other nodes:
Copy code
"facet_counts": [],
  "found": 60,
  "hits": [...],
  "out_of": 2187,
...
ps. other filter_by queries do return some results on node1, but the counts don’t match
j
Are you using JOINs by any chance?
m
we do for other collections/queries, but not in this one
j
Is this collection referenced by another collection though?
Even if it doesn't have a reference field inside itself
m
yep, an embedding collection referencing this one
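For context, the reference looks roughly like this, a simplified sketch with illustrative collection/field names rather than our exact schema:
Copy code
curl "${TYPESENSE_BASE_URL}/collections" -X POST \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
        "name": "product_embeddings",
        "fields": [
          {"name": "embedding", "type": "float[]", "num_dim": 384},
          {"name": "product_id", "type": "string", "reference": "products.id"}
        ]
      }'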
j
This sounds similar to an issue we fixed in recent RC builds. So you might want to try upgrading to it. That said, there's one known issue with the recent RC build, where large multi_search requests end up causing a crash, which we're still fixing
m
Right, good to know! So what is the preferred action here, to wait for the crash fix and then upgrade? We’ve currently taken node1 out of the LB so it doesn’t get traffic. But we’re also careful with rolling RCs to prod.
j
The multi_search issue doesn't seem to affect all users though, only select data shapes. So you might want to try upgrading 1 out of 3 nodes to the new RC build to see if the issue occurs with your dataset, after letting it run for 24-48 hours, then upgrade the next node, and so on
m
ok great, we’ll try that - thanks for your quick reply!
👍 1
a
Thanks @Jason Bosco should we upgrade the server that is returning no results (it’s a follower) if we’re picking one?
j
Yeah, I'd recommend a follower
a
Thanks, and do we delete the data dir as part of this or let it use the data there?
j
You should be able to re-use the existing data dir
a
Thanks, we did this and put the server back into the LB .. but it crashed, so guessing our data is a use case for the crash you are fixing 🙂
k
Can you please share the stack trace from the crash?
a
Actually looks like we didn’t log one, the server just logged as below and restarted + loaded indexes again.
Copy code
I20250429 01:34:54.736620  2764 raft_server.cpp:692] Term: 47, pending_queue: 0, last_index: 32417417, committed: 32417417, known_applied: 32417417, applying: 0, pending_writes: 0, queued_writes: 0, local_sequence: 345957379
Log file created at: 2025/04/29 01:36:52
Running on machine: prod-typesense1
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
I20250429 01:36:52.914856  3052 typesense_server_utils.cpp:353] Starting Typesense 29.0.rc15
I20250429 01:36:52.914935  3052 typesense_server_utils.cpp:356] Typesense is using jemalloc.
I20250429 01:36:52.915086  3052 typesense_server_utils.cpp:418] Thread pool size: 16
I20250429 01:36:52.916780  3052 store.cpp:40] Initializing DB by opening state dir: /var/lib/typesense/db
I20250429 01:36:52.932957  3052 store.cpp:40] Initializing DB by opening state dir: /var/lib/typesense/meta
I20250429 01:36:52.939391  3052 store.cpp:40] Initializing DB by opening state dir: /var/lib/typesense/analytics/db
I20250429 01:36:52.946942  3052 ratelimit_manager.cpp:546] Loaded 0 rate limit rules.
I20250429 01:36:52.946966  3052 ratelimit_manager.cpp:547] Loaded 0 rate limit bans.
I20250429 01:36:52.947021  3052 typesense_server_utils.cpp:585] Starting API service...
I20250429 01:36:52.947180  3052 http_server.cpp:180] Typesense has started listening on port 8108
I20250429 01:36:52.947865  3177 batched_indexer.cpp:196] Starting batch indexer with 16 threads.
I20250429 01:36:52.947966  3179 typesense_server_utils.cpp:528] Conversation garbage collector thread started.
I20250429 01:36:52.948277  3177 batched_indexer.cpp:201] BatchedIndexer skip_index: -9999
I20250429 01:36:52.953171  3176 server.cpp:1181] Server[braft::RaftStatImpl+braft::FileServiceImpl+braft::RaftServiceImpl+braft::CliServiceImpl] is serving on port=8107.
I20250429 01:36:52.953197  3176 server.cpp:1184] Check out <http://prod-typesense1:8107> in web browser.
I20250429 01:36:52.954855  3176 raft_server.cpp:75] Nodes configuration: 10.50.3.101:8107:8108,10.50.4.82:8107:8108,10.50.5.218:8107:8108
I20250429 01:36:52.955206  3176 log.cpp:690] Use murmurhash32 as the checksum type of appending entries
I20250429 01:36:52.955240  3176 log.cpp:1172] log load_meta /var/lib/typesense/state/log/log_meta first_log_index: 32409936 time: 25
I20250429 01:36:52.955260  3176 log.cpp:1014] restore closed segment, path: /var/lib/typesense/state/log first_index: 32412141 last_index: 32413175
I20250429 01:36:52.955269  3176 log.cpp:1014] restore closed segment, path: /var/lib/typesense/state/log first_index: 32415023 last_index: 32415991
I20250429 01:36:52.955276  3176 log.cpp:1014] restore closed segment, path: /var/lib/typesense/state/log first_index: 32410512 last_index: 32411010
I20250429 01:36:52.955281  3176 log.cpp:1014] restore closed segment, path: /var/lib/typesense/state/log first_index: 32416869 last_index: 32416875
I20250429 01:36:52.955284  3176 log.cpp:1014] restore closed segment, path: /var/lib/typesense/state/log first_index: 32411011 last_index: 32411431
I20250429 01:36:52.955307  3176 log.cpp:1014] restore closed segment, path: /var/lib/typesense/state/log first_index: 32409741 last_index: 32410511
I20250429 01:36:52.955314  3176 log.cpp:1014] restore closed segment, path: /var/lib/typesense/state/log first_index: 32413176 last_index: 32414250
I20250429 01:36:52.955319  3176 log.cpp:1014] restore closed segment, path: /var/lib/typesense/state/log first_index: 32411432 last_index: 32412140
I20250429 01:36:52.955324  3176 log.cpp:1014] restore closed segment, path: /var/lib/typesense/state/log first_index: 32414251 last_index: 32415022
I20250429 01:36:52.955329  3176 log.cpp:1014] restore closed segment, path: /var/lib/typesense/state/log first_index: 32416876 last_index: 32417138
I20250429 01:36:52.955336  3176 log.cpp:1014] restore closed segment, path: /var/lib/typesense/state/log first_index: 32415992 last_index: 32416868
I20250429 01:36:52.955343  3176 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 32409741 last_index: 32410511
I20250429 01:36:52.956046  3176 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 32410512 last_index: 32411010
I20250429 01:36:52.956537  3176 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 32411011 last_index: 32411431
I20250429 01:36:52.956930  3176 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 32411432 last_index: 32412140
I20250429 01:36:52.957600  3176 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 32412141 last_index: 32413175
I20250429 01:36:52.958635  3176 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 32413176 last_index: 32414250
I20250429 01:36:52.959641  3176 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 32414251 last_index: 32415022
I20250429 01:36:52.960390  3176 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 32415023 last_index: 32415991
I20250429 01:36:52.961262  3176 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 32415992 last_index: 32416868
I20250429 01:36:52.962060  3176 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 32416869 last_index: 32416875
I20250429 01:36:52.962085  3176 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 32416876 last_index: 32417138
k
Are you able to reproduce this trivially locally or on a staging env?
a
we can probably try to query the server directly and see if we can make it crash
k
Yeah that will be very useful. With an easy reproduction we can quickly identify and fix and affirm the fix also.
a
All good, working on it - we’ve enabled search logging and can see the query logged right before the crash so we’ll try and replicate this with a shell command and see if we can force it to crash. If we do I’ll compress the data dir and share it with the search we run 🙂
OOI does a search get logged before it runs, so the last search in the logs is likely the issue?
k
--enable-search-logging
logs the query before the search request cycle, so one of the last few queries logged before the crash should be the culprit (if you have high concurrency)
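If useful, a sketch of enabling it, assuming the DEB install’s config file at /etc/typesense/typesense-server.ini:
Copy code
# add under the [server] section of /etc/typesense/typesense-server.ini
# (equivalent to passing --enable-search-logging=true on the command line)
enable-search-logging = true

# then restart for it to take effect
sudo systemctl restart typesense-server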
a
perfect, we’ll try and replicate the crash from these 🙂
👍 1
Nothing useful yet on this - we’ve had a couple of crashes but can’t find a pattern in the queries
Syslog is showing this error when we’re crashing. Where does TS write its core dump (and can we configure something to catch it?)
Copy code
Apr 29 05:10:20 prod-typesense1 typesense-server[5343]: typesense-server: /root/.cache/bazel/_bazel_root/45fa99e3ef9c0d8b8591cceacfe70dd5/sandbox/linux-sandbox/585/execroot/__main__/external/elfutils/libdwfl/derelocate.c:178: cache_sections: Assertion `refs == NULL' failed.
Apr 29 05:10:23 prod-typesense1 systemd[1]: typesense-server.service: Main process exited, code=dumped, status=11/SEGV
Apr 29 05:10:23 prod-typesense1 systemd[1]: typesense-server.service: Failed with result 'core-dump'.
Apr 29 05:10:23 prod-typesense1 systemd[1]: typesense-server.service: Consumed 2min 2.186s CPU time.
Apr 29 05:10:23 prod-typesense1 systemd[1]: typesense-server.service: Scheduled restart job, restart counter is at 2.
Apr 29 05:10:23 prod-typesense1 systemd[1]: typesense-server.service: Consumed 2min 2.186s CPU time.
Apr 29 05:10:23 prod-typesense1 typesense-server[5699]: Log directory is configured as: /var/log/typesense
k
Core dump must be configured. We don't really set it up as part of DEB.
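One way to set this up is roughly the following, assuming systemd-coredump is handling dumps on this distro (which the later journal entries suggest):
Copy code
# allow the service to produce unlimited-size cores
sudo systemctl edit typesense-server
#   [Service]
#   LimitCORE=infinity
sudo systemctl daemon-reload
sudo systemctl restart typesense-server

# if dumps later show up as "inaccessible" or truncated, raise the limits in
# /etc/systemd/coredump.conf (ProcessSizeMax, ExternalSizeMax)

# inspect / extract captured dumps
coredumpctl list
coredumpctl dump /usr/bin/typesense-server -o /tmp/typesense.core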
a
ok, all good - we just had another crash. I’ll see if I can find a common query
👍 1
We’ll try and run some of these queries on our data tomorrow morning and see if we can force a crash. ChatGPT found a bunch of common queries running just before the crashes we saw (attached in case it helps)
k
Ok 👍
a
We’ve run all these queries and can’t replicate the crash with single queries. Is it a concurrency issue? I’m not sure how we’d simulate a lot of traffic aside from production traffic
k
We will have to get a core dump otherwise. We've never had an issue like this which has caused a crash without a stack trace.. the other option is to downgrade 29.0 RC builds down to a version where we can pinpoint which version causes the crash.
Nothing in /var/log/syslog either?
We also see this in syslog. I think the times correlate to the crashes from yesterday but we would need to re-enable and confirm
Copy code
Apr 28 23:53:01 prod-typesense1 typesense-server[482]: E20250428 23:53:01.807976   679 core_api.cpp:633] JSON error: [json.exception.parse_error.101] parse error at line 1, column 1: attempting to parse an empty input; check that your input string or stream contains the expected JSON
Apr 28 23:54:02 prod-typesense1 typesense-server[482]: E20250428 23:54:02.357002   676 core_api.cpp:633] JSON error: [json.exception.parse_error.101] parse error at line 5, column 13: syntax error while parsing object key - unexpected '}'; expected string literal
Apr 29 04:39:38 prod-typesense1 typesense-server[3052]: E20250429 04:39:38.152671  3081 core_api.cpp:843] JSON error: [json.exception.parse_error.101] parse error at line 11, column 13: syntax error while parsing object key - unexpected '}'; expected string literal
Apr 29 04:39:41 prod-typesense1 typesense-server[3052]: E20250429 04:39:41.781852  3082 core_api.cpp:843] JSON error: [json.exception.parse_error.101] parse error at line 11, column 13: syntax error while parsing object key - unexpected '}'; expected string literal
Do you have any ideas on which version we should try as a start?
@Jason Bosco can you let us know which RC version this fix landed in? We’ll start there and can upgrade/downgrade if we see crashes. https://typesense-community.slack.com/archives/C01P749MET0/p1745885906042689?thread_ts=1745881111.998959&cid=C01P749MET0
k
We don't have an RC for that yet, I will be building one shortly. I will share the build when it's done.
a
ah ok, we had some more issues today and have hit a bit of a problem. The remaining servers had threadpool problems and are now not returning the full data sets from the indexes as far as we can tell. Trying to work out the best way forward:
1. We don’t want to upgrade to the latest RC as this has a crashing issue
2. If we stay on v28 it seems like we have data missing from results
k
This new build will have that multi search issue fixed. I will be able to get this published in a couple of hours.
a
the crash?
We’ve got a couple of servers returning blank data now so I think we need to get them on to the latest RC as we did with the server yesterday..
We’ve updated to the latest RC for now - this has sorted the blank responses we were seeing. We’ll update to your new build once confirmed. Thanks for your help
Ok, we’re between a rock and a hard place now. We need the RC for the join bug we were seeing with v28 .. but we’re seeing crashes on RC15 every 5-10 mins currently 💥 We’ll be looking forward to the new RC 😄
k
Will be available in about 45 mins. I've kicked off the deployment.
🙌 1
a
You’ll be my hero for the day, I’m currently babysitting the cluster and restarting things as needed 🔥
Looks like the build we need is there - ok to deploy?
k
Yes, you can go ahead
Hopefully this fixes the crash, if the multi-search body is indeed the root cause. Otherwise, we are back to square 1.
a
files might not be ready yet
k
Yeah the build is not yet over. It publishes stuff in sequence.
You need the ARM binary?
a
yep
getting them now
k
Yes it exists now:
Copy code
https://dl.typesense.org/releases/29.0.rc16/typesense-server-29.0.rc16-linux-arm64.tar.gz
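Swapping in the tarball binary is roughly this sketch (it assumes the tarball contains the typesense-server binary and that the service runs it from /usr/bin, as with the DEB install):
Copy code
wget https://dl.typesense.org/releases/29.0.rc16/typesense-server-29.0.rc16-linux-arm64.tar.gz
tar -xzf typesense-server-29.0.rc16-linux-arm64.tar.gz
sudo systemctl stop typesense-server
sudo cp ./typesense-server /usr/bin/typesense-server
sudo systemctl start typesense-server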
a
Thanks, we’ve just deployed this now and will monitor. We will need to think of a backup if this doesn’t work, as v28 has the join bug that returns blank results for us
Looks like we just had a crash, but all 3 servers crashed at exactly the same time
Logs don’t show stack trace
Copy code
I20250430 05:06:42.392635 17604 raft_server.cpp:692] Term: 69, pending_queue: 0, last_index: 32500782, committed: 32500782, known_applied: 32500782, applying: 0, pending_writes: 0, queued_writes: 0, local_sequence: 346964797
I20250430 05:06:48.530647 17605 batched_indexer.cpp:430] Running GC for aborted requests, req map size: 0
Log file created at: 2025/04/30 05:06:50
Running on machine: prod-typesense1
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
I20250430 05:06:50.916981 17773 typesense_server_utils.cpp:353] Starting Typesense 29.0.rc16
I20250430 05:06:50.917088 17773 typesense_server_utils.cpp:356] Typesense is using jemalloc.
I20250430 05:06:50.917270 17773 typesense_server_utils.cpp:418] Thread pool size: 32
k
Sigh. Let me see if I can create a new branch from v28 GA and pick only the join related bug fixes onto that.
a
I’ll see if we got anything in syslog
syslog shows
Copy code
Apr 30 05:06:50 prod-typesense1 typesense-server[17399]: typesense-server: /root/.cache/bazel/_bazel_root/45fa99e3ef9c0d8b8591cceacfe70dd5/sandbox/linux-sandbox/585/execroot/__main__/external/elfutils/libdwfl/derelocate.c:178: cache_sections: Assertion `refs == NULL' failed.
Apr 30 05:06:50 prod-typesense1 systemd[1]: typesense-server.service: Main process exited, code=killed, status=11/SEGV
Apr 30 05:06:50 prod-typesense1 systemd[1]: typesense-server.service: Failed with result 'signal'.
Apr 30 05:06:50 prod-typesense1 systemd[1]: typesense-server.service: Consumed 1min 28.032s CPU time.
Apr 30 05:06:50 prod-typesense1 systemd[1]: typesense-server.service: Scheduled restart job, restart counter is at 1.
Apr 30 05:06:50 prod-typesense1 systemd[1]: typesense-server.service: Consumed 1min 28.032s CPU time.
Apr 30 05:06:50 prod-typesense1 typesense-server[17773]: Log directory is configured as: /var/log/typesense
Apr 30 05:07:25 prod-typesense1 typesense-server[17773]: E20250430 05:07:25.754397 17986 raft_server.cpp:767] 5987 lagging entries > healthy read lag of 1000
Apr 30 05:07:25 prod-typesense1 typesense-server[17773]: E20250430 05:07:25.755008 17986 raft_server.cpp:779] 5987 lagging entries > healthy write lag of 500
Apr 30 05:07:30 prod-typesense1 typesense-server[17773]: E20250430 05:07:30.424016 17947 raft_server.cpp:779] 938 lagging entries > healthy write lag of 500
interestingly when they crash - the servers sometimes don’t recover and get caught in a loop with this message. The only fix I can find is killing the server, deleting the data and letting it reload
Copy code
I20250430 05:14:18.216732 18879 raft_server.h:294] Configuration of this group is 10.50.4.82:8107:8108,10.50.3.101:8107:8108,10.50.5.218:8107:8108
E20250430 05:14:22.778808 18839 raft_server.cpp:767] 1116 lagging entries > healthy read lag of 1000
E20250430 05:14:22.779194 18839 raft_server.cpp:779] 1116 lagging entries > healthy write lag of 500
I20250430 05:14:23.779407 18839 raft_server.cpp:692] Term: 70, pending_queue: 0, last_index: 32501193, committed: 32501193, known_applied: 32494796, applying: 32500077, pending_writes: 0, queued_writes: 137, local_sequence: 346946438
E20250430 05:14:31.779937 18839 raft_server.cpp:767] 1116 lagging entries > healthy read lag of 1000
E20250430 05:14:31.780007 18839 raft_server.cpp:779] 1116 lagging entries > healthy write lag of 500
I20250430 05:14:33.780211 18839 raft_server.cpp:692] Term: 70, pending_queue: 0, last_index: 32501193, committed: 32501193, known_applied: 32494796, applying: 32500077, pending_writes: 0, queued_writes: 137, local_sequence: 346946438
E20250430 05:14:40.780691 18839 raft_server.cpp:767] 1116 lagging entries > healthy read lag of 1000
E20250430 05:14:40.780755 18839 raft_server.cpp:779] 1116 lagging entries > healthy write lag of 500
I20250430 05:14:43.781015 18839 raft_server.cpp:692] Term: 70, pending_queue: 0, last_index: 32501193, committed: 32501193, known_applied: 32494796, applying: 32500077, pending_writes: 0, queued_writes: 137, local_sequence: 346946438
I20250430 05:14:44.237591 18840 batched_indexer.cpp:430] Running GC for aborted requests, req map size: 137
E20250430 05:14:49.781430 18839 raft_server.cpp:767] 1119 lagging entries > healthy read lag of 1000
E20250430 05:14:49.781500 18839 raft_server.cpp:779] 1119 lagging entries > healthy write lag of 500
The server also doesn’t seem to stop with
sudo systemctl stop typesense-server.service
- I end up killing it
k
We need a core dump. These logs are not helpful unfortunately for the root cause.
a
ok, I’ll check what I can do to configure one
k
Meanwhile I have cherry-picked a few commits on a new branch and will kick off a debug build in a few mins.
Only a few reference-related fixes, nothing else.
a
ok nice, thanks - I think references is what caused our results to be blank 🙌
k
Meanwhile can you try
29.0.rc10
-- this should have all 3 reference related fixes I have cherry-picked.
a
ok, we’ll try that now
cool - tested that on one node and that returns data for us so the fix is good up to here. We’ll deploy this over the other nodes and see if we get crashes.
I think I’ve also configured coredumps now
👍 1
k
If rc10 works then we could possibly try to bisect and identify the problematic build by trying rc13 and so on, on just one of the 3 nodes.
a
sounds good, we’re just deploying to the last node now
k
All good so far?
a
no, confirming this one is crashing as well
different crash though, it seems to hang and responses get slower and slower .. trying to work it out
k
Meanwhile the cherry-pick build is ready:
28.0.refpatch1
-- this contains only 3 reference related patches.
a
ok, nice thanks
we’ll do this now
Ok, we’ve deployed this now but we’re seeing the blank data again 😕 this is what the RC build fixed
k
Ok, we are left with no choice: either we get a core dump, or I need the data dir and some 10-15 queries which we can try running locally with some additional traces and tracking.
I can try running those queries concurrently locally
a
We’re just working out a plan to change the collections that are a problem and roll back to v28 stable
👍 1
k
If you are able to share the data and queries we will be able to try and reproduce the issue though.
a
We’ve just rolled back so hopefully it’s stable .. working on getting results (we might need to change the collection)
Let me compress the data dir now
shared 🙂
We’ve found that running v28 we can delete a document + re-merge it and it fixes our missing data issue in the short term so we’re working on this
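The short-term fix is essentially a delete followed by an upsert of the affected document, roughly like this (the document ID and body here are just illustrative):
Copy code
# remove the affected document
curl -X DELETE "${TYPESENSE_BASE_URL}/collections/products/documents/123" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"

# re-index it
curl -X POST "${TYPESENSE_BASE_URL}/collections/products/documents?action=upsert" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"id": "123", "name": "example product", "categoryids": [11]}'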
k
Got the attachment. Does it also have the queries?
a
no, we’ll need to grab some of these once we get some sites back with data 🙂
We’ve re-indexed and things are looking good for us now. So the crash must be pre v29.RC10
k
Yeah, looks like there are 2 different issues at play:
• a reference/join related bug which has been fixed in v29 RC, but
• another bug which causes the crashes
a
yep agreed. The reference/join bug wasn’t fixed in the special build you made so it must be a different commit.
rc10 crashed for us so that removes versions above this as a potential problem (hopefully)
v28 has been stable since we rolled back
k
Do you know if there were any active writes to the cluster during the crash?
a
most likely, they are pretty busy and sync frequently
k
One way to isolate the issue further:
a) Pause writes to the cluster
b) Upgrade only 1 node to the latest rc16
c) See if it still crashes: this way we know it's purely a search related issue
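While that single node runs the RC, its liveness and version can be checked with something like this (the node address is a placeholder):
Copy code
# basic liveness
curl "http://<node-ip>:8108/health"

# version and raft state (needs the API key)
curl "http://<node-ip>:8108/debug" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"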
a
We’ll struggle to pause writes in production .. we’ll have a think about how we can set up a replica cluster and simulate the traffic (I must not have set up coredumps properly as we didn’t collect any)
Are you seeing the same frequency of crashes in any of your cloud clusters?
k
One other customer reported issues, but they downgraded because they didn't really need the RC build urgently.
I've some clues to what's going wrong. Continuing to work on this, hope to fix this today.
a
Ok thanks, we’re unsure how to test it properly .. we might see if we can replicate traffic to a test cluster. Hard when dev is fine but prod crashes 🙂
Interestingly, we seem to be having a hang issue on v28 today (just one server) - it seems that the threadpool queue just grows over time and the server becomes unhealthy
Copy code
W20250501 04:43:46.288156  2426 threadpool.h:103] Threadpool exhaustion detected, task_queue_len: 464, thread_pool_len: 32
E20250501 04:43:47.029040  2629 raft_server.cpp:758] 1706 lagging entries > healthy read lag of 1000
E20250501 04:43:47.029093  2629 raft_server.cpp:770] 1706 lagging entries > healthy write lag of 500
W20250501 04:43:47.105521  2426 threadpool.h:103] Threadpool exhaustion detected, task_queue_len: 465, thread_pool_len: 32
W20250501 04:43:47.859311  2426 threadpool.h:103] Threadpool exhaustion detected, task_queue_len: 466, thread_pool_len: 32
I20250501 04:43:48.916607  2630 batched_indexer.cpp:422] Running GC for aborted requests, req map size: 12
I20250501 04:43:48.916646  2630 batched_indexer.cpp:430] Stuck req_key: 1746073095916987
I20250501 04:43:48.916652  2630 batched_indexer.cpp:430] Stuck req_key: 1746073096124772
I20250501 04:43:48.916656  2630 batched_indexer.cpp:430] Stuck req_key: 1746073096222929
I20250501 04:43:48.916661  2630 batched_indexer.cpp:430] Stuck req_key: 1746073132203666
I20250501 04:43:48.916666  2630 batched_indexer.cpp:430] Stuck req_key: 1746073132249008
then 10 mins later
Copy code
E20250501 04:55:11.082156  2629 raft_server.cpp:758] 2341 lagging entries > healthy read lag of 1000
E20250501 04:55:11.082221  2629 raft_server.cpp:770] 2341 lagging entries > healthy write lag of 500
I20250501 04:55:14.082486  2629 raft_server.cpp:683] Term: 81, pending_queue: 0, last_index: 32593772, committed: 32593772, known_applied: 32591430, applying: 32591431, pending_writes: 0, queued_writes: 12
, local_sequence: 347997206
W20250501 04:55:15.734237  2426 threadpool.h:103] Threadpool exhaustion detected, task_queue_len: 681, thread_pool_len: 32
W20250501 04:55:16.605015  2426 threadpool.h:103] Threadpool exhaustion detected, task_queue_len: 682, thread_pool_len: 32
W20250501 04:55:17.429947  2426 threadpool.h:103] Threadpool exhaustion detected, task_queue_len: 683, thread_pool_len: 32
W20250501 04:55:18.203313  2426 threadpool.h:103] Threadpool exhaustion detected, task_queue_len: 684, thread_pool_len: 32
Is the expectation that the threadpool should clear (just causing some latency), or that once it’s exhausted the server will stop responding altogether?
k
This is different. If you have a lot of concurrent requests, thread pool starvation can occur. Just increase the default thread pool size by explicitly setting the server-side parameter.
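A sketch of that change, assuming the thread-pool-size server option and the DEB config file at /etc/typesense/typesense-server.ini:
Copy code
# under [server] in /etc/typesense/typesense-server.ini
# (or pass --thread-pool-size=64 on the command line)
thread-pool-size = 64

# restart for it to take effect
sudo systemctl restart typesense-server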
a
ok thanks, I’ve made it 64 .. but I guess we can increase it further as the CPU on the server is still ~4-5%
How are your leads on the fix for the crashing issue going?
k
Just publishing
29.0.rc17
which should be available shortly. This fix worked for another customer who was having similar issues.
a
ok thanks, can see the new build now.
Is the best approach for us to install this on one node, then have that node in production?
k
Yes, upgrade only 1 node and check if there is a crash like last time.
a
Thanks, we’ll do that now
What was the fix out of interest?
k
Had to revert a totally unrelated PR. Need to see what blew up, haven't gotten around to that. I was just trying to revert PRs to check what works, and caught this.
a
interesting!
We’ve put one server with this version into production
We’ll keep an eye on it and let you know how it goes
👍 1
Nearly 15 mins in and so far it’s looking ok - we would have had a crash by now with the previous versions I think. We’ll monitor for the next few hours.
k
Yes I'm reasonably confident that this fixes it. We will now figure out what went wrong with that PR.
a
Great news, we’re ~25 mins in and no crash with quite high traffic (evening for us) so looking good.
🙌 1
Ok, looks like we might have had a crash - but it was ~1 hour in
k
No stack trace?
In the other customer cluster that had this issue, it has been running without any problems for 3+ hours.
a
no stack trace, and the odd thing is it doesn’t recover after the crash - it restarts, but then gets stuck logging this sort of message. I just manually stopped/started it and it’s back to healthy .. but it’s strange that the restart after the crash stays unhealthy
Copy code
I20250501 06:49:03.587236  8557 batched_indexer.cpp:430] Running GC for aborted requests, req map size: 95
E20250501 06:49:04.974900  8556 raft_server.cpp:767] 5101 lagging entries > healthy read lag of 1000
E20250501 06:49:04.974954  8556 raft_server.cpp:779] 5101 lagging entries > healthy write lag of 500
I20250501 06:49:07.975206  8556 raft_server.cpp:692] Term: 81, pending_queue: 0, last_index: 32599607, committed: 32599607, known_applied: 32588619, applying: 32594506, pending_writes: 0, queued_writes: 103, local_sequence: 348033533
E20250501 06:49:13.975621  8556 raft_server.cpp:767] 5101 lagging entries > healthy read lag of 1000
E20250501 06:49:13.975673  8556 raft_server.cpp:779] 5101 lagging entries > healthy write lag of 500
I20250501 06:49:17.975948  8556 raft_server.cpp:692] Term: 81, pending_queue: 0, last_index: 32599607, committed: 32599607, known_applied: 32588619, applying: 32594506, pending_writes: 0, queued_writes: 103, local_sequence: 348033533
E20250501 06:49:22.976302  8556 raft_server.cpp:767] 5101 lagging entries > healthy read lag of 1000
E20250501 06:49:22.976364  8556 raft_server.cpp:779] 5101 lagging entries > healthy write lag of 500
I20250501 06:49:27.976739  8556 raft_server.cpp:692] Term: 81, pending_queue: 0, last_index: 32599607, committed: 32599607, known_applied: 32588619, applying: 32594506, pending_writes: 0, queued_writes: 103, local_sequence: 348033533
E20250501 06:49:31.977033  8556 raft_server.cpp:767] 5117 lagging entries > healthy read lag of 1000
E20250501 06:49:31.977088  8556 raft_server.cpp:779] 5117 lagging entries > healthy write lag of 500
I20250501 06:49:37.977525  8556 raft_server.cpp:692] Term: 81, pending_queue: 0, last_index: 32599663, committed: 32599663, known_applied: 32588619, applying: 32594506, pending_writes: 0, queued_writes: 103, local_sequence: 348033533
k
On restart it worked?
a
yep
sudo systemctl stop typesense-server.service
took about 1 min .. but then after starting it, it catches up:
Copy code
E20250501 06:54:13.843116  9841 raft_server.cpp:767] 11175 lagging entries > healthy read lag of 1000
E20250501 06:54:13.843695  9841 raft_server.cpp:779] 11175 lagging entries > healthy write lag of 500
I20250501 06:54:13.843722  9841 raft_server.h:299] Node starts following { leader_id=10.50.3.101:8107:8108, term=81, status=Follower receives message from new leader with the same term.}
I20250501 06:54:13.911813  9843 log.cpp:536] close a full segment. Current first_index: 32599001 last_index: 32599871 raft_sync_segments: 0 will_sync: 1 path: /var/lib/typesense/state/log/log_00000000000032599001_00000000000032599871
I20250501 06:54:13.911964  9843 log.cpp:550] Renamed `/var/lib/typesense/state/log/log_inprogress_00000000000032599001' to `/var/lib/typesense/state/log/log_00000000000032599001_00000000000032599871'
I20250501 06:54:13.912024  9843 log.cpp:114] Created new segment `/var/lib/typesense/state/log/log_inprogress_00000000000032599872' with fd=65
E20250501 06:54:22.696559  9809 raft_server.cpp:767] 8019 lagging entries > healthy read lag of 1000
E20250501 06:54:22.697060  9809 raft_server.cpp:779] 8019 lagging entries > healthy write lag of 500
I20250501 06:54:23.697280  9809 raft_server.cpp:692] Term: 81, pending_queue: 0, last_index: 32599895, committed: 32599895, known_applied: 32588619, applying: 32591899, pending_writes: 0, queued_writes: 30, local_sequence: 348002260
E20250501 06:54:31.698508  9809 raft_server.cpp:767] 4583 lagging entries > healthy read lag of 1000
E20250501 06:54:31.698585  9809 raft_server.cpp:779] 4583 lagging entries > healthy write lag of 500
I20250501 06:54:33.698799  9809 raft_server.cpp:692] Term: 81, pending_queue: 0, last_index: 32599895, committed: 32599895, known_applied: 32588619, applying: 32596816, pending_writes: 0, queued_writes: 73, local_sequence: 348062604
E20250501 06:54:40.700843  9809 raft_server.cpp:767] 1401 lagging entries > healthy read lag of 1000
E20250501 06:54:40.700923  9809 raft_server.cpp:779] 1401 lagging entries > healthy write lag of 500
I20250501 06:54:43.702620  9809 raft_server.cpp:692] Term: 81, pending_queue: 0, last_index: 32599895, committed: 32599895, known_applied: 32588619, applying: 32599610, pending_writes: 0, queued_
Syslog shows
Copy code
May  1 06:25:33 prod-typesense2 typesense-server[7126]: typesense-server: /root/.cache/bazel/_bazel_root/45fa99e3ef9c0d8b8591cceacfe70dd5/sandbox/linux-sandbox/370/execroot/__main__/external/elfutils/libdwfl/derelocate.c:178: cache_sections: Assertion `refs == NULL' failed.
May  1 06:25:40 prod-typesense2 systemd[1]: typesense-server.service: Main process exited, code=dumped, status=11/SEGV
May  1 06:25:40 prod-typesense2 systemd[1]: typesense-server.service: Failed with result 'core-dump'.
May  1 06:25:40 prod-typesense2 systemd[1]: typesense-server.service: Consumed 3min 39.444s CPU time.
May  1 06:25:40 prod-typesense2 systemd[1]: typesense-server.service: Scheduled restart job, restart counter is at 1.
May  1 06:25:40 prod-typesense2 systemd[1]: typesense-server.service: Consumed 3min 39.444s CPU time.
May  1 06:25:40 prod-typesense2 typesense-server[8285]: Log directory is configured as: /var/log/typesense
May  1 06:26:08 prod-typesense2 typesense-server[8285]: E20250501 06:26:08.126158  8570 raft_server.cpp:767] 9922 lagging entries > healthy read lag of 1000
May  1 06:26:08 prod-typesense2 typesense-server[8285]: E20250501 06:26:08.126204  8570 raft_server.cpp:779] 9922 lagging entries > healthy write lag of 500
k
Ok will check.
a
It didn’t generate a core-dump so trying to work out how to enable that
Crashed on restart this time. It did dump a stack trace:
Copy code
E20250501 07:19:37.570082 13160 backward.hpp:4200] Stack trace (most recent call last) in thread 13160:
E20250501 07:19:37.570628 13160 backward.hpp:4200] #10   Object "/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1", at 0xffffffffffffffff, in 
E20250501 07:19:37.570703 13160 backward.hpp:4200] #9    Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0xeab096525edb, in 
E20250501 07:19:37.570770 13160 backward.hpp:4200] #8    Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0xeab0964bd5b7, in 
E20250501 07:19:37.570823 13160 backward.hpp:4200] #7    Object "/usr/bin/typesense-server", at 0xafb271b5c68b, in execute_native_thread_routine
E20250501 07:19:37.570878 13160 backward.hpp:4200] #6  | Source "include/threadpool.h", line 59, in operator()
E20250501 07:19:37.570930 13160 backward.hpp:4200]       Source "/usr/include/c++/10/future", line 1592, in ThreadPool [0xafb26f59801f]
E20250501 07:19:37.570984 13160 backward.hpp:4200] #5  | Source "/usr/include/c++/10/future", line 1459, in _M_set_result
E20250501 07:19:37.571036 13160 backward.hpp:4200]     | Source "/usr/include/c++/10/future", line 412, in call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter>()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter>()>*, bool*>
E20250501 07:19:37.571091 13160 backward.hpp:4200]     | Source "/usr/include/c++/10/mutex", line 729, in __gthread_once
E20250501 07:19:37.571147 13160 backward.hpp:4200]       Source "/usr/include/aarch64-linux-gnu/c++/10/bits/gthr-default.h", line 700, in _M_run [0xafb26f7776a3]
E20250501 07:19:37.571211 13160 backward.hpp:4200] #4    Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0xeab0964c25c3, in 
E20250501 07:19:37.571266 13160 backward.hpp:4200] #3  | Source "/usr/include/c++/10/future", line 572, in operator()
E20250501 07:19:37.571283 13160 backward.hpp:4200]       Source "/usr/include/c++/10/bits/std_function.h", line 622, in _M_do_set [0xafb26f59732b]
E20250501 07:19:37.571298 13160 backward.hpp:4200] #2  | Source "/usr/include/c++/10/bits/std_function.h", line 292, in __invoke_r<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter>, std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::__future_base::_Task_state<_Fn, _Alloc, _Res(_Args ...)>::_M_run<std::_Bind<Index::batch_memory_index(Index*, std::vector<index_record>&, const string&, const tsl::htrie_map<char, field>&, const tsl::htrie_map<char, field>&, const string&, const std::vector<char>&, const std::vector<char>&, bool, size_t, size_t, size_t, bool, bool, const tsl::htrie_map<char, field>&, const string&, const spp::sparse_hash_map<std::__cxx11::basic_string<char>, std::set<reference_pair_t> >&)::<lambda()>()>, std::allocator<int>, void, {}>::<lambda()>, void>&>
E20250501 07:19:37.571308 13160 backward.hpp:4200]     | Source "/usr/include/c++/10/bits/invoke.h", line 115, in __invoke_impl<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::__future_base::_Task_state<_Fn, _Alloc, _Res(_Args ...)>::_M_run<std::_Bind<Index::batch_memory_index(Index*, std::vector<index_record>&, const string&, const tsl::htrie_map<char, field>&, const tsl::htrie_map<char, field>&, const string&, const std::vector<char>&, const std::vector<char>&, bool, size_t, size_t, size_t, bool, bool, const tsl::htrie_map<char, field>&, const string&, const spp::sparse_hash_map<std::__cxx11::basic_string<char>, std::set<reference_pair_t> >&)::<lambda()>()>, std::allocator<int>, void, {}>::<lambda()>, void>&>
E20250501 07:19:37.571323 13160 backward.hpp:4200]     | Source "/usr/include/c++/10/bits/invoke.h", line 60, in operator()
E20250501 07:19:37.571329 13160 backward.hpp:4200]     | Source "/usr/include/c++/10/future", line 1397, in operator()
E20250501 07:19:37.571334 13160 backward.hpp:4200]     | Source "/usr/include/c++/10/future", line 1456, in __invoke_r<void, std::_Bind<Index::batch_memory_index(Index*, std::vector<index_record>&, const string&, const tsl::htrie_map<char, field>&, const tsl::htrie_map<char, field>&, const string&, const std::vector<char>&, const std::vector<char>&, bool, size_t, size_t, size_t, bool, bool, const tsl::htrie_map<char, field>&, const string&, const spp::sparse_hash_map<std::__cxx11::basic_string<char>, std::set<reference_pair_t> >&)::<lambda()>()>&>
E20250501 07:19:37.571341 13160 backward.hpp:4200]     | Source "/usr/include/c++/10/bits/invoke.h", line 110, in __invoke_impl<void, std::_Bind<Index::batch_memory_index(Index*, std::vector<index_record>&, const string&, const tsl::htrie_map<char, field>&, const tsl::htrie_map<char, field>&, const string&, const std::vector<char>&, const std::vector<char>&, bool, size_t, size_t, size_t, bool, bool, const tsl::htrie_map<char, field>&, const string&, const spp::sparse_hash_map<std::__cxx11::basic_string<char>, std::set<reference_pair_t> >&)::<lambda()>()>&>
E20250501 07:19:37.571346 13160 backward.hpp:4200]     | Source "/usr/include/c++/10/bits/invoke.h", line 60, in operator()<>
E20250501 07:19:37.571353 13160 backward.hpp:4200]     | Source "/usr/include/c++/10/functional", line 499, in __call<void>
E20250501 07:19:37.571359 13160 backward.hpp:4200]     | Source "/usr/include/c++/10/functional", line 416, in __invoke<Index::batch_memory_index(Index*, std::vector<index_record>&, const string&, const tsl::htrie_map<char, field>&, const tsl::htrie_map<char, field>&, const string&, const std::vector<char>&, const std::vector<char>&, bool, size_t, size_t, size_t, bool, bool, const tsl::htrie_map<char, field>&, const string&, const spp::sparse_hash_map<std::__cxx11::basic_string<char>, std::set<reference_pair_t> >&)::<lambda()>&>
E20250501 07:19:37.571364 13160 backward.hpp:4200]     | Source "/usr/include/c++/10/bits/invoke.h", line 95, in __invoke_impl<void, Index::batch_memory_index(Index*, std::vector<index_record>&, const string&, const tsl::htrie_map<char, field>&, const tsl::htrie_map<char, field>&, const string&, const std::vector<char>&, const std::vector<char>&, bool, size_t, size_t, size_t, bool, bool, const tsl::htrie_map<char, field>&, const string&, const spp::sparse_hash_map<std::__cxx11::basic_string<char>, std::set<reference_pair_t> >&)::<lambda()>&>
E20250501 07:19:37.571370 13160 backward.hpp:4200]       Source "/usr/include/c++/10/bits/invoke.h", line 60, in _M_invoke [0xafb26f7abce3]
E20250501 07:19:37.571375 13160 backward.hpp:4200] #1    Source "src/index.cpp", line 679, in operator() [0xafb26f7ab8e7]
E20250501 07:19:37.571380 13160 backward.hpp:4200] #0    Source "src/index.cpp", line 1000, in index_field_in_memory [0xafb26f7aab70]
I20250501 07:19:38.666558 13160 housekeeper.cpp:96] No in-flight search queries were found.
E20250501 07:19:38.666616 13160 typesense_server.cpp:152] Typesense 29.0.rc17 is terminating abruptly.
k
Ok I will investigate and get back to you.
Do you have vector fields in your schema?
m
Yes we do in some collections, and also reference fields too
k
Ok
If it crashed on restart, would you be able to zip that directory and share with me again?
Unlike run-time crashes, crashes during a restart are very predictable and should be easy to debug.
As long as the copy of the data directory is done before the next snapshot runs in ~60 mins after restart.
a
yep, will do that now
Also looks like we dumped a core
Copy code
coredumpctl list
TIME                          PID UID GID SIG     COREFILE     EXE                       SIZE
Thu 2025-05-01 07:19:47 UTC 13131   0   0 SIGSEGV inaccessible /usr/bin/typesense-server  n/a
(I think)
It did restart successfully on the second attempt FYI
actually no, that’s not right - tweaked some config so hopefully we get a file on next crash (thanks chatgpt)
will share the data dir with you soon
k
Since it's getting stuck on restart I'm pretty sure I can reproduce it.. when a write causes a crash we try to skip it on the next restart. Which is why it could be working on the second restart
So when I debug, I'll disable this skipping behavior to trigger the crash consistently.
a
file is shared 🙂
k
Thanks. I will check it out. It will take a few hours to get to the bottom of it.
Found the issue. This is a different issue and is unrelated to the crash. There is a race condition when related collections update each other’s fields, which sometimes leads to a deadlock. That’s why it sometimes works on restart. We are working on a fix.
a
Nice, thanks - I ended up rolling this back to v28 and clearing the data to get the server to restart 🙂
but can try a new version tomorrow and hopefully capture a coredump of the crash
👍 1
k
Thanks, appreciate the help.
a
Think I have a couple of coredump files too now .. want me to share these?
Copy code
ubuntu@prod-typesense2:~$ ls -l /var/lib/systemd/coredump
total 906416
-rw-r--r-- 1 root root 479184660 May  1 07:19 core.typesense-serve.0.ab330c872f2e49898a067a7f02940c64.13131.1746083978000000.zst
-rw-r--r-- 1 root root 448965264 May  1 07:55 core.typesense-serve.0.ab330c872f2e49898a067a7f02940c64.14209.1746086125000000.zst
one might be the crash on startup, but I think the other was the crash we saw running in prod
k
No let's first fix this issue we found. We can then look at how it behaves after that.
a
ok all good, in theory we’ll capture coredumps if the new version crashes from now on 🙂
Thanks for being so responsive on this, I’m signing out for the night and can help track any issues down tomorrow with an updated build 👋
k
Good night!
a
Keen to know how you got on once you are online. We’ve had some more thread pool issues even with increasing the limit .. and get into the race condition with the restart which is fun 😐
Think this is the race condition - I can’t get this server to come online. It’s stuck and I’ve tried killing the process, stopping/starting, stopping/deleting data/starting, with no luck unfortunately
Copy code
E20250501 23:52:13.640734  2593 raft_server.cpp:758] 2029 lagging entries > healthy read lag of 1000
E20250501 23:52:13.640801  2593 raft_server.cpp:770] 2029 lagging entries > healthy write lag of 500
I20250501 23:52:14.640938  2593 raft_server.cpp:683] Term: 84, pending_queue: 0, last_index: 32652090, committed: 32652090, known_applied: 32649217, applying: 32650061, pending_writes: 0, queued_writes: 24, local_sequence: 348793010
E20250501 23:52:22.641407  2593 raft_server.cpp:758] 2043 lagging entries > healthy read lag of 1000
E20250501 23:52:22.641503  2593 raft_server.cpp:770] 2043 lagging entries > healthy write lag of 500
I20250501 23:52:24.641702  2593 raft_server.cpp:683] Term: 84, pending_queue: 0, last_index: 32652104, committed: 32652104, known_applied: 32649217, applying: 32650061, pending_writes: 0, queued_writes: 24, local_sequence: 348793010
E20250501 23:52:31.642194  2593 raft_server.cpp:758] 2043 lagging entries > healthy read lag of 1000
E20250501 23:52:31.642263  2593 raft_server.cpp:770] 2043 lagging entries > healthy write lag of 500
I20250501 23:52:34.642511  2593 raft_server.cpp:683] Term: 84, pending_queue: 0, last_index: 32652104, committed: 32652104, known_applied: 32649217, applying: 32650061, pending_writes: 0, queued_writes: 24, local_sequence: 348793010
E20250501 23:52:40.642927  2593 raft_server.cpp:758] 2076 lagging entries > healthy read lag of 1000
E20250501 23:52:40.642990  2593 raft_server.cpp:770] 2076 lagging entries > healthy write lag of 500
I20250501 23:52:44.643304  2593 raft_server.cpp:683] Term: 84, pending_queue: 0, last_index: 32652137, committed: 32652137, known_applied: 32649217, applying: 32650061, pending_writes: 0, queued_writes: 24, local_sequence: 348793010
E20250501 23:52:49.643666  2593 raft_server.cpp:758] 2116 lagging entries > healthy read lag of 1000
E20250501 23:52:49.643735  2593 raft_server.cpp:770] 2116 lagging entries > healthy write lag of 500
I20250501 23:52:54.644065  2593 raft_server.cpp:683] Term: 84, pending_queue: 0, last_index: 32652177, committed: 32652177, known_applied: 32649217, applying: 32650061, pending_writes: 0, queued_writes: 24, local_sequence: 348793010
E20250501 23:52:58.644340  2593 raft_server.cpp:758] 2116 lagging entries > healthy read lag of 1000
E20250501 23:52:58.644420  2593 raft_server.cpp:770] 2116 lagging entries > healthy write lag of 500
I20250501 23:53:04.644853  2593 raft_server.cpp:683] Term: 84, pending_queue: 0, last_index: 32652177, committed: 32652177, known_applied: 32649217, applying: 32650061, pending_writes: 0, queued_writes: 24, local_sequence: 348793010
E20250501 23:53:07.645090  2593 raft_server.cpp:758] 2116 lagging entries > healthy read lag of 1000
E20250501 23:53:07.645167  2593 raft_server.cpp:770] 2116 lagging entries > healthy write lag of 500
I20250501 23:53:10.053274  2594 batched_indexer.cpp:422] Running GC for aborted requests, req map size: 24
I20250501 23:53:14.645665  2593 raft_server.cpp:683] Term: 84, pending_queue: 0, last_index: 32652177, committed: 32652177, known_applied: 32649217, applying: 32650061, pending_writes: 0, queued_writes: 24, local_sequence: 348793010
E20250501 23:53:16.645853  2593 raft_server.cpp:758] 2116 lagging entries > healthy read lag of 1000
E20250501 23:53:16.645912  2593 raft_server.cpp:770] 2116 lagging entries > healthy write lag of 500
I20250501 23:53:24.646462  2593 raft_server.cpp:683] Term: 84, pending_queue: 0, last_index: 32652177, committed: 32652177, known_applied: 32649217, applying: 32650061, pending_writes: 0, queued_writes: 24, local_sequence: 348793010
E20250501 23:53:25.646586  2593 raft_server.cpp:758] 2116 lagging entries > healthy read lag of 1000
E20250501 23:53:25.646653  2593 raft_server.cpp:770] 2116 lagging entries > healthy write lag of 500
I20250501 23:53:34.647258  2593 raft_server.cpp:683] Term: 84, pending_queue: 0, last_index: 32652177, committed: 32652177, known_applied: 32649217, applying: 32650061, pending_writes: 0, queued_writes: 24, local_sequence: 348793010
E20250501 23:53:34.647316  2593 raft_server.cpp:758] 2116 lagging entries > healthy read lag of 1000
E20250501 23:53:34.647337  2593 raft_server.cpp:770] 2116 lagging entries > healthy write lag of 500
E20250501 23:53:43.647897  2593 raft_server.cpp:758] 2116 lagging entries > healthy read lag of 1000
E20250501 23:53:43.647960  2593 raft_server.cpp:770] 2116 lagging entries > healthy write lag of 500
I20250501 23:53:44.648094  2593 raft_server.cpp:683] Term: 84, pending_queue: 0, last_index: 32652177, committed: 32652177, known_applied: 32649217, applying: 32650061, pending_writes: 0, queued_writes: 24, local_sequence: 348793010
E20250501 23:53:52.648653  2593 raft_server.cpp:758] 2117 lagging entries > healthy read lag of 1000
E20250501 23:53:52.648728  2593 raft_server.cpp:770] 2117 lagging entries > healthy write lag of 500
I20250501 23:53:54.648919  2593 raft_server.cpp:683] Term: 84, pending_queue: 0, last_index: 32652179, committed: 32652179, known_applied: 32649217, applying: 32650061, pending_writes: 0, queued_writes: 24, local_sequence: 348793010
I’ve grabbed a copy of the data dir just in case
Ok, it came back up but I’m guessing this is after the leader did a snapshot .. so in theory when this race happens I can force the leader to snapshot with an API call, then clear the data and restart the server and it’ll be ok (that’s my theory)
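The API call referred to here is presumably the snapshot endpoint, roughly like this (the snapshot path is just an example):
Copy code
curl -X POST "${TYPESENSE_BASE_URL}/operations/snapshot?snapshot_path=/tmp/typesense-snapshot" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"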
One more thing to note: we’ve only had threadpool issues this week and continue to have them. The only change we made earlier this week was to use this new flag, so we’re also going to revert that and see how the cluster behaves FYI (our traffic hasn’t increased, and we’re now setting the threadpool to 128 but still exhausting it, yet CPU on the servers is ~4%): https://typesense.org/docs/28.0/api/joins.html#asynchronous-references
traffic has been very steady for us over the last 2 weeks for reference.
k
The fix is not trivial so it might take a few days. The problem is happening because the fields in the collections form cyclic references and the current implementation is not fully foolproof against this, so it ends up causing a deadlock when two such cyclically referring collections are written to at the same time.
a
do you know which fields/collections? We can take a look and see if we can make a change for this 🙂
k
Here is one example:
Copy code
glassons-com:product:embeddings -> glassons-com:products:sg
glassons-com:products:sg -> glassons-com:product:embeddings
m
In ^this setup, only the embeddings collection has a reference field to products, and we use it to search for similar products via JOIN. How does the cyclic reference happen?
We can share schemas/queries if needed
k
Sorry, the cycle I mentioned is an artifact of how we have to do a back reference / link even though it's a one-way reference.
m
Oh, I see, thanks for the clarification! Is there anything we can do on our end other than dropping the reference field?
k
No, I think we need to remove this link on our end or at least make it not deadlock in the way it does currently. Hopefully we will have a fix by EOD.
🙌 2
a
Nice, thanks - did you see the note about the change we added earlier this week, which we’ve now reverted? Hoping this will also stop the other issues we’ve been seeing (the only customer we have that doesn’t load data properly after a restart had this set, so we’re also hoping their data returns as normal if we do see another issue)
Copy code
"async_reference": true
k
async_reference
just means that you don’t have to follow indexing order when indexing related collections. I’m not sure how that could lead to thread starvation. Perhaps the deadlock is just using up the threads. We will wait for this fix to land for further assessment.
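For reference, the flag sits on the reference field in the collection schema, roughly like this (illustrative collection/field names):
Copy code
{
  "name": "product_embeddings",
  "fields": [
    {
      "name": "product_id",
      "type": "string",
      "reference": "products.id",
      "async_reference": true
    }
  ]
}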
👌 1
a
Thanks!
k
I've published a patch in
29.0.rc18
which I have just pushed.
a
Thanks, do you think the async change could have affected threadpools? We’ve had 0 issues since recreating the embedding collection without this setting. We can potentially test this version on the server we have with coredumps enabled, in case we see more crashes, but we’re aware we’ve also changed our data structure 🙂
k
The bug that caused race conditions could have caused the threads to hang leading to thread starvation.
a
Ok, thanks - that makes sense 🙂
Just to summarise, and so we’re really clear (we’re cautious about upgrading for a bit), there were multiple issues we were seeing:
1. v28 had a bug with async collections that, when restarted, caused the collection to return inconsistent results (sometimes blank)
2. v28 also had a bug with async references that caused a race condition that starved the threadpool (this was the main issue we faced last week)
3. v29, when we upgraded, had a crash issue with certain data sets; we hit this as well, and I think this was fixed with a revert of a PR that was to do with filtering nested objects (was this the regex issue @Jason Bosco referred to?)
We’ve since deleted the reference collection we had that we’d set async references on, and this seems to have made v28 stable again for us. In theory we can also now upgrade to the v29 RC as this has also fixed the crash bug we were seeing? Apologies for the questions, but we had a bit of downtime this week that we’re not keen to replicate, as you can imagine 🙂
k
Async references is a new feature and some of these issues show up only under very high concurrent reads and writes, as is the case with your deployment. It's impossible to uncover these cases without some real-world battle testing. Our goal is to fix them proactively. The crash issue in v29 is a totally different issue and is triggered only when you have a really large filter condition, like one containing a large number of IDs for deletion etc. The regex on such a large filter string caused high CPU load and stack smashing. The reverted PR has now been restored along with the correct fix. This particular regex-related issue does not exist on v28 since that feature does not exist in v28.
a
Thanks, I appreciate the work you guys have put in to these updates 🙂 We’ll upgrade one of our prod machines to the latest RC and see how it behaves!
We’ve deployed that new RC on one server in production 🙂 we’ll keep an eye on it and let you know how it goes!
k
In general, updating only one of the nodes in an HA cluster will help catch these issues.
a
That server has stopped responding for us, will restart it now
Logs show that it stalls on updates
Copy code
I20250503 06:23:23.337743 14248 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 32753082, committed: 32753082, known_applied: 32752998, applying: 0, pending_writes: 0, queued_writes: 135, local_sequence: 350055580
I20250503 06:23:33.343246 14248 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 32753111, committed: 32753111, known_applied: 32752998, applying: 0, pending_writes: 0, queued_writes: 135, local_sequence: 350055580
I20250503 06:23:43.348913 14248 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 32753111, committed: 32753111, known_applied: 32752998, applying: 0, pending_writes: 0, queued_writes: 135, local_sequence: 350055580
I20250503 06:23:53.355930 14248 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 32753112, committed: 32753112, known_applied: 32752998, applying: 0, pending_writes: 0, queued_writes: 135, local_sequence: 350055580
I20250503 06:24:03.361439 14248 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 32753129, committed: 32753129, known_applied: 32752998, applying: 0, pending_writes: 0, queued_writes: 135, local_sequence: 350055580
I20250503 06:24:13.366940 14248 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 32753129, committed: 32753129, known_applied: 32752998, applying: 0, pending_writes: 0, queued_writes: 135, local_sequence: 350055580
I20250503 06:24:23.373948 14248 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 32753129, committed: 32753129, known_applied: 32752998, applying: 0, pending_writes: 0, queued_writes: 135, local_sequence: 350055580
I20250503 06:24:33.379421 14248 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 32753129, committed: 32753129, known_applied: 32752998, applying: 0, pending_writes: 0, queued_writes: 135, local_sequence: 350055580
I20250503 06:24:43.384934 14248 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 32753135, committed: 32753134, known_applied: 32752998, applying: 0, pending_writes: 0, queued_writes: 135, local_sequence: 350055580
k
Ok, can you please share data dir again. I suspect there are still lurking race conditions.
a
👌 give me a few mins
👍 1
shared now 🎉
k
Thanks, will investigate and get back to you.
Btw, did the restart work? Or is it hanging only during live indexing? Earlier it used to hang on restart too; what's the behavior now?
a
I rolled this back to v28 for restart (I think) but we’ve also removed the async reference collection we had
If you need the dir before a restart I can go back to v29RC and see if it stalls again
then get the data dir after the server is stopped, before restarting?
k
Yes, I am running with this dataset and it's not hanging
a
ok all good. so re-install v29RC - when it stalls, stop it and send the data without restart?
k
No, the thing is, only if the server blocks again on restart can we reproduce that and debug it locally. If it hangs but recovers on restart, it means the race condition is not reproducible on restart, and a restart-reproducible hang is what we need for debugging.
a
ok, so if it hangs, try a restart, and if it still hangs, send the data?
k
Correct
👌 1
a
Looks like we had a crash but the process restarted and recovered ok. I’ve got a core dump, want me to share it?
^ Have shared it with you just in case
👍 1
k
Crash without a stack trace?
a
yep, no stack trace
Copy code
I20250503 08:34:00.164059 19804 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 32758408, committed: 32758408, known_applied: 32758408, applying: 0, pending_writes: 0, queued_writes: 0, local_sequence: 350152875
I20250503 08:34:10.169531 19804 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 32758422, committed: 32758422, known_applied: 32758422, applying: 0, pending_writes: 0, queued_writes: 0, local_sequence: 350153090
Log file created at: 2025/05/03 08:34:20
Running on machine: prod-typesense2
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
I20250503 08:34:20.575286 20378 typesense_server_utils.cpp:353] Starting Typesense 29.0.rc19
I20250503 08:34:20.575393 20378 typesense_server_utils.cpp:356] Typesense is using jemalloc.
will check syslog for anything more
syslog shows
Copy code
May  3 08:34:10 prod-typesense2 typesense-server[19338]: typesense-server: /root/.cache/bazel/_bazel_root/45fa99e3ef9c0d8b8591cceacfe70dd5/sandbox/linux-sandbox/370/execroot/__main__/external/elfutils/libdwfl/derelocate.c:178: cache_sections: Assertion `refs == NULL' failed.
May  3 08:34:11 prod-typesense2 systemd[1]: Started Process Core Dump (PID 20370/UID 0).
May  3 08:34:19 prod-typesense2 systemd-coredump[20371]: Process 19338 (typesense-serve) of user 0 dumped core.#012#012Found module /usr/bin/typesense-server with build-id: 32307edc4d4afaec#012Found module /usr/bin/typesense-server with build-id: 32307edc4d4afaec#012Found module /usr/bin/typesense-server with build-id: 32307edc4d4afaec#012Stack trace of thread 19508:#012#0  0x0000be315c6337b0 edata_list_inactive_remove (/usr/bin/typesense-server + 0x48d37b0)#012#012Stack trace of thread 19347:#012#0  0x0000e5130caf9de8 n/a (n/a + 0x0)#012#1  0x0000e5130caf9dc8 n/a (n/a + 0x0)
May  3 08:34:20 prod-typesense2 systemd[1]: systemd-coredump@1-20370-0.service: Deactivated successfully.
May  3 08:34:20 prod-typesense2 systemd[1]: systemd-coredump@1-20370-0.service: Consumed 6.583s CPU time.
May  3 08:34:20 prod-typesense2 systemd[1]: typesense-server.service: Main process exited, code=dumped, status=11/SEGV
May  3 08:34:20 prod-typesense2 systemd[1]: typesense-server.service: Failed with result 'core-dump'.
May  3 08:34:20 prod-typesense2 systemd[1]: typesense-server.service: Consumed 3min 32.561s CPU time.
May  3 08:34:20 prod-typesense2 systemd[1]: typesense-server.service: Scheduled restart job, restart counter is at 2.
May  3 08:34:20 prod-typesense2 systemd[1]: Stopped Typesense Server.
May  3 08:34:20 prod-typesense2 systemd[1]: typesense-server.service: Consumed 3min 32.561s CPU time.
May  3 08:34:20 prod-typesense2 systemd[1]: Started Typesense Server.
May  3 08:34:20 prod-typesense2 typesense-server[20378]: Log directory is configured as: /var/log/typesense
k
Not useful; it seems to have crashed while trying to print the actual stack trace. I will check the core dump. That should have all the details.
a
all good, thanks - we’ll leave this running for now as it seemed to recover ok after the crash 🙂
👍 1
k
You are on ARM right?
a
yep, we host in AWS on graviton
👍 1
k
Unfortunately the core dump file seems corrupted. Oh well.
a
oh damn 😕
k
I know it's late so it's not urgent, but would you be able to restore the earlier
typesense.tgz
file that was shared? I deleted it, but I'd like to revisit that data again.
a
oh, I might have deleted it locally .. let me see if I have it on the server still and will re-share
k
Google Drive says it's in your trash 🙂
a
hah, restored - can you see it now?
k
Works, thanks.
👌 1
An update: I've identified and am fixing a few race conditions by instrumenting the code with a memory profiler. I will ping you once we have a build that fixes those issues.
a
Awesome. Thanks for the update!
How are you getting on?
k
I have a patch that fixes the issues that showed up in the profiler. It's a large change, so it's going through some additional testing now.
a
Nice, thanks for the update - we’re happy to test it on a node once ready 🙂
k
I've just kicked off a
29.0.rc20
build which fixes the deadlock issues flagged by the profiler. Will be available in about 45 mins.
a
Thanks! We’ll deploy this to a node today and see how it behaves
We’ve updated a node and have it serving production traffic now - will let you know how we get on 🙂
We just had a crash with the new RC - it then looks to have crashed again on restart (but the second restart recovered ok). TS log:
Copy code
I20250507 23:44:50.438378 58573 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 33079635 last_index: 33079749
I20250507 23:44:50.438529 58573 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 33079750 last_index: 33079898
I20250507 23:44:50.438727 58573 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 33079899 last_index: 33080665
I20250507 23:44:50.439497 58573 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 33080666 last_index: 33080875
I20250507 23:44:50.439740 58573 log.cpp:1112] load open segment, path: /var/lib/typesense/state/log first_index: 33080876
I20250507 23:44:50.440557 58621 raft_server.cpp:644] on_snapshot_load
I20250507 23:44:50.441053 58621 store.cpp:246] rm /var/lib/typesense/analytics/db success
I20250507 23:44:50.441179 58621 store.cpp:256] copy snapshot /var/lib/typesense/state/snapshot/snapshot_00000000000033078638/analytics_db_snapshot to /var/lib/typesense/analytics/db success
I20250507 23:44:50.441195 58621 store.cpp:40] Initializing DB by opening state dir: /var/lib/typesense/analytics/db
I20250507 23:44:50.448840 58621 store.cpp:270] DB open success!
I20250507 23:44:50.465725 58621 store.cpp:246] rm /var/lib/typesense/db success
I20250507 23:44:50.465922 58621 store.cpp:256] copy snapshot /var/lib/typesense/state/snapshot/snapshot_00000000000033078638/db_snapshot to /var/lib/typesense/db success
I20250507 23:44:50.465937 58621 store.cpp:40] Initializing DB by opening state dir: /var/lib/typesense/db
I20250507 23:44:50.483794 58621 store.cpp:270] DB open success!
I20250507 23:44:50.483817 58621 raft_server.cpp:598] Loading collections from disk...
I20250507 23:44:50.483824 58621 collection_manager.cpp:385] CollectionManager::load()
I20250507 23:44:50.484022 58621 auth_manager.cpp:35] Indexing 81 API key(s) found on disk.
I20250507 23:44:50.484400 58621 collection_manager.cpp:421] Loading upto 8 collections in parallel, 1000 documents at a time.
I20250507 23:44:50.485163 58621 collection_manager.cpp:430] Found 140 collection(s) on disk.
Syslog
Copy code
May  7 23:44:49 prod-typesense3 typesense-server[57199]: typesense-server: /root/.cache/bazel/_bazel_root/45fa99e3ef9c0d8b8591cceacfe70dd5/sandbox/linux-sandbox/370/execroot/__main__/external/elfutils/libdwfl/derelocate.c:178: cache_sections: Assertion `refs == NULL' failed.
May  7 23:44:50 prod-typesense3 systemd[1]: typesense-server.service: Main process exited, code=killed, status=11/SEGV
May  7 23:44:50 prod-typesense3 systemd[1]: typesense-server.service: Failed with result 'signal'.
May  7 23:44:50 prod-typesense3 systemd[1]: typesense-server.service: Consumed 2min 17.612s CPU time.
May  7 23:44:50 prod-typesense3 systemd[1]: typesense-server.service: Scheduled restart job, restart counter is at 1.
May  7 23:44:50 prod-typesense3 systemd[1]: Stopped Typesense Server.
May  7 23:44:50 prod-typesense3 systemd[1]: typesense-server.service: Consumed 2min 17.612s CPU time.
May  7 23:44:50 prod-typesense3 systemd[1]: Started Typesense Server.
May  7 23:44:50 prod-typesense3 typesense-server[58109]: Log directory is configured as: /var/log/typesense
May  7 23:45:18 prod-typesense3 typesense-server[58109]: E20250507 23:45:18.804023 58587 raft_server.cpp:767] 2658 lagging entries > healthy read lag of 1000
May  7 23:45:18 prod-typesense3 typesense-server[58109]: E20250507 23:45:18.804445 58587 raft_server.cpp:779] 2658 lagging entries > healthy write lag of 500
May  7 23:45:51 prod-typesense3 typesense-server[58109]: typesense-server: /root/.cache/bazel/_bazel_root/45fa99e3ef9c0d8b8591cceacfe70dd5/sandbox/linux-sandbox/370/execroot/__main__/external/elfutils/libdwfl/derelocate.c:178: cache_sections: Assertion `refs == NULL' failed.
May  7 23:45:52 prod-typesense3 systemd[1]: typesense-server.service: Main process exited, code=killed, status=11/SEGV
May  7 23:45:52 prod-typesense3 systemd[1]: typesense-server.service: Failed with result 'signal'.
May  7 23:45:52 prod-typesense3 systemd[1]: typesense-server.service: Consumed 1min 23.020s CPU time.
May  7 23:45:52 prod-typesense3 systemd[1]: typesense-server.service: Scheduled restart job, restart counter is at 2.
May  7 23:45:52 prod-typesense3 systemd[1]: Stopped Typesense Server.
May  7 23:45:52 prod-typesense3 systemd[1]: typesense-server.service: Consumed 1min 23.020s CPU time.
May  7 23:45:52 prod-typesense3 systemd[1]: Started Typesense Server.
May  7 23:45:52 prod-typesense3 typesense-server[58766]: Log directory is configured as: /var/log/typesense
I’ll revert this back to v28 - it looks like these crashes are quite regular:
Copy code
I20250508 00:19:49.879379 60495 raft_server.cpp:605] Finished loading collections from disk.
I20250508 00:22:14.601760 61220 raft_server.cpp:605] Finished loading collections from disk.
I20250508 00:29:32.103709 61823 raft_server.cpp:605] Finished loading collections from disk.
I20250508 00:34:36.238291 62509 raft_server.cpp:605] Finished loading collections from disk.
I20250508 00:54:33.855804 63270 raft_server.cpp:605] Finished loading collections from disk.
k
Crashed on restart
Did you happen to take a data directory copy here?
a
I didn’t unfortunately, but could re-install and see if we can replicate it. Do I stop TS if this happens and compress the data .. or let it start successfully and then compress?
k
Yes, the crucial thing is to have a data directory copy from when Typesense fails on restart, because that's when I can re-run and reproduce the issue. So the first crash during runtime is not useful. If it crashes on restart, then yes, please grab the data directory then.
Core dump is also useful btw.
Can you check if you have the core dump of the crash?
Oh but wait, I had trouble loading it. Sorry I forgot that.
a
ok, all good
k
At least that particular core dump.
a
I’ll re-install and see if we see the same crash on restart
👍 1
This is a different server so I’ll need to configure core dumps. Our leader node always seems to switch between what we call prod1 and prod2, so we’re using prod3 for this install; it never seems to become leader, so we feel a bit more comfortable about it.
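(Side note: a rough sketch of enabling core dumps for the systemd-managed service, assuming the stock unit file and systemd-coredump, which the earlier syslog output suggests is already in place. Exact steps may differ per distro.)
Copy code
# Allow the typesense-server unit to produce core dumps (drop-in override):
sudo systemctl edit typesense-server
#   add under [Service]:
#   LimitCORE=infinity
sudo systemctl restart typesense-server

# After a crash, list and export captured dumps:
coredumpctl list /usr/bin/typesense-server
coredumpctl dump /usr/bin/typesense-server -o /tmp/typesense.core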
So just to clarify: if we see a crash on startup after another crash, stop TS and then compress the data dir?
k
Correct.
This can be done within 1 hour of the second crash, because the default snapshot interval is set to 1 hour.
1
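(A minimal sketch of grabbing that copy, using the data dir path from the logs above; the archive name is arbitrary.)
Copy code
# After the crash-on-restart, stop the service and archive the data dir
# within the ~1 hour snapshot window mentioned above.
sudo systemctl stop typesense-server
sudo tar czf /tmp/typesense-data-$(date +%Y%m%d-%H%M).tgz -C /var/lib typesense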
a
Ok, we had a crash and this time it didn’t recover - stopping the process and will send the data over
Copy code
E20250508 04:33:42.934588 66918 raft_server.cpp:767] 1607 lagging entries > healthy read lag of 1000
E20250508 04:33:42.934661 66918 raft_server.cpp:779] 1607 lagging entries > healthy write lag of 500
I20250508 04:33:45.934913 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33102116, committed: 33102116, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
E20250508 04:33:51.935331 66918 raft_server.cpp:767] 1650 lagging entries > healthy read lag of 1000
E20250508 04:33:51.935388 66918 raft_server.cpp:779] 1650 lagging entries > healthy write lag of 500
I20250508 04:33:55.935705 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33102161, committed: 33102161, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
E20250508 04:34:00.936070 66918 raft_server.cpp:767] 1676 lagging entries > healthy read lag of 1000
E20250508 04:34:00.936142 66918 raft_server.cpp:779] 1676 lagging entries > healthy write lag of 500
I20250508 04:34:05.936522 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33102185, committed: 33102185, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
E20250508 04:34:09.936829 66918 raft_server.cpp:767] 1676 lagging entries > healthy read lag of 1000
E20250508 04:34:09.936900 66918 raft_server.cpp:779] 1676 lagging entries > healthy write lag of 500
k
and this time it didn’t recover
When you say it didn't recover, do you mean it's just stuck?
a
yep, unhealthy state - the lagging entries count is just growing
k
Ok, and this is on a restart?
a
I’m compressing the data now - yes, after the restart. Although it did look like it was healthy for a while, then got into this state.
I’ll get a better look at the logs shortly
👍 1
k
What I wish to know is whether, after the restart, it did catch up (i.e. recovered successfully) and then got stuck (because of live writes), or whether it never got healthy. Hopefully it's the latter.
a
Actually no, sorry - the flow was: install RC20; what I thought was a crash followed by indexes loading was actually just the startup of RC20. So this appears to have been running ok until it just started lagging, then never recovered.
Copy code
I20250508 04:09:55.427745 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33100812, committed: 33100811, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
I20250508 04:10:05.433208 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33100861, committed: 33100861, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
I20250508 04:10:14.331883 66919 batched_indexer.cpp:430] Running GC for aborted requests, req map size: 0
I20250508 04:10:15.438922 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33100882, committed: 33100882, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
I20250508 04:10:21.095005 66949 log.cpp:536] close a full segment. Current first_index: 33100094 last_index: 33100892 raft_sync_segments: 0 will_sync: 1 path: /var/lib/typesense/state/log/log_00000000000033100094_00000000000033100892
I20250508 04:10:21.095139 66949 log.cpp:550] Renamed `/var/lib/typesense/state/log/log_inprogress_00000000000033100094' to `/var/lib/typesense/state/log/log_00000000000033100094_00000000000033100892'
I20250508 04:10:21.095188 66949 log.cpp:114] Created new segment `/var/lib/typesense/state/log/log_inprogress_00000000000033100893' with fd=70
I20250508 04:10:25.446055 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33100910, committed: 33100910, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
I20250508 04:10:35.451594 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33100910, committed: 33100910, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
I20250508 04:10:45.457027 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33100929, committed: 33100929, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
I20250508 04:10:55.464278 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33100930, committed: 33100930, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
I20250508 04:11:05.469852 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33100930, committed: 33100930, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
I20250508 04:11:15.335372 66919 batched_indexer.cpp:430] Running GC for aborted requests, req map size: 0
I20250508 04:11:15.476284 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33100932, committed: 33100932, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
I20250508 04:11:25.482952 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33100933, committed: 33100933, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
I20250508 04:11:35.488318 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33100970, committed: 33100970, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
I20250508 04:11:45.493450 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33100970, committed: 33100970, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
I20250508 04:11:55.500308 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33101013, committed: 33101013, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
E20250508 04:11:57.500486 66918 raft_server.cpp:779] 504 lagging entries > healthy write lag of 500
I20250508 04:12:05.505687 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33101014, committed: 33101014, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
E20250508 04:12:06.505896 66918 raft_server.cpp:779] 506 lagging entries > healthy write lag of 500
I20250508 04:12:15.511322 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33101015, committed: 33101015, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
E20250508 04:12:15.511467 66918 raft_server.cpp:779] 506 lagging entries > healthy write lag of 500
I20250508 04:12:16.339378 66919 batched_indexer.cpp:430] Running GC for aborted requests, req map size: 0
E20250508 04:12:24.517084 66918 raft_server.cpp:779] 506 lagging entries > healthy write lag of 500
I202
I have the data dir from the last logs if you want it? Or I can restart and see if it crashes (in which case I still have the data as well)
k
Yes, let's focus on the crash-on-restart issue first, because that's the easiest to debug.
a
ok, I’ll restart and see what happens
k
But you can also share this data directory right away.
1
a
The server started up ok this time round
Shared the directory
I need to head off so will roll back to v28 for now and try v29RC again later when I can watch the process.
👍 1
k
There is one other thing you can try tomorrow. Enable this server-side flag:
Copy code
enable-search-logging
This will log search queries before they hit the processing code. That way, we can very quickly find out if a particular type of query is causing the crash. Maybe this time it's not a data indexing issue, in which case restoring snapshots in a debug environment might not help at all because they contain no search queries.
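(For reference, a sketch of turning that on via the config file path seen later in this thread; this assumes the ini file accepts the flag as a plain key = value line, with a restart to pick it up.)
Copy code
# Assumed: config lives at /etc/typesense/typesense-server.ini (as in the logs below)
# and the flag is set as a key = value line.
echo "enable-search-logging = true" | sudo tee -a /etc/typesense/typesense-server.ini
sudo systemctl restart typesense-server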
a
I’ve done this and gone through a couple of crashes. I’ll DM you the logs of the 20 searches logged before each crash
👍 1
k
I've tried running the queries but they work fine.
1. Are you using a GPU for the embedding?
2. Would it be possible for you to disable searches alone (by blocking the http(s) port) on the node with the RC build? This will allow us to find out if the issue is related to writes or whether searches are also involved.
3. Finally, is it possible for you guys to use Typesense Cloud for a few days so we are able to get better instrumentation and control over the logs and crashes? I have tried my best to pin this issue down, but since it's so elusive, I suspect I need an environment with full control where we can basically debug it live. We would run 2 nodes with the stable v28 GA and one node on the RC build, with better instrumentation and full logs to get to the bottom of it. I hope you will consider this proposal.
a
1. We’re not - these are just plain Graviton machines
2. We can upgrade the server and remove it from the target group on the load balancer - will that achieve the same thing?
3. Not easily, unfortunately - the way we are architected, this would be a pretty massive change for us (and not realistic for our production sites). Is there more telemetry I can install on the problem server that will help?
k
For 2: no, you have to block port 80 (or whatever HTTP port you have Typesense configured with) using iptables. For 1: which Graviton instance exactly? What's the runtime memory usage? Let's first focus on disabling HTTP traffic to isolate search traffic, which will give us more hints.
E.g. sudo iptables -A INPUT -p tcp --dport 80 -j DROP
To undo: sudo iptables -D INPUT -p tcp --dport 80 -j DROP
a
I think removing the instance from the load balancer will achieve the same thing. We listen on 8107 (traffic) and 8108 (replication), from memory, so removing it will stop all traffic to 8107.
k
Are you sure it's not 8107 for replication and 8108 for http? That's the default.
a
Our instances are r8g.large: 16 GB RAM, and we’re using about 4 GB.
k
Also, how long does it typically take the node to catch up after a restart?
a
Yeah, maybe that. So as long as we stop traffic to everything apart from the replication port, we’re ok?
k
Yes I think it should be fine.
a
Catch-up is 5-10 seconds.
k
But wait
a
It’s quick. Our data sets are small 😊
k
The leader also talks to the follower via the http port for some status checks.
a
Oh ok. So we want to block that as well?
k
What is your rough search concurrency, in terms of searches per second?
The
iptables
approach is the simplest. Literally a one-line command to enable/disable.
a
I can work this out but I’d estimate 50-60/sec based on the load balancer graphs.
k
Ok, an estimate is fine
And the crash happens pretty fast?
a
Normally we crash within 20 mins, then it’s regular - every 10 mins or so.
k
Did you also check that you're not somehow running out of memory?
a
If there is some debug telemetry I can run, we’re happy to do this
k
Let's try the HTTP traffic blocking first.
a
I’ll check for memory leaks next time we update to the RC. Can do tonight
But we have a lot of headroom
k
Yes. So, for the next round (see the sketch below):
1. Restart and then block the HTTP port
2. Monitor memory and wait for a crash
This will clearly tell us whether the problem is in the search path or the indexing path.
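(A sketch of that round, reusing the iptables commands from above and assuming the API/search port is 8108 per the defaults discussed earlier; the memory check is just a generic ps loop.)
Copy code
# 1. Restart, then block search/API traffic (peering port 8107 stays open)
sudo systemctl restart typesense-server
sudo iptables -A INPUT -p tcp --dport 8108 -j DROP

# 2. Watch memory (RSS) while waiting for a crash
watch -n 60 'ps -eo pid,rss,etime,args | grep [t]ypesense-server'

# Undo the block afterwards
sudo iptables -D INPUT -p tcp --dport 8108 -j DROP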
Based on this, I can produce a debug build for you with some traces. But I think it will be super helpful to debug on x86 rather than ARM.
a
Can I mix servers? It’s not hard for us to swap in an x86 machine as this server for testing.
k
Yes, as long as the node we are testing with the RC build is x86, that's all we need. We also use both ARM and x86 ourselves and often have mixed clusters.
a
Ok, cool. I’ll switch this one to x86, enable core dumps, and try as above.
k
Thank you for the help!
❤️ 1
a
I still haven’t had a chance to set this server up; I'll try to get to it at some stage today with the new RC.
k
Meanwhile, I found another issue related to group_by which we identified with another customer. Can you please try 29.0.rc21?
a
Yes, I can try this shortly.
Unrelated (or maybe related): we just had one of our v28 servers crash - no stack trace, unfortunately.
Copy code
I20250514 02:25:34.280957  5178 raft_server.h:60] Peer refresh succeeded!
Log file created at: 2025/05/14 02:25:42
Running on machine: prod-typesense1
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
I20250514 02:25:42.691740 28480 typesense_server_utils.cpp:348] Starting Typesense 28.0
I20250514 02:25:42.691789 28480 typesense_server_utils.cpp:351] Typesense is using jemalloc.
I20250514 02:25:42.691942 28480 typesense_server_utils.cpp:413] Thread pool size: 128
I20250514 02:25:42.699847 28480 store.cpp:40] Initializing DB by opening state dir: /var/lib/typesense/db
I20250514 02:25:42.732021 28480 store.cpp:40] Initializing DB by opening state dir: /var/lib/typesense/meta
I20250514 02:25:42.752151 28480 store.cpp:40] Initializing DB by opening state dir: /var/lib/typesense/analytics/db
I20250514 02:25:42.765798 28480 ratelimit_manager.cpp:546] Loaded 0 rate limit rules.
I20250514 02:25:42.765830 28480 ratelimit_manager.cpp:547] Loaded 0 rate limit bans.
I20250514 02:25:42.766219 28480 typesense_server_utils.cpp:578] Starting API service...
I20250514 02:25:42.766376 28480 http_server.cpp:180] Typesense has started listening on port 8108
I20250514 02:25:42.767145 28943 typesense_server_utils.cpp:521] Conversation garbage collector thread started.
I’ll install RC21 on this server now and we’ll test it
We had a crash with this version that stored a core dump. I’m a bit out of my depth with these, but have tried to ChatGPT it. Let me know if there is anything you want me to try.
Copy code
warning: Error reading shared library list entry at 0x253c00746473
Failed to read a valid object file image from memory.
--Type <RET> for more, q to quit, c to continue without paging--
Core was generated by `/usr/bin/typesense-server --config=/etc/typesense/typesense-server.ini'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  CountMinSketch::~CountMinSketch (this=0xfb1f1de64240, __in_chrg=<optimized out>) at include/count_min_sketch.h:48
48	include/count_min_sketch.h: No such file or directory.
[Current thread is 1 (LWP 30151)]
(gdb) bt full
#0  CountMinSketch::~CountMinSketch (this=0xfb1f1de64240, __in_chrg=<optimized out>) at include/count_min_sketch.h:48
        i = 0
        i = <optimized out>
        i = <optimized out>
Backtrace stopped: Cannot access memory at address 0xfb1fd46878f8
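(For anyone following along, a rough sketch of pulling a full multi-threaded backtrace out of such a dump. The binary path comes from the syslog above; CORE is a placeholder for whatever filename systemd-coredump produced, and the zstd step only applies if the dump was stored compressed, as shown further down.)
Copy code
# If the dump is zstd-compressed, decompress it first (replace CORE with the real name)
zstd -d CORE.zst

# Open it against the matching binary and capture every thread's stack
gdb -q /usr/bin/typesense-server CORE \
    -ex 'set pagination off' \
    -ex 'thread apply all bt full' \
    -ex quit > backtrace.txt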
We just had another crash; it definitely feels less frequent than before, and it's now recovering itself pretty quickly. I’m unsure if we should keep this version running or roll back to v28.
k
Would you be able to share the core dump? It's on x86 right?
a
It’s not yet, but I can change this server out and keep this version running.
k
Yes, let's do that. x86 is easier for me to debug locally.
Also please enable search logging again.
Also, doing that iptables exercise is crucial to find out whether it's search- or indexing-related.
a
Is it simpler to replace a server or add a 4th into the cluster?
k
Both will require updating the IP address file on the existing nodes:
--nodes
Adding a new one is probably easier
a
I think I can force an AWS machine to use a specific IP, so it might be simpler to shut down a node, force the IP to be the same, and start it up with the TS config from the shut-down node. Will try that first.
👍 1
Ok, I have a 4th server in now (same OS, amd64), with search logging and core dumps enabled. Will see how this goes, and then we can block ports if it crashes 🙂
k
👍 awesome
a
This version crashed, I’ll share the dump shortly
Copy code
total 457180
-rw-r----- 1 root root 468140190 May 14 05:23 core.typesense-serve.0.642bf0ce7ce942c39be5e9d0a416132b.4856.1747200173000000.zst
👍 1
Done!
k
Query logs as well?
a
how many before the crash do you need?
k
The last 1 minute, to be safe. I plan to run all the queries through under concurrency this time to check, assuming it's a search issue.
Will be starting on this investigation in an hour or so.
a
Thanks, I’ll just compress the entire log - it’s small. Interestingly, the query before the crash is from a site where we use group_by.
crash is around line 14781
Log on its way
I’ll leave this server running - less nervous as the other 3 are stable 🙂
k
Finally managed to reproduce the crash locally!
🙌 2
a
That’s awesome news!
k
Should be able to provide a fixed build by tomorrow.
a
Cool, I’ll leave this server running but drop it out of the load balancer so it stops getting traffic overnight. We can install an updated build and test production traffic again on it tomorrow
🙌 1
k
29.0.rc22
published with the patch.
a
Thanks! This is installed and back in production - will let you know how it goes 🙂
So far it’s looking good - 1.5 hours into the new release and no crash 🎉
2.5 hours - still stable wahoo
6.5 hours now, no crash - I’d consider this fixed 😄
k
Cool, thank you for all the help. It was a super edge-case bug: it requires a group_by that involves a record with an empty value in one of its array values.
a
No problem at all, thanks for persisting 🙂
I’ll remove the 4th server from our cluster tonight if we see no issues. Is this the preferred way to do that?
1. Shut down TS on the 4th server
2. Update nodes.ini on the 2 remaining followers, restart them, and wait for them to be ready
3. Update nodes.ini on the leader, restart, and wait
Hoping this will be the simplest approach?
k
You don't have to restart the nodes. You can just shut down the 4th node and then remove it from the nodes.ini file.
a
Oh really, that’s cool
same for adding a node, just add to nodes.ini?
k
Yes
🙌 1
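(For reference, a rough sketch of what that edit looks like, assuming the nodes file uses the comma-separated host:peering_port:api_port format; the 4th hostname is hypothetical, and the first three just mirror the machine names seen in the logs above.)
Copy code
# Before (temporary 4-node cluster; the 4th hostname is made up):
prod-typesense1:8107:8108,prod-typesense2:8107:8108,prod-typesense3:8107:8108,prod-typesense4:8107:8108

# After shutting down the 4th node, just drop it from the list on each remaining node
# (no restart needed, per the note above):
prod-typesense1:8107:8108,prod-typesense2:8107:8108,prod-typesense3:8107:8108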