# community-help
a
After this issue https://typesense-community.slack.com/archives/C01P749MET0/p1745807284729069 we seem to have some data that is out of sync and I’m unsure why (or how to sort the issue beyond what I’ve tried). Steps I’ve done:
1. Removed the TS server from our target group to prevent traffic hitting it
2. Stopped the TS process
3. Deleted all data in the /var/lib/typesense directory
4. Restarted the server
After this, the TS server in question has more documents than the other 2 nodes we have, and I’m not sure how that happens. We’re also having an issue with a collection returning blank data (we’re debugging this more now). How can the restored data have more records than the nodes it restored from? When we export the collection from each node, the node1 data file does have 3 extra JSON documents in it which aren’t in the other 2 nodes. Have I missed something in the data delete process? Thanks!
These servers are on v28 (release version)
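For reference, the reset steps above boil down to roughly this (a sketch assuming the stock DEB/systemd install with the data dir at /var/lib/typesense; paths may differ on other setups):
Copy code
# node already removed from the LB target group so it receives no traffic
sudo systemctl stop typesense-server      # stop the TS process
sudo rm -rf /var/lib/typesense/*          # delete all local data
sudo systemctl start typesense-server     # restart; the node re-syncs from the cluster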
m
To add more context - this simplified “filter by” query returns no results on the recovered node1, but does return results on the other two:
Copy code
curl "${TYPESENSE_BASE_URL}/collections/products/documents/search?q=*&filter_by=categoryids:=11" \
-H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"
Node1:
Copy code
{
    "facet_counts": [],
    "found": 0,
    "hits": [],
    "out_of": 2190,
...
Other nodes:
Copy code
"facet_counts": [],
  "found": 60,
  "hits": [...],
  "out_of": 2187,
...
ps. other filter_by queries do return some results on node1, but the counts don’t match
j
Are you using JOINs by any chance?
m
we do for other collections/queries, but not in this one
j
Is this collection referenced by another collection though?
Even if it doesn't have a reference field inside itself
m
yep, an embedding collection referencing this one
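For context, the reference looks roughly like this, a simplified sketch with illustrative collection/field names rather than our exact schema:
Copy code
curl "${TYPESENSE_BASE_URL}/collections" -X POST \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
        "name": "product_embeddings",
        "fields": [
          {"name": "embedding", "type": "float[]", "num_dim": 384},
          {"name": "product_id", "type": "string", "reference": "products.id"}
        ]
      }'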
j
This sounds similar to an issue we fixed in recent RC builds. So you might want to try upgrading to it. That said, there's one known issue with the recent RC build, where large multi_search requests end up causing a crash, which we're still fixing
m
Right, good to know! So what is the preferred action here, to wait for the crash fix and then upgrade? We’ve currently taken node1 out of the LB so it doesn’t get traffic. But we’re also careful with rolling RCs to prod.
j
The multi_search issue doesn't seem to affect all users though, only select data shapes. So you might want to try upgrading 1 out of 3 nodes to the new RC build to see if the issue occurs with your dataset, after letting it run for 24-48 hours, then upgrade the next node, and so on
m
ok great, we’ll try that - thanks for your quick reply!
👍 1
a
Thanks @Jason Bosco should we upgrade the server that is returning no results (it’s a follower) if we’re picking one?
j
Yeah, I'd recommend a follower
a
Thanks, and do we delete the data dir as part of this or let it use the data there?
j
You should be able to re-use the existing data dir
a
Thanks, we did this and put the server back into the LB .. but it crashed, so guessing our data is a use case for the crash you are fixing 🙂
k
Can you please share the stack trace from the crash?
a
Actually looks like we didn’t log one, the server just logged as below and restarted + loaded indexes again.
Copy code
I20250429 01:34:54.736620  2764 raft_server.cpp:692] Term: 47, pending_queue: 0, last_index: 32417417, committed: 32417417, known_applied: 32417417, applying: 0, pending_writes: 0, queued_writes: 0, local_sequence: 345957379
Log file created at: 2025/04/29 01:36:52
Running on machine: prod-typesense1
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
I20250429 01:36:52.914856  3052 typesense_server_utils.cpp:353] Starting Typesense 29.0.rc15
I20250429 01:36:52.914935  3052 typesense_server_utils.cpp:356] Typesense is using jemalloc.
I20250429 01:36:52.915086  3052 typesense_server_utils.cpp:418] Thread pool size: 16
I20250429 01:36:52.916780  3052 store.cpp:40] Initializing DB by opening state dir: /var/lib/typesense/db
I20250429 01:36:52.932957  3052 store.cpp:40] Initializing DB by opening state dir: /var/lib/typesense/meta
I20250429 01:36:52.939391  3052 store.cpp:40] Initializing DB by opening state dir: /var/lib/typesense/analytics/db
I20250429 01:36:52.946942  3052 ratelimit_manager.cpp:546] Loaded 0 rate limit rules.
I20250429 01:36:52.946966  3052 ratelimit_manager.cpp:547] Loaded 0 rate limit bans.
I20250429 01:36:52.947021  3052 typesense_server_utils.cpp:585] Starting API service...
I20250429 01:36:52.947180  3052 http_server.cpp:180] Typesense has started listening on port 8108
I20250429 01:36:52.947865  3177 batched_indexer.cpp:196] Starting batch indexer with 16 threads.
I20250429 01:36:52.947966  3179 typesense_server_utils.cpp:528] Conversation garbage collector thread started.
I20250429 01:36:52.948277  3177 batched_indexer.cpp:201] BatchedIndexer skip_index: -9999
I20250429 01:36:52.953171  3176 server.cpp:1181] Server[braft::RaftStatImpl+braft::FileServiceImpl+braft::RaftServiceImpl+braft::CliServiceImpl] is serving on port=8107.
I20250429 01:36:52.953197  3176 server.cpp:1184] Check out <http://prod-typesense1:8107> in web browser.
I20250429 01:36:52.954855  3176 raft_server.cpp:75] Nodes configuration: 10.50.3.101:8107:8108,10.50.4.82:8107:8108,10.50.5.218:8107:8108
I20250429 01:36:52.955206  3176 log.cpp:690] Use murmurhash32 as the checksum type of appending entries
I20250429 01:36:52.955240  3176 log.cpp:1172] log load_meta /var/lib/typesense/state/log/log_meta first_log_index: 32409936 time: 25
I20250429 01:36:52.955260  3176 log.cpp:1014] restore closed segment, path: /var/lib/typesense/state/log first_index: 32412141 last_index: 32413175
I20250429 01:36:52.955269  3176 log.cpp:1014] restore closed segment, path: /var/lib/typesense/state/log first_index: 32415023 last_index: 32415991
I20250429 01:36:52.955276  3176 log.cpp:1014] restore closed segment, path: /var/lib/typesense/state/log first_index: 32410512 last_index: 32411010
I20250429 01:36:52.955281  3176 log.cpp:1014] restore closed segment, path: /var/lib/typesense/state/log first_index: 32416869 last_index: 32416875
I20250429 01:36:52.955284  3176 log.cpp:1014] restore closed segment, path: /var/lib/typesense/state/log first_index: 32411011 last_index: 32411431
I20250429 01:36:52.955307  3176 log.cpp:1014] restore closed segment, path: /var/lib/typesense/state/log first_index: 32409741 last_index: 32410511
I20250429 01:36:52.955314  3176 log.cpp:1014] restore closed segment, path: /var/lib/typesense/state/log first_index: 32413176 last_index: 32414250
I20250429 01:36:52.955319  3176 log.cpp:1014] restore closed segment, path: /var/lib/typesense/state/log first_index: 32411432 last_index: 32412140
I20250429 01:36:52.955324  3176 log.cpp:1014] restore closed segment, path: /var/lib/typesense/state/log first_index: 32414251 last_index: 32415022
I20250429 01:36:52.955329  3176 log.cpp:1014] restore closed segment, path: /var/lib/typesense/state/log first_index: 32416876 last_index: 32417138
I20250429 01:36:52.955336  3176 log.cpp:1014] restore closed segment, path: /var/lib/typesense/state/log first_index: 32415992 last_index: 32416868
I20250429 01:36:52.955343  3176 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 32409741 last_index: 32410511
I20250429 01:36:52.956046  3176 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 32410512 last_index: 32411010
I20250429 01:36:52.956537  3176 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 32411011 last_index: 32411431
I20250429 01:36:52.956930  3176 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 32411432 last_index: 32412140
I20250429 01:36:52.957600  3176 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 32412141 last_index: 32413175
I20250429 01:36:52.958635  3176 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 32413176 last_index: 32414250
I20250429 01:36:52.959641  3176 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 32414251 last_index: 32415022
I20250429 01:36:52.960390  3176 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 32415023 last_index: 32415991
I20250429 01:36:52.961262  3176 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 32415992 last_index: 32416868
I20250429 01:36:52.962060  3176 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 32416869 last_index: 32416875
I20250429 01:36:52.962085  3176 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 32416876 last_index: 32417138
k
Are you able to reproduce this trivially locally or on a staging env?
a
we can probably try to query the server directly and see if we can make it crash
k
Yeah that will be very useful. With an easy reproduction we can quickly identify and fix and affirm the fix also.
a
All good, working on it - we’ve enabled search logging and can see the query logged right before the crash so we’ll try and replicate this with a shell command and see if we can force it to crash. If we do I’ll compress the data dir and share it with the search we run 🙂
OOI does a search get logged before it runs, so the last search in the logs is likely the issue?
k
--enable-search-logging
logs the query before the search request cycle, so one of the last few queries logged before the crash should be the culprit (if you have high concurrency)
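If useful, a sketch of enabling it, assuming the DEB install’s config file at /etc/typesense/typesense-server.ini:
Copy code
# add under the [server] section of /etc/typesense/typesense-server.ini
# (equivalent to passing --enable-search-logging=true on the command line)
enable-search-logging = true

# then restart for it to take effect
sudo systemctl restart typesense-server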
a
perfect, we’ll try and replicate the crash from these 🙂
👍 1
Nothing useful yet on this - we’ve had a couple of crashes but can’t find a pattern in the queries
Syslog is showing this error when we’re crashing. Where does TS write its core dump (and can we configure something to catch it?)
Copy code
Apr 29 05:10:20 prod-typesense1 typesense-server[5343]: typesense-server: /root/.cache/bazel/_bazel_root/45fa99e3ef9c0d8b8591cceacfe70dd5/sandbox/linux-sandbox/585/execroot/__main__/external/elfutils/libdwfl/derelocate.c:178: cache_sections: Assertion `refs == NULL' failed.
Apr 29 05:10:23 prod-typesense1 systemd[1]: typesense-server.service: Main process exited, code=dumped, status=11/SEGV
Apr 29 05:10:23 prod-typesense1 systemd[1]: typesense-server.service: Failed with result 'core-dump'.
Apr 29 05:10:23 prod-typesense1 systemd[1]: typesense-server.service: Consumed 2min 2.186s CPU time.
Apr 29 05:10:23 prod-typesense1 systemd[1]: typesense-server.service: Scheduled restart job, restart counter is at 2.
Apr 29 05:10:23 prod-typesense1 systemd[1]: typesense-server.service: Consumed 2min 2.186s CPU time.
Apr 29 05:10:23 prod-typesense1 typesense-server[5699]: Log directory is configured as: /var/log/typesense
k
Core dump must be configured. We don't really set it up as part of DEB.
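One way to set this up is roughly the following, assuming systemd-coredump is handling dumps on this distro (which the later journal entries suggest):
Copy code
# allow the service to produce unlimited-size cores
sudo systemctl edit typesense-server
#   [Service]
#   LimitCORE=infinity
sudo systemctl daemon-reload
sudo systemctl restart typesense-server

# if dumps later show up as "inaccessible" or truncated, raise the limits in
# /etc/systemd/coredump.conf (ProcessSizeMax, ExternalSizeMax)

# inspect / extract captured dumps
coredumpctl list
coredumpctl dump /usr/bin/typesense-server -o /tmp/typesense.core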
a
ok, all good - we just had another crash. I’ll see if I can find a common query
👍 1
We’ll try and run some of these queries on our data tomorrow morning and see if we can force a crash. ChatGPT found a bunch of common queries running just before the crashes we saw (attached in case it helps)
k
Ok 👍
a
We’ve run all these queries and can’t replicate the crash with single queries. Is it a concurrency issue? I’m not sure how we’d simulate a lot of traffic aside from production traffic
k
We will have to get a core dump otherwise. We've never had an issue like this which has caused a crash without a stack trace.. the other option is to downgrade 29.0 RC builds down to a version where we can pinpoint which version causes the crash.
Nothing in /var/log/syslog either?
We also see this in syslog. I think the times correlate to the crashes from yesterday but we would need to re-enable and confirm
Copy code
Apr 28 23:53:01 prod-typesense1 typesense-server[482]: E20250428 23:53:01.807976   679 core_api.cpp:633] JSON error: [json.exception.parse_error.101] parse error at line 1, column 1: attempting to parse an empty input; check that your input string or stream contains the expected JSON
Apr 28 23:54:02 prod-typesense1 typesense-server[482]: E20250428 23:54:02.357002   676 core_api.cpp:633] JSON error: [json.exception.parse_error.101] parse error at line 5, column 13: syntax error while parsing object key - unexpected '}'; expected string literal
Apr 29 04:39:38 prod-typesense1 typesense-server[3052]: E20250429 04:39:38.152671  3081 core_api.cpp:843] JSON error: [json.exception.parse_error.101] parse error at line 11, column 13: syntax error while parsing object key - unexpected '}'; expected string literal
Apr 29 04:39:41 prod-typesense1 typesense-server[3052]: E20250429 04:39:41.781852  3082 core_api.cpp:843] JSON error: [json.exception.parse_error.101] parse error at line 11, column 13: syntax error while parsing object key - unexpected '}'; expected string literal
Do you have any ideas on which version we should try as a start?
@Jason Bosco can you let us know which RC version this fix landed in? We’ll start there and can upgrade/downgrade if we see crashes. https://typesense-community.slack.com/archives/C01P749MET0/p1745885906042689?thread_ts=1745881111.998959&cid=C01P749MET0
k
We don't have an RC for that yet, I will be building one shortly. I will share the build when it's done.
a
ah ok, we had some more issues today and have hit a bit of a problem. The remaining servers had threadpool problems and are now not returning the full data sets from the indexes as far as we can tell. Trying to work out the best way forward:
1. We don’t want to upgrade to the latest RC as this has a crashing issue
2. If we stay on v28 it seems like we have data missing from results
k
This new build will have that multi search issue fixed. I will be able to get this published in a couple of hours.
a
the crash?
We’ve got a couple of servers returning blank data now so I think we need to get them on to the latest RC as we did with the server yesterday..
We’ve updated to the latest RC for now - this has sorted the blank responses we were seeing. We’ll update to your new build once confirmed. Thanks for your help
Ok, we’re between a rock and a hard place now. We need the RC for the join bug we were seeing with v28 .. but we’re seeing crashes on RC15 every 5-10 mins currently 💥 We’ll be looking forward to the new RC 😄
k
Will be available in about 45 mins. I've kicked off the deployment.
🙌 1
a
You’ll be my hero for the day, I’m currently babysitting the cluster and restarting things as needed 🔥
Looks like the build we need is there - ok to deploy?
k
Yes, you can go ahead
Hopefully this fixes the crash, if the multi-search body is indeed the root cause. Otherwise, we are back to square 1.
a
files might not be ready yet
k
Yeah the build is not yet over. It publishes stuff in sequence.
You need the ARM binary?
a
yep
getting them now
k
Yes it exists now:
Copy code
https://dl.typesense.org/releases/29.0.rc16/typesense-server-29.0.rc16-linux-arm64.tar.gz
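Swapping in the tarball binary is roughly this sketch (it assumes the tarball contains the typesense-server binary and that the service runs it from /usr/bin, as with the DEB install):
Copy code
wget https://dl.typesense.org/releases/29.0.rc16/typesense-server-29.0.rc16-linux-arm64.tar.gz
tar -xzf typesense-server-29.0.rc16-linux-arm64.tar.gz
sudo systemctl stop typesense-server
sudo cp ./typesense-server /usr/bin/typesense-server
sudo systemctl start typesense-server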
a
Thanks, we’ve just deployed this now and will monitor. We will need to think of a backup if this doesn’t work, as v28 has the join bug that returns blank results for us
Looks like we just had a crash, but all 3 servers crashed at exactly the same time
Logs don’t show stack trace
Copy code
I20250430 05:06:42.392635 17604 raft_server.cpp:692] Term: 69, pending_queue: 0, last_index: 32500782, committed: 32500782, known_applied: 32500782, applying: 0, pending_writes: 0, queued_writes: 0, local_sequence: 346964797
I20250430 05:06:48.530647 17605 batched_indexer.cpp:430] Running GC for aborted requests, req map size: 0
Log file created at: 2025/04/30 05:06:50
Running on machine: prod-typesense1
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
I20250430 05:06:50.916981 17773 typesense_server_utils.cpp:353] Starting Typesense 29.0.rc16
I20250430 05:06:50.917088 17773 typesense_server_utils.cpp:356] Typesense is using jemalloc.
I20250430 05:06:50.917270 17773 typesense_server_utils.cpp:418] Thread pool size: 32
k
Sigh. Let me see if I can create a new branch from v28 GA and pick only the join related bug fixes onto that.
a
I’ll see if we got anything in syslog
syslog shows
Copy code
Apr 30 05:06:50 prod-typesense1 typesense-server[17399]: typesense-server: /root/.cache/bazel/_bazel_root/45fa99e3ef9c0d8b8591cceacfe70dd5/sandbox/linux-sandbox/585/execroot/__main__/external/elfutils/libdwfl/derelocate.c:178: cache_sections: Assertion `refs == NULL' failed.
Apr 30 05:06:50 prod-typesense1 systemd[1]: typesense-server.service: Main process exited, code=killed, status=11/SEGV
Apr 30 05:06:50 prod-typesense1 systemd[1]: typesense-server.service: Failed with result 'signal'.
Apr 30 05:06:50 prod-typesense1 systemd[1]: typesense-server.service: Consumed 1min 28.032s CPU time.
Apr 30 05:06:50 prod-typesense1 systemd[1]: typesense-server.service: Scheduled restart job, restart counter is at 1.
Apr 30 05:06:50 prod-typesense1 systemd[1]: typesense-server.service: Consumed 1min 28.032s CPU time.
Apr 30 05:06:50 prod-typesense1 typesense-server[17773]: Log directory is configured as: /var/log/typesense
Apr 30 05:07:25 prod-typesense1 typesense-server[17773]: E20250430 05:07:25.754397 17986 raft_server.cpp:767] 5987 lagging entries > healthy read lag of 1000
Apr 30 05:07:25 prod-typesense1 typesense-server[17773]: E20250430 05:07:25.755008 17986 raft_server.cpp:779] 5987 lagging entries > healthy write lag of 500
Apr 30 05:07:30 prod-typesense1 typesense-server[17773]: E20250430 05:07:30.424016 17947 raft_server.cpp:779] 938 lagging entries > healthy write lag of 500
interestingly when they crash - the servers sometimes don’t recover and get caught in a loop with this message. The only fix I can find is killing the server, deleting the data and letting it reload
Copy code
I20250430 05:14:18.216732 18879 raft_server.h:294] Configuration of this group is 10.50.4.82:8107:8108,10.50.3.101:8107:8108,10.50.5.218:8107:8108
E20250430 05:14:22.778808 18839 raft_server.cpp:767] 1116 lagging entries > healthy read lag of 1000
E20250430 05:14:22.779194 18839 raft_server.cpp:779] 1116 lagging entries > healthy write lag of 500
I20250430 05:14:23.779407 18839 raft_server.cpp:692] Term: 70, pending_queue: 0, last_index: 32501193, committed: 32501193, known_applied: 32494796, applying: 32500077, pending_writes: 0, queued_writes: 137, local_sequence: 346946438
E20250430 05:14:31.779937 18839 raft_server.cpp:767] 1116 lagging entries > healthy read lag of 1000
E20250430 05:14:31.780007 18839 raft_server.cpp:779] 1116 lagging entries > healthy write lag of 500
I20250430 05:14:33.780211 18839 raft_server.cpp:692] Term: 70, pending_queue: 0, last_index: 32501193, committed: 32501193, known_applied: 32494796, applying: 32500077, pending_writes: 0, queued_writes: 137, local_sequence: 346946438
E20250430 05:14:40.780691 18839 raft_server.cpp:767] 1116 lagging entries > healthy read lag of 1000
E20250430 05:14:40.780755 18839 raft_server.cpp:779] 1116 lagging entries > healthy write lag of 500
I20250430 05:14:43.781015 18839 raft_server.cpp:692] Term: 70, pending_queue: 0, last_index: 32501193, committed: 32501193, known_applied: 32494796, applying: 32500077, pending_writes: 0, queued_writes: 137, local_sequence: 346946438
I20250430 05:14:44.237591 18840 batched_indexer.cpp:430] Running GC for aborted requests, req map size: 137
E20250430 05:14:49.781430 18839 raft_server.cpp:767] 1119 lagging entries > healthy read lag of 1000
E20250430 05:14:49.781500 18839 raft_server.cpp:779] 1119 lagging entries > healthy write lag of 500
The server also doesn’t seem to stop with
sudo systemctl stop typesense-server.service
- I end up killing it
k
We need a core dump. These logs are not helpful unfortunately for the root cause.
a
ok, I’ll check what I can do to configure one
k
Meanwhile I have cherry-picked a few commits on a new branch and will kick off a debug build in a few mins.
Only a few reference-related fixes, nothing else.
a
ok nice, thanks - I think references is what caused our results to be blank 🙌
k
Meanwhile can you try
29.0.rc10
-- this should have all 3 reference related fixes I have cherry-picked.
a
ok, we’ll try that now
cool - tested that on one node and that returns data for us so the fix is good up to here. We’ll deploy this over the other nodes and see if we get crashes.
I think I’ve also configured coredumps now
👍 1
k
If rc10 works then we could possibly try to bisect and identify the problematic build by trying rc13 and so on, on just one of the 3 nodes.
a
sounds good, we’re just deploying to the last node now
k
All good so far?
a
no, confirming this one is crashing as well
different crash though, it seems to hang and responses get slower and slower .. trying to work it out
k
Meanwhile the cherry-pick build is ready:
28.0.refpatch1
-- this contains only 3 reference related patches.
a
ok, nice thanks
we’ll do this now
Ok, we’ve deployed this now but we’re seeing the blank data again 😕 this is what the RC build fixed
k
Ok, we are left with no choice: either we get a core dump, or I need the data dir and some 10-15 queries which we can try running locally with some additional traces and tracking.
I can try running those queries concurrently locally
a
We’re just working out a plan to change the collections that are a problem and roll back to v28 stable
👍 1
k
If you are able to share the data and queries we will be able to try and reproduce the issue though.
a
We’ve just rolled back so hopefully it’s stable .. working on getting results (we might need to change the collection)
Let me compress the data dir now
shared 🙂
We’ve found that running v28 we can delete a document + re-merge it and it fixes our missing data issue in the short term so we’re working on this
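The short-term fix is essentially a delete followed by an upsert of the affected document, roughly like this (the document ID and body here are just illustrative):
Copy code
# remove the affected document
curl -X DELETE "${TYPESENSE_BASE_URL}/collections/products/documents/123" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"

# re-index it
curl -X POST "${TYPESENSE_BASE_URL}/collections/products/documents?action=upsert" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"id": "123", "name": "example product", "categoryids": [11]}'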
k
Got the attachment. Does it also have the queries?
a
no, we’ll need to grab some of these once we get some sites back with data 🙂
We’ve re-indexed and things are looking good for us now. So the crash must be pre v29.RC10
k
Yeah, looks like there are 2 different issues at play:
• a reference/join related bug which has been fixed in v29 RC, but
• another bug which causes the crashes
a
yep agreed. The reference/join bug wasn’t fixed in the special build you made so it must be a different commit.
rc10 crashed for us so that removes versions above this as a potential problem (hopefully)
v28 has been stable since we rolled back
k
Do you know if there were any active writes to the cluster during the crash?
a
most likely, they are pretty busy and sync frequently
k
One way to isolate the issue further:
a) Pause writes to the cluster
b) Upgrade only 1 node to the latest rc16
c) See if it still crashes: this way we know it's purely a search related issue
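While that single node runs the RC, its liveness and version can be checked with something like this (the node address is a placeholder):
Copy code
# basic liveness
curl "http://<node-ip>:8108/health"

# version and raft state (needs the API key)
curl "http://<node-ip>:8108/debug" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"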
a
We’ll struggle to pause writes in production .. we’ll have a think about how we can set up a replica cluster and simulate the traffic (I must not have set up coredumps properly as we didn’t collect any)
Are you seeing the same frequency of crashes in any of your cloud clusters?
k
One other customer reported issues, but they downgraded because they didn't really need the RC build urgently.
I've some clues to what's going wrong. Continuing to work on this, hope to fix this today.
a
Ok thanks, we’re unsure how to test it properly .. we might see if we can replicate traffic to a test cluster. Hard when dev is fine but prod crashes 🙂
Interestingly, we seem to be having a hang issue on v28 today (just one server) - it seems that the threadpool queue just grows over time and the server becomes unhealthy
Copy code
W20250501 04:43:46.288156  2426 threadpool.h:103] Threadpool exhaustion detected, task_queue_len: 464, thread_pool_len: 32
E20250501 04:43:47.029040  2629 raft_server.cpp:758] 1706 lagging entries > healthy read lag of 1000
E20250501 04:43:47.029093  2629 raft_server.cpp:770] 1706 lagging entries > healthy write lag of 500
W20250501 04:43:47.105521  2426 threadpool.h:103] Threadpool exhaustion detected, task_queue_len: 465, thread_pool_len: 32
W20250501 04:43:47.859311  2426 threadpool.h:103] Threadpool exhaustion detected, task_queue_len: 466, thread_pool_len: 32
I20250501 04:43:48.916607  2630 batched_indexer.cpp:422] Running GC for aborted requests, req map size: 12
I20250501 04:43:48.916646  2630 batched_indexer.cpp:430] Stuck req_key: 1746073095916987
I20250501 04:43:48.916652  2630 batched_indexer.cpp:430] Stuck req_key: 1746073096124772
I20250501 04:43:48.916656  2630 batched_indexer.cpp:430] Stuck req_key: 1746073096222929
I20250501 04:43:48.916661  2630 batched_indexer.cpp:430] Stuck req_key: 1746073132203666
I20250501 04:43:48.916666  2630 batched_indexer.cpp:430] Stuck req_key: 1746073132249008
then 10 mins later
Copy code
E20250501 04:55:11.082156  2629 raft_server.cpp:758] 2341 lagging entries > healthy read lag of 1000
E20250501 04:55:11.082221  2629 raft_server.cpp:770] 2341 lagging entries > healthy write lag of 500
I20250501 04:55:14.082486  2629 raft_server.cpp:683] Term: 81, pending_queue: 0, last_index: 32593772, committed: 32593772, known_applied: 32591430, applying: 32591431, pending_writes: 0, queued_writes: 12
, local_sequence: 347997206
W20250501 04:55:15.734237  2426 threadpool.h:103] Threadpool exhaustion detected, task_queue_len: 681, thread_pool_len: 32
W20250501 04:55:16.605015  2426 threadpool.h:103] Threadpool exhaustion detected, task_queue_len: 682, thread_pool_len: 32
W20250501 04:55:17.429947  2426 threadpool.h:103] Threadpool exhaustion detected, task_queue_len: 683, thread_pool_len: 32
W20250501 04:55:18.203313  2426 threadpool.h:103] Threadpool exhaustion detected, task_queue_len: 684, thread_pool_len: 32
Is the expectation that the threadpool should clear (just causing some latency), or that once it’s exhausted the server will stop responding altogether?
k
This is different. If you have a lot of concurrent requests, thread pool starvation can occur. Just increase the default thread pool size by explicitly setting the server-side parameter.
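A sketch of that change, assuming the thread-pool-size server option and the DEB config file at /etc/typesense/typesense-server.ini:
Copy code
# under [server] in /etc/typesense/typesense-server.ini
# (or pass --thread-pool-size=64 on the command line)
thread-pool-size = 64

# restart for it to take effect
sudo systemctl restart typesense-server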
a
ok thanks, I’ve made it 64 .. but I guess we can increase it further as the CPU on the server is still ~4-5%
How are your leads on the fix for the crashing issue going?
k
Just publishing
29.0.rc17
which should be available shortly. This fix worked for another customer who was having similar issues.
a
ok thanks, can see the new build now.
Is the best approach for us to install this on one node, then have that node in production?
k
Yes, upgrade only 1 node and check if there is a crash like last time.
a
Thanks, we’ll do that now
What was the fix out of interest?
k
Had to revert a totally unrelated PR. Need to see what blew up, haven't gotten around to that. I was just trying to revert PRs to check what works, and caught this.
a
interesting!
We’ve put one server with this version into production
We’ll keep an eye on it and let you know how it goes
👍 1
Nearly 15 mins in and so far it’s looking ok - we would have had a crash by now with the previous versions I think. We’ll monitor for the next few hours.
k
Yes I'm reasonably confident that this fixes it. We will now figure out what went wrong with that PR.
a
Great news, we’re ~25 mins in and no crash with quite high traffic (evening for us) so looking good.
🙌 1
Ok, looks like we might have had a crash - but it was ~1 hour in
k
No stack trace?
In the other customer cluster that had this issue, it has been running without any problems for 3+ hours.
a
no stack trace, and the odd thing is it doesn’t recover after the crash - it restarts, but then gets stuck logging this sort of message. I just manually stopped/started it and it’s back to healthy .. but it’s strange that the restart after the crash stays unhealthy
Copy code
I20250501 06:49:03.587236  8557 batched_indexer.cpp:430] Running GC for aborted requests, req map size: 95
E20250501 06:49:04.974900  8556 raft_server.cpp:767] 5101 lagging entries > healthy read lag of 1000
E20250501 06:49:04.974954  8556 raft_server.cpp:779] 5101 lagging entries > healthy write lag of 500
I20250501 06:49:07.975206  8556 raft_server.cpp:692] Term: 81, pending_queue: 0, last_index: 32599607, committed: 32599607, known_applied: 32588619, applying: 32594506, pending_writes: 0, queued_writes: 103, local_sequence: 348033533
E20250501 06:49:13.975621  8556 raft_server.cpp:767] 5101 lagging entries > healthy read lag of 1000
E20250501 06:49:13.975673  8556 raft_server.cpp:779] 5101 lagging entries > healthy write lag of 500
I20250501 06:49:17.975948  8556 raft_server.cpp:692] Term: 81, pending_queue: 0, last_index: 32599607, committed: 32599607, known_applied: 32588619, applying: 32594506, pending_writes: 0, queued_writes: 103, local_sequence: 348033533
E20250501 06:49:22.976302  8556 raft_server.cpp:767] 5101 lagging entries > healthy read lag of 1000
E20250501 06:49:22.976364  8556 raft_server.cpp:779] 5101 lagging entries > healthy write lag of 500
I20250501 06:49:27.976739  8556 raft_server.cpp:692] Term: 81, pending_queue: 0, last_index: 32599607, committed: 32599607, known_applied: 32588619, applying: 32594506, pending_writes: 0, queued_writes: 103, local_sequence: 348033533
E20250501 06:49:31.977033  8556 raft_server.cpp:767] 5117 lagging entries > healthy read lag of 1000
E20250501 06:49:31.977088  8556 raft_server.cpp:779] 5117 lagging entries > healthy write lag of 500
I20250501 06:49:37.977525  8556 raft_server.cpp:692] Term: 81, pending_queue: 0, last_index: 32599663, committed: 32599663, known_applied: 32588619, applying: 32594506, pending_writes: 0, queued_writes: 103, local_sequence: 348033533
k
On restart it worked?
a
yep
sudo systemctl stop typesense-server.service
took about 1 min .. but then after starting it, it catches up:
Copy code
E20250501 06:54:13.843116  9841 raft_server.cpp:767] 11175 lagging entries > healthy read lag of 1000
E20250501 06:54:13.843695  9841 raft_server.cpp:779] 11175 lagging entries > healthy write lag of 500
I20250501 06:54:13.843722  9841 raft_server.h:299] Node starts following { leader_id=10.50.3.101:8107:8108, term=81, status=Follower receives message from new leader with the same term.}
I20250501 06:54:13.911813  9843 log.cpp:536] close a full segment. Current first_index: 32599001 last_index: 32599871 raft_sync_segments: 0 will_sync: 1 path: /var/lib/typesense/state/log/log_00000000000032599001_00000000000032599871
I20250501 06:54:13.911964  9843 log.cpp:550] Renamed `/var/lib/typesense/state/log/log_inprogress_00000000000032599001' to `/var/lib/typesense/state/log/log_00000000000032599001_00000000000032599871'
I20250501 06:54:13.912024  9843 log.cpp:114] Created new segment `/var/lib/typesense/state/log/log_inprogress_00000000000032599872' with fd=65
E20250501 06:54:22.696559  9809 raft_server.cpp:767] 8019 lagging entries > healthy read lag of 1000
E20250501 06:54:22.697060  9809 raft_server.cpp:779] 8019 lagging entries > healthy write lag of 500
I20250501 06:54:23.697280  9809 raft_server.cpp:692] Term: 81, pending_queue: 0, last_index: 32599895, committed: 32599895, known_applied: 32588619, applying: 32591899, pending_writes: 0, queued_writes: 30, local_sequence: 348002260
E20250501 06:54:31.698508  9809 raft_server.cpp:767] 4583 lagging entries > healthy read lag of 1000
E20250501 06:54:31.698585  9809 raft_server.cpp:779] 4583 lagging entries > healthy write lag of 500
I20250501 06:54:33.698799  9809 raft_server.cpp:692] Term: 81, pending_queue: 0, last_index: 32599895, committed: 32599895, known_applied: 32588619, applying: 32596816, pending_writes: 0, queued_writes: 73, local_sequence: 348062604
E20250501 06:54:40.700843  9809 raft_server.cpp:767] 1401 lagging entries > healthy read lag of 1000
E20250501 06:54:40.700923  9809 raft_server.cpp:779] 1401 lagging entries > healthy write lag of 500
I20250501 06:54:43.702620  9809 raft_server.cpp:692] Term: 81, pending_queue: 0, last_index: 32599895, committed: 32599895, known_applied: 32588619, applying: 32599610, pending_writes: 0, queued_
Syslog shows
Copy code
May  1 06:25:33 prod-typesense2 typesense-server[7126]: typesense-server: /root/.cache/bazel/_bazel_root/45fa99e3ef9c0d8b8591cceacfe70dd5/sandbox/linux-sandbox/370/execroot/__main__/external/elfutils/libdwfl/derelocate.c:178: cache_sections: Assertion `refs == NULL' failed.
May  1 06:25:40 prod-typesense2 systemd[1]: typesense-server.service: Main process exited, code=dumped, status=11/SEGV
May  1 06:25:40 prod-typesense2 systemd[1]: typesense-server.service: Failed with result 'core-dump'.
May  1 06:25:40 prod-typesense2 systemd[1]: typesense-server.service: Consumed 3min 39.444s CPU time.
May  1 06:25:40 prod-typesense2 systemd[1]: typesense-server.service: Scheduled restart job, restart counter is at 1.
May  1 06:25:40 prod-typesense2 systemd[1]: typesense-server.service: Consumed 3min 39.444s CPU time.
May  1 06:25:40 prod-typesense2 typesense-server[8285]: Log directory is configured as: /var/log/typesense
May  1 06:26:08 prod-typesense2 typesense-server[8285]: E20250501 06:26:08.126158  8570 raft_server.cpp:767] 9922 lagging entries > healthy read lag of 1000
May  1 06:26:08 prod-typesense2 typesense-server[8285]: E20250501 06:26:08.126204  8570 raft_server.cpp:779] 9922 lagging entries > healthy write lag of 500
k
Ok will check.
a
It didn’t generate a core-dump so trying to work out how to enable that
Crashed on restart this time. It did dump a stack trace:
Copy code
E20250501 07:19:37.570082 13160 backward.hpp:4200] Stack trace (most recent call last) in thread 13160:
E20250501 07:19:37.570628 13160 backward.hpp:4200] #10   Object "/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1", at 0xffffffffffffffff, in 
E20250501 07:19:37.570703 13160 backward.hpp:4200] #9    Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0xeab096525edb, in 
E20250501 07:19:37.570770 13160 backward.hpp:4200] #8    Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0xeab0964bd5b7, in 
E20250501 07:19:37.570823 13160 backward.hpp:4200] #7    Object "/usr/bin/typesense-server", at 0xafb271b5c68b, in execute_native_thread_routine
E20250501 07:19:37.570878 13160 backward.hpp:4200] #6  | Source "include/threadpool.h", line 59, in operator()
E20250501 07:19:37.570930 13160 backward.hpp:4200]       Source "/usr/include/c++/10/future", line 1592, in ThreadPool [0xafb26f59801f]
E20250501 07:19:37.570984 13160 backward.hpp:4200] #5  | Source "/usr/include/c++/10/future", line 1459, in _M_set_result
E20250501 07:19:37.571036 13160 backward.hpp:4200]     | Source "/usr/include/c++/10/future", line 412, in call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter>()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter>()>*, bool*>
E20250501 07:19:37.571091 13160 backward.hpp:4200]     | Source "/usr/include/c++/10/mutex", line 729, in __gthread_once
E20250501 07:19:37.571147 13160 backward.hpp:4200]       Source "/usr/include/aarch64-linux-gnu/c++/10/bits/gthr-default.h", line 700, in _M_run [0xafb26f7776a3]
E20250501 07:19:37.571211 13160 backward.hpp:4200] #4    Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0xeab0964c25c3, in 
E20250501 07:19:37.571266 13160 backward.hpp:4200] #3  | Source "/usr/include/c++/10/future", line 572, in operator()
E20250501 07:19:37.571283 13160 backward.hpp:4200]       Source "/usr/include/c++/10/bits/std_function.h", line 622, in _M_do_set [0xafb26f59732b]
E20250501 07:19:37.571298 13160 backward.hpp:4200] #2  | Source "/usr/include/c++/10/bits/std_function.h", line 292, in __invoke_r<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter>, std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::__future_base::_Task_state<_Fn, _Alloc, _Res(_Args ...)>::_M_run<std::_Bind<Index::batch_memory_index(Index*, std::vector<index_record>&, const string&, const tsl::htrie_map<char, field>&, const tsl::htrie_map<char, field>&, const string&, const std::vector<char>&, const std::vector<char>&, bool, size_t, size_t, size_t, bool, bool, const tsl::htrie_map<char, field>&, const string&, const spp::sparse_hash_map<std::__cxx11::basic_string<char>, std::set<reference_pair_t> >&)::<lambda()>()>, std::allocator<int>, void, {}>::<lambda()>, void>&>
E20250501 07:19:37.571308 13160 backward.hpp:4200]     | Source "/usr/include/c++/10/bits/invoke.h", line 115, in __invoke_impl<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::__future_base::_Task_state<_Fn, _Alloc, _Res(_Args ...)>::_M_run<std::_Bind<Index::batch_memory_index(Index*, std::vector<index_record>&, const string&, const tsl::htrie_map<char, field>&, const tsl::htrie_map<char, field>&, const string&, const std::vector<char>&, const std::vector<char>&, bool, size_t, size_t, size_t, bool, bool, const tsl::htrie_map<char, field>&, const string&, const spp::sparse_hash_map<std::__cxx11::basic_string<char>, std::set<reference_pair_t> >&)::<lambda()>()>, std::allocator<int>, void, {}>::<lambda()>, void>&>
E20250501 07:19:37.571323 13160 backward.hpp:4200]     | Source "/usr/include/c++/10/bits/invoke.h", line 60, in operator()
E20250501 07:19:37.571329 13160 backward.hpp:4200]     | Source "/usr/include/c++/10/future", line 1397, in operator()
E20250501 07:19:37.571334 13160 backward.hpp:4200]     | Source "/usr/include/c++/10/future", line 1456, in __invoke_r<void, std::_Bind<Index::batch_memory_index(Index*, std::vector<index_record>&, const string&, const tsl::htrie_map<char, field>&, const tsl::htrie_map<char, field>&, const string&, const std::vector<char>&, const std::vector<char>&, bool, size_t, size_t, size_t, bool, bool, const tsl::htrie_map<char, field>&, const string&, const spp::sparse_hash_map<std::__cxx11::basic_string<char>, std::set<reference_pair_t> >&)::<lambda()>()>&>
E20250501 07:19:37.571341 13160 backward.hpp:4200]     | Source "/usr/include/c++/10/bits/invoke.h", line 110, in __invoke_impl<void, std::_Bind<Index::batch_memory_index(Index*, std::vector<index_record>&, const string&, const tsl::htrie_map<char, field>&, const tsl::htrie_map<char, field>&, const string&, const std::vector<char>&, const std::vector<char>&, bool, size_t, size_t, size_t, bool, bool, const tsl::htrie_map<char, field>&, const string&, const spp::sparse_hash_map<std::__cxx11::basic_string<char>, std::set<reference_pair_t> >&)::<lambda()>()>&>
E20250501 07:19:37.571346 13160 backward.hpp:4200]     | Source "/usr/include/c++/10/bits/invoke.h", line 60, in operator()<>
E20250501 07:19:37.571353 13160 backward.hpp:4200]     | Source "/usr/include/c++/10/functional", line 499, in __call<void>
E20250501 07:19:37.571359 13160 backward.hpp:4200]     | Source "/usr/include/c++/10/functional", line 416, in __invoke<Index::batch_memory_index(Index*, std::vector<index_record>&, const string&, const tsl::htrie_map<char, field>&, const tsl::htrie_map<char, field>&, const string&, const std::vector<char>&, const std::vector<char>&, bool, size_t, size_t, size_t, bool, bool, const tsl::htrie_map<char, field>&, const string&, const spp::sparse_hash_map<std::__cxx11::basic_string<char>, std::set<reference_pair_t> >&)::<lambda()>&>
E20250501 07:19:37.571364 13160 backward.hpp:4200]     | Source "/usr/include/c++/10/bits/invoke.h", line 95, in __invoke_impl<void, Index::batch_memory_index(Index*, std::vector<index_record>&, const string&, const tsl::htrie_map<char, field>&, const tsl::htrie_map<char, field>&, const string&, const std::vector<char>&, const std::vector<char>&, bool, size_t, size_t, size_t, bool, bool, const tsl::htrie_map<char, field>&, const string&, const spp::sparse_hash_map<std::__cxx11::basic_string<char>, std::set<reference_pair_t> >&)::<lambda()>&>
E20250501 07:19:37.571370 13160 backward.hpp:4200]       Source "/usr/include/c++/10/bits/invoke.h", line 60, in _M_invoke [0xafb26f7abce3]
E20250501 07:19:37.571375 13160 backward.hpp:4200] #1    Source "src/index.cpp", line 679, in operator() [0xafb26f7ab8e7]
E20250501 07:19:37.571380 13160 backward.hpp:4200] #0    Source "src/index.cpp", line 1000, in index_field_in_memory [0xafb26f7aab70]
I20250501 07:19:38.666558 13160 housekeeper.cpp:96] No in-flight search queries were found.
E20250501 07:19:38.666616 13160 typesense_server.cpp:152] Typesense 29.0.rc17 is terminating abruptly.
k
Ok I will investigate and get back to you.
Do you have vector fields in your schema?
m
Yes we do in some collections, and also reference fields too
k
Ok
If it crashed on restart, would you be able to zip that directory and share with me again?
Unlike run-time crashes, crashes during a restart are very predictable and should be easy to debug.
As long as the copy of the data directory is done before the next snapshot runs in ~60 mins after restart.
a
yep, will do that now
Also looks like we dumped a core
Copy code
coredumpctl list
TIME                          PID UID GID SIG     COREFILE     EXE                       SIZE
Thu 2025-05-01 07:19:47 UTC 13131   0   0 SIGSEGV inaccessible /usr/bin/typesense-server  n/a
(I think)
It did restart successfully on the second attempt FYI
actually no, that’s not right - tweaked some config so hopefully we get a file on next crash (thanks chatgpt)
will share the data dir with you soon
k
Since it's getting stuck on restart I'm pretty sure I can reproduce it.. when a write causes a crash we try to skip it on the next restart. Which is why it could be working on the second restart
So when I debug, I'll disable this skipping behavior to trigger the crash consistently.
a
file is shared 🙂
k
Thanks. I will check it out. It will take a few hours to get to the bottom of it.
Found the issue. This is a different issue and is unrelated to the crash. There is a race condition when related collections update each other’s fields, which sometimes leads to a deadlock. That’s why it sometimes works on restart. We are working on a fix.
a
Nice, thanks - I ended up rolling this back to v28 and clearing the data to get the server to restart 🙂
but can try a new version tomorrow and hopefully capture a coredump of the crash
👍 1
k
Thanks, appreciate the help.
a
Think I have a couple of coredump files too now .. want me to share these?
Copy code
ubuntu@prod-typesense2:~$ ls -l /var/lib/systemd/coredump
total 906416
-rw-r--r-- 1 root root 479184660 May  1 07:19 core.typesense-serve.0.ab330c872f2e49898a067a7f02940c64.13131.1746083978000000.zst
-rw-r--r-- 1 root root 448965264 May  1 07:55 core.typesense-serve.0.ab330c872f2e49898a067a7f02940c64.14209.1746086125000000.zst
one might be the crash on startup, but I think the other was the crash we saw running in prod
k
No let's first fix this issue we found. We can then look at how it behaves after that.
a
ok all good, in theory we’ll capture coredumps if the new version crashes from now on 🙂
Thanks for being so responsive on this, I’m signing out for the night and can help track any issues down tomorrow with an updated build 👋
k
Good night!
a
Keen to know how you got on once you are online. We’ve had some more thread pool issues even with increasing the limit .. and get into the race condition with the restart which is fun 😐
Think this is the race condition - I can’t get this server to come online. It’s stuck and I’ve tried killing the process, stopping/starting, stopping/deleting data/starting, with no luck unfortunately
Copy code
E20250501 23:52:13.640734  2593 raft_server.cpp:758] 2029 lagging entries > healthy read lag of 1000
E20250501 23:52:13.640801  2593 raft_server.cpp:770] 2029 lagging entries > healthy write lag of 500
I20250501 23:52:14.640938  2593 raft_server.cpp:683] Term: 84, pending_queue: 0, last_index: 32652090, committed: 32652090, known_applied: 32649217, applying: 32650061, pending_writes: 0, queued_writes: 24, local_sequence: 348793010
E20250501 23:52:22.641407  2593 raft_server.cpp:758] 2043 lagging entries > healthy read lag of 1000
E20250501 23:52:22.641503  2593 raft_server.cpp:770] 2043 lagging entries > healthy write lag of 500
I20250501 23:52:24.641702  2593 raft_server.cpp:683] Term: 84, pending_queue: 0, last_index: 32652104, committed: 32652104, known_applied: 32649217, applying: 32650061, pending_writes: 0, queued_writes: 24, local_sequence: 348793010
E20250501 23:52:31.642194  2593 raft_server.cpp:758] 2043 lagging entries > healthy read lag of 1000
E20250501 23:52:31.642263  2593 raft_server.cpp:770] 2043 lagging entries > healthy write lag of 500
I20250501 23:52:34.642511  2593 raft_server.cpp:683] Term: 84, pending_queue: 0, last_index: 32652104, committed: 32652104, known_applied: 32649217, applying: 32650061, pending_writes: 0, queued_writes: 24, local_sequence: 348793010
E20250501 23:52:40.642927  2593 raft_server.cpp:758] 2076 lagging entries > healthy read lag of 1000
E20250501 23:52:40.642990  2593 raft_server.cpp:770] 2076 lagging entries > healthy write lag of 500
I20250501 23:52:44.643304  2593 raft_server.cpp:683] Term: 84, pending_queue: 0, last_index: 32652137, committed: 32652137, known_applied: 32649217, applying: 32650061, pending_writes: 0, queued_writes: 24, local_sequence: 348793010
E20250501 23:52:49.643666  2593 raft_server.cpp:758] 2116 lagging entries > healthy read lag of 1000
E20250501 23:52:49.643735  2593 raft_server.cpp:770] 2116 lagging entries > healthy write lag of 500
I20250501 23:52:54.644065  2593 raft_server.cpp:683] Term: 84, pending_queue: 0, last_index: 32652177, committed: 32652177, known_applied: 32649217, applying: 32650061, pending_writes: 0, queued_writes: 24, local_sequence: 348793010
E20250501 23:52:58.644340  2593 raft_server.cpp:758] 2116 lagging entries > healthy read lag of 1000
E20250501 23:52:58.644420  2593 raft_server.cpp:770] 2116 lagging entries > healthy write lag of 500
I20250501 23:53:04.644853  2593 raft_server.cpp:683] Term: 84, pending_queue: 0, last_index: 32652177, committed: 32652177, known_applied: 32649217, applying: 32650061, pending_writes: 0, queued_writes: 24, local_sequence: 348793010
E20250501 23:53:07.645090  2593 raft_server.cpp:758] 2116 lagging entries > healthy read lag of 1000
E20250501 23:53:07.645167  2593 raft_server.cpp:770] 2116 lagging entries > healthy write lag of 500
I20250501 23:53:10.053274  2594 batched_indexer.cpp:422] Running GC for aborted requests, req map size: 24
I20250501 23:53:14.645665  2593 raft_server.cpp:683] Term: 84, pending_queue: 0, last_index: 32652177, committed: 32652177, known_applied: 32649217, applying: 32650061, pending_writes: 0, queued_writes: 24, local_sequence: 348793010
E20250501 23:53:16.645853  2593 raft_server.cpp:758] 2116 lagging entries > healthy read lag of 1000
E20250501 23:53:16.645912  2593 raft_server.cpp:770] 2116 lagging entries > healthy write lag of 500
I20250501 23:53:24.646462  2593 raft_server.cpp:683] Term: 84, pending_queue: 0, last_index: 32652177, committed: 32652177, known_applied: 32649217, applying: 32650061, pending_writes: 0, queued_writes: 24, local_sequence: 348793010
E20250501 23:53:25.646586  2593 raft_server.cpp:758] 2116 lagging entries > healthy read lag of 1000
E20250501 23:53:25.646653  2593 raft_server.cpp:770] 2116 lagging entries > healthy write lag of 500
I20250501 23:53:34.647258  2593 raft_server.cpp:683] Term: 84, pending_queue: 0, last_index: 32652177, committed: 32652177, known_applied: 32649217, applying: 32650061, pending_writes: 0, queued_writes: 24, local_sequence: 348793010
E20250501 23:53:34.647316  2593 raft_server.cpp:758] 2116 lagging entries > healthy read lag of 1000
E20250501 23:53:34.647337  2593 raft_server.cpp:770] 2116 lagging entries > healthy write lag of 500
E20250501 23:53:43.647897  2593 raft_server.cpp:758] 2116 lagging entries > healthy read lag of 1000
E20250501 23:53:43.647960  2593 raft_server.cpp:770] 2116 lagging entries > healthy write lag of 500
I20250501 23:53:44.648094  2593 raft_server.cpp:683] Term: 84, pending_queue: 0, last_index: 32652177, committed: 32652177, known_applied: 32649217, applying: 32650061, pending_writes: 0, queued_writes: 24, local_sequence: 348793010
E20250501 23:53:52.648653  2593 raft_server.cpp:758] 2117 lagging entries > healthy read lag of 1000
E20250501 23:53:52.648728  2593 raft_server.cpp:770] 2117 lagging entries > healthy write lag of 500
I20250501 23:53:54.648919  2593 raft_server.cpp:683] Term: 84, pending_queue: 0, last_index: 32652179, committed: 32652179, known_applied: 32649217, applying: 32650061, pending_writes: 0, queued_writes: 24, local_sequence: 348793010
I’ve grabbed a copy of the data dir just in case
Ok, it came back up but I’m guessing this is after the leader did a snapshot .. so in theory when this race happens I can force the leader to snapshot with an API call, then clear the data and restart the server and it’ll be ok (that’s my theory)
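The API call referred to here is presumably the snapshot endpoint, roughly like this (the snapshot path is just an example):
Copy code
curl -X POST "${TYPESENSE_BASE_URL}/operations/snapshot?snapshot_path=/tmp/typesense-snapshot" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"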
One more thing to note: we’ve only had threadpool issues this week and continue to have them. The only change we made earlier this week was to use this new flag, so we’re also going to revert that and see how the cluster behaves FYI (our traffic hasn’t increased, and we’re now setting the threadpool to 128 but still exhausting it, yet CPU on the servers is ~4%): https://typesense.org/docs/28.0/api/joins.html#asynchronous-references
traffic has been very steady for us over the last 2 weeks for reference.
k
The fix is not trivial so it might take a few days. The problem is happening because the fields in the collections form cyclic references and the current implementation is not fully foolproof against this, so it ends up causing a deadlock when two such cyclically referring collections are written to at the same time.
a
do you know which fields/collections? We can take a look and see if we can make a change for this 🙂
k
Here is one example:
Copy code
glassons-com:product:embeddings -> glassons-com:products:sg
glassons-com:products:sg -> glassons-com:product:embeddings
m
In ^this setup, only the embeddings collection has a reference field to products, and we use it to search for similar products via JOIN. How does the cyclic reference happen?
We can share schemas/queries if needed
k
Sorry, the cycle I mentioned is an artifact of how we have to do a back reference / link even though it's a one-way reference.
m
Oh, I see, thanks for the clarification! Is there anything we can do on our end other than dropping the reference field?
k
No, I think we need to remove this link on our end or at least make it not deadlock in the way it does currently. Hopefully we will have a fix by EOD.
🙌 2
a
Nice, thanks - did you see the note about the change we added earlier this week, which we’ve now reverted? Hoping this will also stop the other issues we’ve been seeing (the only customer we have that doesn’t load data properly after a restart had this set, so we’re also hoping their data returns as normal if we do see another issue)
Copy code
"async_reference": true
k
async_reference
just means that you don’t have to follow indexing order when indexing related collections. I’m not sure how that could lead to thread starvation. Perhaps the deadlock is just using up the threads. We will wait for this fix to land for further assessment.
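For reference, the flag sits on the reference field in the collection schema, roughly like this (illustrative collection/field names):
Copy code
{
  "name": "product_embeddings",
  "fields": [
    {
      "name": "product_id",
      "type": "string",
      "reference": "products.id",
      "async_reference": true
    }
  ]
}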
👌 1
a
Thanks!
k
I've published a patch in
29.0.rc18
which I have just pushed.
a
Thanks, do you think the async change could have affected threadpools? We’ve had 0 issues since recreating the embedding collection without this setting. We can potentially test this version on the server we have with coredumps enabled, in case we see more crashes, but we’re aware we’ve also changed our data structure 🙂
k
The bug that caused race conditions could have caused the threads to hang leading to thread starvation.
a
Ok, thanks - that makes sense 🙂
Just to summarise, and so we’re really clear (we’re cautious about upgrading for a bit), there were multiple issues we were seeing:
1. v28 had a bug with async collections that, when restarted, caused the collection to return inconsistent results (sometimes blank)
2. v28 also had a bug with async references that caused a race condition that starved the threadpool (this was the main issue we faced last week)
3. v29, when we upgraded, had a crash issue with certain data sets; we hit this as well, and I think this was fixed with a revert of a PR that was to do with filtering nested objects (was this the regex issue @Jason Bosco referred to?)
We’ve since deleted the reference collection we had that we’d set async references on, and this seems to have made v28 stable again for us. In theory we can also now upgrade to the v29 RC as this has also fixed the crash bug we were seeing? Apologies for the questions, but we had a bit of downtime this week that we’re not keen to replicate, as you can imagine 🙂
k
Async references is a new feature and some of these issues show up only under very high concurrent reads and writes, as is the case with your deployment. It's impossible to uncover these cases without some real-world battle testing. Our goal is to fix them proactively. The crash issue in v29 is a totally different issue and is triggered only when you have a really large filter condition, like one containing a large number of IDs for deletion etc. The regex on such a large filter string caused high CPU load and stack smashing. The reverted PR has now been restored along with the correct fix. This particular regex-related issue does not exist on v28 since that feature does not exist in v28.
a
Thanks, I appreciate the work you guys have put in to these updates 🙂 We’ll upgrade one of our prod machines to the latest RC and see how it behaves!
We’ve deployed that new RC on one server in production 🙂 we’ll keep an eye on it and let you know how it goes!
k
In general, updating only one of the nodes in an HA cluster will help catch these issues.
a
That server has stopped responding for us, will restart it now
Logs show that it stalls on updates
Copy code
I20250503 06:23:23.337743 14248 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 32753082, committed: 32753082, known_applied: 32752998, applying: 0, pending_writes: 0, queued_writes: 135, local_sequence: 350055580
I20250503 06:23:33.343246 14248 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 32753111, committed: 32753111, known_applied: 32752998, applying: 0, pending_writes: 0, queued_writes: 135, local_sequence: 350055580
I20250503 06:23:43.348913 14248 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 32753111, committed: 32753111, known_applied: 32752998, applying: 0, pending_writes: 0, queued_writes: 135, local_sequence: 350055580
I20250503 06:23:53.355930 14248 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 32753112, committed: 32753112, known_applied: 32752998, applying: 0, pending_writes: 0, queued_writes: 135, local_sequence: 350055580
I20250503 06:24:03.361439 14248 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 32753129, committed: 32753129, known_applied: 32752998, applying: 0, pending_writes: 0, queued_writes: 135, local_sequence: 350055580
I20250503 06:24:13.366940 14248 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 32753129, committed: 32753129, known_applied: 32752998, applying: 0, pending_writes: 0, queued_writes: 135, local_sequence: 350055580
I20250503 06:24:23.373948 14248 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 32753129, committed: 32753129, known_applied: 32752998, applying: 0, pending_writes: 0, queued_writes: 135, local_sequence: 350055580
I20250503 06:24:33.379421 14248 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 32753129, committed: 32753129, known_applied: 32752998, applying: 0, pending_writes: 0, queued_writes: 135, local_sequence: 350055580
I20250503 06:24:43.384934 14248 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 32753135, committed: 32753134, known_applied: 32752998, applying: 0, pending_writes: 0, queued_writes: 135, local_sequence: 350055580
k
Ok, can you please share data dir again. I suspect there are still lurking race conditions.
a
👌 give me a few mins
👍 1
shared now 🎉
k
Thanks, will investigate and get back to you.
Btw, did the restart work? Or is it hanging only during live indexing? Earlier it used to hang on restart too; what's the behavior now?
a
I rolled this back to v28 for restart (I think) but we’ve also removed the async reference collection we had
If you need the dir before a restart I can go back to v29RC and see if it stalls again
then get the data dir after the server is stopped, before restarting?
k
Yes, I am running with this dataset and it's not hanging
a
ok all good. so re-install v29RC - when it stalls, stop it and send the data without restart?
k
No, the thing is, only if the server blocks again on restart can we reproduce that and debug it locally. If it hangs but recovers on restart, it means the race condition is not reproducible on restart, and a restart-reproducible hang is what we need for debugging.
a
ok, so if it hangs, try a restart, and if it still hangs, send the data?
k
Correct
👌 1
a
Looks like we had a crash but the process restarted and recovered ok. I’ve got a core dump, want me to share it?
^ Have shared it with you just in case
👍 1
k
Crash without a stack trace?
a
yep, no stack trace
Copy code
I20250503 08:34:00.164059 19804 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 32758408, committed: 32758408, known_applied: 32758408, applying: 0, pending_writes: 0, queued_writes: 0, local_sequence: 350152875
I20250503 08:34:10.169531 19804 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 32758422, committed: 32758422, known_applied: 32758422, applying: 0, pending_writes: 0, queued_writes: 0, local_sequence: 350153090
Log file created at: 2025/05/03 08:34:20
Running on machine: prod-typesense2
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
I20250503 08:34:20.575286 20378 typesense_server_utils.cpp:353] Starting Typesense 29.0.rc19
I20250503 08:34:20.575393 20378 typesense_server_utils.cpp:356] Typesense is using jemalloc.
will check syslog for anything more
syslog shows
Copy code
May  3 08:34:10 prod-typesense2 typesense-server[19338]: typesense-server: /root/.cache/bazel/_bazel_root/45fa99e3ef9c0d8b8591cceacfe70dd5/sandbox/linux-sandbox/370/execroot/__main__/external/elfutils/libdwfl/derelocate.c:178: cache_sections: Assertion `refs == NULL' failed.
May  3 08:34:11 prod-typesense2 systemd[1]: Started Process Core Dump (PID 20370/UID 0).
May  3 08:34:19 prod-typesense2 systemd-coredump[20371]: Process 19338 (typesense-serve) of user 0 dumped core.#012#012Found module /usr/bin/typesense-server with build-id: 32307edc4d4afaec#012Found module /usr/bin/typesense-server with build-id: 32307edc4d4afaec#012Found module /usr/bin/typesense-server with build-id: 32307edc4d4afaec#012Stack trace of thread 19508:#012#0  0x0000be315c6337b0 edata_list_inactive_remove (/usr/bin/typesense-server + 0x48d37b0)#012#012Stack trace of thread 19347:#012#0  0x0000e5130caf9de8 n/a (n/a + 0x0)#012#1  0x0000e5130caf9dc8 n/a (n/a + 0x0)
May  3 08:34:20 prod-typesense2 systemd[1]: systemd-coredump@1-20370-0.service: Deactivated successfully.
May  3 08:34:20 prod-typesense2 systemd[1]: systemd-coredump@1-20370-0.service: Consumed 6.583s CPU time.
May  3 08:34:20 prod-typesense2 systemd[1]: typesense-server.service: Main process exited, code=dumped, status=11/SEGV
May  3 08:34:20 prod-typesense2 systemd[1]: typesense-server.service: Failed with result 'core-dump'.
May  3 08:34:20 prod-typesense2 systemd[1]: typesense-server.service: Consumed 3min 32.561s CPU time.
May  3 08:34:20 prod-typesense2 systemd[1]: typesense-server.service: Scheduled restart job, restart counter is at 2.
May  3 08:34:20 prod-typesense2 systemd[1]: Stopped Typesense Server.
May  3 08:34:20 prod-typesense2 systemd[1]: typesense-server.service: Consumed 3min 32.561s CPU time.
May  3 08:34:20 prod-typesense2 systemd[1]: Started Typesense Server.
May  3 08:34:20 prod-typesense2 typesense-server[20378]: Log directory is configured as: /var/log/typesense
k
Not useful; it seems to have crashed while trying to print the actual stack trace. I will check the core dump. That should have all the details.
a
all good, thanks - we’ll leave this running for now as it seemed to recover ok after the crash 🙂
👍 1
k
You are on ARM right?
a
yep, we host in AWS on graviton
👍 1
k
Unfortunately the core dump file seems corrupted. Oh well.
a
oh damn 😕
k
I know it's late so it's not urgent, but would you be able to restore the earlier
typesense.tgz
file that was shared? I deleted it, but I'd like to revisit that data again.
a
oh, I might have deleted it locally .. let me see if I have it on the server still and will re-share
k
Google Drive says it's in your trash 🙂
a
hah, restored - can you see it now?
k
Works, thanks.
👌 1
An update: I've identified and am fixing a few race conditions by instrumenting the code with a memory profiler. I will ping you once we have a build that fixes those issues.
a
Awesome. Thanks for the update!
How are you getting on?
k
I have a patch that fixes the issues that showed up in the profiler. It's a large change, so it's going through some additional testing now.
a
Nice, thanks for the update - we’re happy to test it on a node once ready 🙂
k
I've just kicked off a
29.0.rc20
build which fixes the deadlock issues flagged by the profiler. Will be available in about 45 mins.
a
Thanks! We’ll deploy this to a node today and see how it behaves
We’ve updated a node and have it serving production traffic now - will let you know how we get on 🙂
We just had a crash with the new RC - it then looks to have crashed again on restart (but the second restart recovered ok). TS log:
Copy code
I20250507 23:44:50.438378 58573 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 33079635 last_index: 33079749
I20250507 23:44:50.438529 58573 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 33079750 last_index: 33079898
I20250507 23:44:50.438727 58573 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 33079899 last_index: 33080665
I20250507 23:44:50.439497 58573 log.cpp:1100] load closed segment, path: /var/lib/typesense/state/log first_index: 33080666 last_index: 33080875
I20250507 23:44:50.439740 58573 log.cpp:1112] load open segment, path: /var/lib/typesense/state/log first_index: 33080876
I20250507 23:44:50.440557 58621 raft_server.cpp:644] on_snapshot_load
I20250507 23:44:50.441053 58621 store.cpp:246] rm /var/lib/typesense/analytics/db success
I20250507 23:44:50.441179 58621 store.cpp:256] copy snapshot /var/lib/typesense/state/snapshot/snapshot_00000000000033078638/analytics_db_snapshot to /var/lib/typesense/analytics/db success
I20250507 23:44:50.441195 58621 store.cpp:40] Initializing DB by opening state dir: /var/lib/typesense/analytics/db
I20250507 23:44:50.448840 58621 store.cpp:270] DB open success!
I20250507 23:44:50.465725 58621 store.cpp:246] rm /var/lib/typesense/db success
I20250507 23:44:50.465922 58621 store.cpp:256] copy snapshot /var/lib/typesense/state/snapshot/snapshot_00000000000033078638/db_snapshot to /var/lib/typesense/db success
I20250507 23:44:50.465937 58621 store.cpp:40] Initializing DB by opening state dir: /var/lib/typesense/db
I20250507 23:44:50.483794 58621 store.cpp:270] DB open success!
I20250507 23:44:50.483817 58621 raft_server.cpp:598] Loading collections from disk...
I20250507 23:44:50.483824 58621 collection_manager.cpp:385] CollectionManager::load()
I20250507 23:44:50.484022 58621 auth_manager.cpp:35] Indexing 81 API key(s) found on disk.
I20250507 23:44:50.484400 58621 collection_manager.cpp:421] Loading upto 8 collections in parallel, 1000 documents at a time.
I20250507 23:44:50.485163 58621 collection_manager.cpp:430] Found 140 collection(s) on disk.
Syslog
Copy code
May  7 23:44:49 prod-typesense3 typesense-server[57199]: typesense-server: /root/.cache/bazel/_bazel_root/45fa99e3ef9c0d8b8591cceacfe70dd5/sandbox/linux-sandbox/370/execroot/__main__/external/elfutils/libdwfl/derelocate.c:178: cache_sections: Assertion `refs == NULL' failed.
May  7 23:44:50 prod-typesense3 systemd[1]: typesense-server.service: Main process exited, code=killed, status=11/SEGV
May  7 23:44:50 prod-typesense3 systemd[1]: typesense-server.service: Failed with result 'signal'.
May  7 23:44:50 prod-typesense3 systemd[1]: typesense-server.service: Consumed 2min 17.612s CPU time.
May  7 23:44:50 prod-typesense3 systemd[1]: typesense-server.service: Scheduled restart job, restart counter is at 1.
May  7 23:44:50 prod-typesense3 systemd[1]: Stopped Typesense Server.
May  7 23:44:50 prod-typesense3 systemd[1]: typesense-server.service: Consumed 2min 17.612s CPU time.
May  7 23:44:50 prod-typesense3 systemd[1]: Started Typesense Server.
May  7 23:44:50 prod-typesense3 typesense-server[58109]: Log directory is configured as: /var/log/typesense
May  7 23:45:18 prod-typesense3 typesense-server[58109]: E20250507 23:45:18.804023 58587 raft_server.cpp:767] 2658 lagging entries > healthy read lag of 1000
May  7 23:45:18 prod-typesense3 typesense-server[58109]: E20250507 23:45:18.804445 58587 raft_server.cpp:779] 2658 lagging entries > healthy write lag of 500
May  7 23:45:51 prod-typesense3 typesense-server[58109]: typesense-server: /root/.cache/bazel/_bazel_root/45fa99e3ef9c0d8b8591cceacfe70dd5/sandbox/linux-sandbox/370/execroot/__main__/external/elfutils/libdwfl/derelocate.c:178: cache_sections: Assertion `refs == NULL' failed.
May  7 23:45:52 prod-typesense3 systemd[1]: typesense-server.service: Main process exited, code=killed, status=11/SEGV
May  7 23:45:52 prod-typesense3 systemd[1]: typesense-server.service: Failed with result 'signal'.
May  7 23:45:52 prod-typesense3 systemd[1]: typesense-server.service: Consumed 1min 23.020s CPU time.
May  7 23:45:52 prod-typesense3 systemd[1]: typesense-server.service: Scheduled restart job, restart counter is at 2.
May  7 23:45:52 prod-typesense3 systemd[1]: Stopped Typesense Server.
May  7 23:45:52 prod-typesense3 systemd[1]: typesense-server.service: Consumed 1min 23.020s CPU time.
May  7 23:45:52 prod-typesense3 systemd[1]: Started Typesense Server.
May  7 23:45:52 prod-typesense3 typesense-server[58766]: Log directory is configured as: /var/log/typesense
I’ll revert this back to v28 - it looks like these crashes are quite regular:
Copy code
I20250508 00:19:49.879379 60495 raft_server.cpp:605] Finished loading collections from disk.
I20250508 00:22:14.601760 61220 raft_server.cpp:605] Finished loading collections from disk.
I20250508 00:29:32.103709 61823 raft_server.cpp:605] Finished loading collections from disk.
I20250508 00:34:36.238291 62509 raft_server.cpp:605] Finished loading collections from disk.
I20250508 00:54:33.855804 63270 raft_server.cpp:605] Finished loading collections from disk.
k
Crashed on restart
Did you happen to take a data directory copy here?
a
I didn’t unfortunately, but could re-install and see if we can replicate it. Do I stop TS if this happens and compress the data .. or let it start successfully and then compress?
k
Yes, the crucial thing is to have a data directory copy from when Typesense fails on restart, because that's when I can re-run and reproduce the issue. So the first crash during runtime is not useful. If it crashes on restart, then yes, please grab the data directory then.
Core dump is also useful btw.
Can you check if you have the core dump of the crash?
Oh but wait, I had trouble loading it. Sorry I forgot that.
a
ok, all good
k
At least that particular core dump.
a
I’ll re-install and see if we see the same crash on restart
👍 1
This is a different server so I’ll need to configure core dumps. Our leader node always seems to switch between what we call prod1 and prod2, so we’re using prod3 for this install; it never seems to become leader, so we feel a bit more comfortable about it.
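(Side note: a rough sketch of enabling core dumps for the systemd-managed service, assuming the stock unit file and systemd-coredump, which the earlier syslog output suggests is already in place. Exact steps may differ per distro.)
Copy code
# Allow the typesense-server unit to produce core dumps (drop-in override):
sudo systemctl edit typesense-server
#   add under [Service]:
#   LimitCORE=infinity
sudo systemctl restart typesense-server

# After a crash, list and export captured dumps:
coredumpctl list /usr/bin/typesense-server
coredumpctl dump /usr/bin/typesense-server -o /tmp/typesense.core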
So just to clarify: if we see a crash on startup after another crash, stop TS and then compress the data dir?
k
Correct.
This can be done within 1 hour of the second crash, because the default snapshot interval is set to 1 hour.
1
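(A minimal sketch of grabbing that copy, using the data dir path from the logs above; the archive name is arbitrary.)
Copy code
# After the crash-on-restart, stop the service and archive the data dir
# within the ~1 hour snapshot window mentioned above.
sudo systemctl stop typesense-server
sudo tar czf /tmp/typesense-data-$(date +%Y%m%d-%H%M).tgz -C /var/lib typesense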
a
Ok, we had a crash and this time it didn’t recover - stopping the process and will send the data over
Copy code
E20250508 04:33:42.934588 66918 raft_server.cpp:767] 1607 lagging entries > healthy read lag of 1000
E20250508 04:33:42.934661 66918 raft_server.cpp:779] 1607 lagging entries > healthy write lag of 500
I20250508 04:33:45.934913 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33102116, committed: 33102116, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
E20250508 04:33:51.935331 66918 raft_server.cpp:767] 1650 lagging entries > healthy read lag of 1000
E20250508 04:33:51.935388 66918 raft_server.cpp:779] 1650 lagging entries > healthy write lag of 500
I20250508 04:33:55.935705 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33102161, committed: 33102161, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
E20250508 04:34:00.936070 66918 raft_server.cpp:767] 1676 lagging entries > healthy read lag of 1000
E20250508 04:34:00.936142 66918 raft_server.cpp:779] 1676 lagging entries > healthy write lag of 500
I20250508 04:34:05.936522 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33102185, committed: 33102185, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
E20250508 04:34:09.936829 66918 raft_server.cpp:767] 1676 lagging entries > healthy read lag of 1000
E20250508 04:34:09.936900 66918 raft_server.cpp:779] 1676 lagging entries > healthy write lag of 500
k
and this time it didn’t recover
When you say it didn't recover, do you mean it's just stuck?
a
yep, unhealthy state - the lagging entries count is just growing
k
Ok, and this is on a restart?
a
I’m compressing the data now - yes, after the restart. Although it did look like it was healthy for a while, then got into this state.
I’ll get a better look at the logs shortly
👍 1
k
What I wish to know is whether, after the restart, it did catch up (i.e. recovered successfully) and then got stuck (because of live writes), or whether it never got healthy. Hopefully it's the latter.
a
Actually no, sorry - the flow was: install RC20; what I thought was a crash followed by indexes loading was actually just the startup of RC20. So this appears to have been running ok until it just started lagging, then never recovered.
Copy code
I20250508 04:09:55.427745 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33100812, committed: 33100811, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
I20250508 04:10:05.433208 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33100861, committed: 33100861, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
I20250508 04:10:14.331883 66919 batched_indexer.cpp:430] Running GC for aborted requests, req map size: 0
I20250508 04:10:15.438922 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33100882, committed: 33100882, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
I20250508 04:10:21.095005 66949 log.cpp:536] close a full segment. Current first_index: 33100094 last_index: 33100892 raft_sync_segments: 0 will_sync: 1 path: /var/lib/typesense/state/log/log_00000000000033100094_00000000000033100892
I20250508 04:10:21.095139 66949 log.cpp:550] Renamed `/var/lib/typesense/state/log/log_inprogress_00000000000033100094' to `/var/lib/typesense/state/log/log_00000000000033100094_00000000000033100892'
I20250508 04:10:21.095188 66949 log.cpp:114] Created new segment `/var/lib/typesense/state/log/log_inprogress_00000000000033100893' with fd=70
I20250508 04:10:25.446055 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33100910, committed: 33100910, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
I20250508 04:10:35.451594 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33100910, committed: 33100910, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
I20250508 04:10:45.457027 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33100929, committed: 33100929, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
I20250508 04:10:55.464278 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33100930, committed: 33100930, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
I20250508 04:11:05.469852 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33100930, committed: 33100930, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
I20250508 04:11:15.335372 66919 batched_indexer.cpp:430] Running GC for aborted requests, req map size: 0
I20250508 04:11:15.476284 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33100932, committed: 33100932, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
I20250508 04:11:25.482952 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33100933, committed: 33100933, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
I20250508 04:11:35.488318 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33100970, committed: 33100970, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
I20250508 04:11:45.493450 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33100970, committed: 33100970, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
I20250508 04:11:55.500308 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33101013, committed: 33101013, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
E20250508 04:11:57.500486 66918 raft_server.cpp:779] 504 lagging entries > healthy write lag of 500
I20250508 04:12:05.505687 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33101014, committed: 33101014, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
E20250508 04:12:06.505896 66918 raft_server.cpp:779] 506 lagging entries > healthy write lag of 500
I20250508 04:12:15.511322 66918 raft_server.cpp:692] Term: 85, pending_queue: 0, last_index: 33101015, committed: 33101015, known_applied: 33100508, applying: 33100509, pending_writes: 0, queued_writes: 0, local_sequence: 354476183
E20250508 04:12:15.511467 66918 raft_server.cpp:779] 506 lagging entries > healthy write lag of 500
I20250508 04:12:16.339378 66919 batched_indexer.cpp:430] Running GC for aborted requests, req map size: 0
E20250508 04:12:24.517084 66918 raft_server.cpp:779] 506 lagging entries > healthy write lag of 500
I202
I have the data dir from the last logs if you want it? Or I can restart and see if it crashes (in which case I still have the data as well)
k
Yes, let's focus on the crash-on-restart issue first, because that's the easiest to debug.
a
ok, I’ll restart and see what happens
k
But you can also share this data directory right away.
1
a
The server started up ok this time round
Shared the directory
I need to head off so will roll back to v28 for now and try v29RC again later when I can watch the process.
👍 1
k
There is one other thing you can try tomorrow. Enable this server-side flag:
Copy code
enable-search-logging
This will log search queries before they hit the processing code. That way, we can very quickly find out if a particular type of query is causing the crash. Maybe this time it's not a data indexing issue, in which case restoring snapshots in a debug environment might not help at all because they contain no search queries.
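(For reference, a sketch of turning that on via the config file path seen later in this thread; this assumes the ini file accepts the flag as a plain key = value line, with a restart to pick it up.)
Copy code
# Assumed: config lives at /etc/typesense/typesense-server.ini (as in the logs below)
# and the flag is set as a key = value line.
echo "enable-search-logging = true" | sudo tee -a /etc/typesense/typesense-server.ini
sudo systemctl restart typesense-server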
a
I’ve done this and gone through a couple of crashes. I’ll DM you the logs of the 20 searches logged before each crash
👍 1
k
I've tried running the queries but they work fine.
1. Are you using a GPU for the embedding?
2. Would it be possible for you to disable searches alone (by blocking the http(s) port) on the node with the RC build? This will allow us to find out if the issue is related to writes or whether searches are also involved.
3. Finally, is it possible for you guys to use Typesense Cloud for a few days so we are able to get better instrumentation and control over the logs and crashes? I have tried my best to pin this issue down, but since it's so elusive, I suspect I need an environment with full control where we can basically debug it live. We would run 2 nodes with the stable v28 GA and one node on the RC build, with better instrumentation and full logs to get to the bottom of it. I hope you will consider this proposal.
a
1. We’re not - these are just plain Graviton machines
2. We can upgrade the server and remove it from the target group on the load balancer - will that achieve the same thing?
3. Not easily, unfortunately - the way we are architected, this would be a pretty massive change for us (and not realistic for our production sites). Is there more telemetry I can install on the problem server that will help?
k
For 2: no, you have to block port 80 (or whatever HTTP port you have Typesense configured with) using iptables. For 1: which Graviton instance exactly? What's the runtime memory usage? Let's first focus on disabling HTTP traffic to isolate search traffic, which will give us more hints.
E.g. sudo iptables -A INPUT -p tcp --dport 80 -j DROP
To undo: sudo iptables -D INPUT -p tcp --dport 80 -j DROP
a
I think removing the instance from the load balancer will achieve the same thing. We listen on 8107 (traffic) and 8108 (replication), from memory, so removing it will stop all traffic to 8107.
k
Are you sure it's not 8107 for replication and 8108 for http? That's the default.
a
Our instances are r8g.large: 16 GB RAM, and we’re using about 4 GB.
k
Also, how long does it typically take the node to catch up after a restart?
a
Yeah, maybe that. So as long as we stop traffic to everything apart from the replication port, we’re ok?
k
Yes I think it should be fine.
a
Catch-up is 5-10 seconds.
k
But wait
a
It’s quick. Our data sets are small 😊
k
The leader also talks to the follower via the http port for some status checks.
a
Oh ok. So we want to block that as well?
k
What is your rough search concurrency, in terms of searches per second?
The
iptables
approach is the simplest. Literally a one-line command to enable/disable.
a
I can work this out but I’d estimate 50-60/sec based on the load balancer graphs.
k
Ok, an estimate is fine
And the crash happens pretty fast?
a
Normally we crash within 20 mins, then it’s regular - every 10 mins or so.
k
Did you also check that you're not somehow running out of memory?
a
If there is some debug telemetry I can run, we’re happy to do this
k
Let's try the HTTP traffic blocking first.
a
I’ll check for memory leaks next time we update to the RC. Can do tonight
But we have a lot of headroom
k
Yes. So, for the next round (see the sketch below):
1. Restart and then block the HTTP port
2. Monitor memory and wait for a crash
This will clearly tell us whether the problem is in the search path or the indexing path.
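(A sketch of that round, reusing the iptables commands from above and assuming the API/search port is 8108 per the defaults discussed earlier; the memory check is just a generic ps loop.)
Copy code
# 1. Restart, then block search/API traffic (peering port 8107 stays open)
sudo systemctl restart typesense-server
sudo iptables -A INPUT -p tcp --dport 8108 -j DROP

# 2. Watch memory (RSS) while waiting for a crash
watch -n 60 'ps -eo pid,rss,etime,args | grep [t]ypesense-server'

# Undo the block afterwards
sudo iptables -D INPUT -p tcp --dport 8108 -j DROP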
Based on this, I can produce a debug build for you with some traces. But I think it will be super helpful to debug on x86 rather than ARM.
a
Can I mix servers? It’s not hard for us to swap in an x86 machine as this server for testing.
k
Yes, as long as the node we are testing with the RC build is x86, that's all we need. We also use both ARM and x86 ourselves and often have mixed clusters.
a
Ok, cool. I’ll switch this one to x86, enable core dumps, and try as above.
k
Thank you for the help!
❤️ 1
a
I still haven’t had a chance to set this server up; I'll try to get to it at some stage today with the new RC.
k
Meanwhile, I found another issue related to group_by which we identified with another customer. Can you please try 29.0.rc21?
a
Yes, I can try this shortly.
Unrelated (or maybe related): we just had one of our v28 servers crash - no stack trace, unfortunately.
Copy code
I20250514 02:25:34.280957  5178 raft_server.h:60] Peer refresh succeeded!
Log file created at: 2025/05/14 02:25:42
Running on machine: prod-typesense1
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
I20250514 02:25:42.691740 28480 typesense_server_utils.cpp:348] Starting Typesense 28.0
I20250514 02:25:42.691789 28480 typesense_server_utils.cpp:351] Typesense is using jemalloc.
I20250514 02:25:42.691942 28480 typesense_server_utils.cpp:413] Thread pool size: 128
I20250514 02:25:42.699847 28480 store.cpp:40] Initializing DB by opening state dir: /var/lib/typesense/db
I20250514 02:25:42.732021 28480 store.cpp:40] Initializing DB by opening state dir: /var/lib/typesense/meta
I20250514 02:25:42.752151 28480 store.cpp:40] Initializing DB by opening state dir: /var/lib/typesense/analytics/db
I20250514 02:25:42.765798 28480 ratelimit_manager.cpp:546] Loaded 0 rate limit rules.
I20250514 02:25:42.765830 28480 ratelimit_manager.cpp:547] Loaded 0 rate limit bans.
I20250514 02:25:42.766219 28480 typesense_server_utils.cpp:578] Starting API service...
I20250514 02:25:42.766376 28480 http_server.cpp:180] Typesense has started listening on port 8108
I20250514 02:25:42.767145 28943 typesense_server_utils.cpp:521] Conversation garbage collector thread started.
I’ll install RC21 on this server now and we’ll test it
We had a crash with this version that stored a core dump. I’m a bit out of my depth with these, but have tried to ChatGPT it. Let me know if there is anything you want me to try.
Copy code
warning: Error reading shared library list entry at 0x253c00746473
Failed to read a valid object file image from memory.
--Type <RET> for more, q to quit, c to continue without paging--
Core was generated by `/usr/bin/typesense-server --config=/etc/typesense/typesense-server.ini'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  CountMinSketch::~CountMinSketch (this=0xfb1f1de64240, __in_chrg=<optimized out>) at include/count_min_sketch.h:48
48	include/count_min_sketch.h: No such file or directory.
[Current thread is 1 (LWP 30151)]
(gdb) bt full
#0  CountMinSketch::~CountMinSketch (this=0xfb1f1de64240, __in_chrg=<optimized out>) at include/count_min_sketch.h:48
        i = 0
        i = <optimized out>
        i = <optimized out>
Backtrace stopped: Cannot access memory at address 0xfb1fd46878f8
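(For anyone following along, a rough sketch of pulling a full multi-threaded backtrace out of such a dump. The binary path comes from the syslog above; CORE is a placeholder for whatever filename systemd-coredump produced, and the zstd step only applies if the dump was stored compressed, as shown further down.)
Copy code
# If the dump is zstd-compressed, decompress it first (replace CORE with the real name)
zstd -d CORE.zst

# Open it against the matching binary and capture every thread's stack
gdb -q /usr/bin/typesense-server CORE \
    -ex 'set pagination off' \
    -ex 'thread apply all bt full' \
    -ex quit > backtrace.txt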
We just had another crash; it definitely feels less frequent than before, and it's now recovering itself pretty quickly. I’m unsure if we should keep this version running or roll back to v28.
k
Would you be able to share the core dump? It's on x86 right?
a
It’s not yet, but I can change this server out and keep this version running.
k
Yes, let's do that. x86 is easier for me to debug locally.
Also please enable search logging again.
Also, doing that iptables exercise is crucial to find out whether it's search- or indexing-related.
a
Is it simpler to replace a server or add a 4th into the cluster?
k
Both will require updating the IP address file on the existing nodes:
--nodes
Adding a new one is probably easier
a
I think I can force an AWS machine to use a specific IP, so it might be simpler to shut down a node, force the IP to be the same, and start it up with the TS config from the shut-down node. Will try that first.
👍 1
Ok, I have a 4th server in now (same OS, amd64), with search logging and core dumps enabled. Will see how this goes, and then we can block ports if it crashes 🙂
k
👍 awesome
a
This version crashed, I’ll share the dump shortly
Copy code
total 457180
-rw-r----- 1 root root 468140190 May 14 05:23 core.typesense-serve.0.642bf0ce7ce942c39be5e9d0a416132b.4856.1747200173000000.zst
👍 1
Done!
k
Query logs as well?
a
how many before the crash do you need?
k
The last 1 minute, to be safe. I plan to run all the queries through under concurrency this time to check, assuming it's a search issue.
Will be starting on this investigation in an hour or so.
a
Thanks, I’ll just compress the entire log - it’s small. Interestingly, the query before the crash is from a site where we use group_by.
crash is around line 14781
Log on its way
I’ll leave this server running - less nervous as the other 3 are stable 🙂
k
Finally managed to reproduce the crash locally!
🙌 2
a
That’s awesome news!
k
Should be able to provide a fixed build by tomorrow.
a
Cool, I’ll leave this server running but drop it out of the load balancer so it stops getting traffic overnight. We can install an updated build and test production traffic again on it tomorrow
🙌 1
k
29.0.rc22
published with the patch.
a
Thanks! This is installed and back in production - will let you know how it goes 🙂
So far it’s looking good - 1.5 hours into the new release and no crash 🎉
2.5 hours - still stable wahoo
6.5 hours now, no crash - I’d consider this fixed 😄
k
Cool, thank you for all the help. It was a super edge-case bug: it requires a group_by that involves a record with an empty value in one of its array values.
a
No problem at all, thanks for persisting 🙂
I’ll remove the 4th server from our cluster tonight if we see no issues. Is this the preferred way to do that?
1. Shut down TS on the 4th server
2. Update nodes.ini on the 2 remaining followers, restart them, and wait for them to be ready
3. Update nodes.ini on the leader, restart, and wait
Hoping this will be the simplest approach?
k
You don't have to restart the nodes. You can just shut down the 4th node and then remove it from the nodes.ini file.
a
Oh really, that’s cool
same for adding a node, just add to nodes.ini?
k
Yes
🙌 1
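(For reference, a rough sketch of what that edit looks like, assuming the nodes file uses the comma-separated host:peering_port:api_port format; the 4th hostname is hypothetical, and the first three just mirror the machine names seen in the logs above.)
Copy code
# Before (temporary 4-node cluster; the 4th hostname is made up):
prod-typesense1:8107:8108,prod-typesense2:8107:8108,prod-typesense3:8107:8108,prod-typesense4:8107:8108

# After shutting down the 4th node, just drop it from the list on each remaining node
# (no restart needed, per the note above):
prod-typesense1:8107:8108,prod-typesense2:8107:8108,prod-typesense3:8107:8108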