#community-help

Typesense Node Stuck in Segfault Loop After Stress Test

TLDR Adrian hit a segfault crash loop while stress testing a three-node Typesense cluster on Kubernetes. Kishore Nallan recommended a newer RC build and suspected DNS hostname resolution in the nodes list; putting IP addresses directly in each node's node list made the cluster stable.

Apr 21, 2023 (8 months ago)
Adrian
03:02 PM
I was running a stress test of Typesense in a dev environment: under constant load, I killed a random node while leaving the other two untouched (thus maintaining quorum). The expected result was that the killed node would restart and search requests would see no disruption, since the other two nodes could serve them. However, after I killed the first node, that node and a second node got stuck in a crash loop in which they segfault on restart. Any ideas what the cause or remedy could be?
Adrian
03:03 PM
log for an example of a crashing node
+ kubectl logs -p typesense-2
I20230421 14:57:12.762141     1 typesense_server_utils.cpp:325] Starting Typesense 0.25.0.rc18
I20230421 14:57:12.762185     1 typesense_server_utils.cpp:328] Typesense is using jemalloc.
I20230421 14:57:12.762501     1 typesense_server_utils.cpp:377] Thread pool size: 64
I20230421 14:57:12.771956     1 store.h:64] Initializing DB by opening state dir: /usr/share/typesense/data/db
I20230421 14:57:12.787794     1 store.h:64] Initializing DB by opening state dir: /usr/share/typesense/data/meta
I20230421 14:57:12.800720     1 ratelimit_manager.cpp:546] Loaded 0 rate limit rules.
I20230421 14:57:12.800745     1 ratelimit_manager.cpp:547] Loaded 0 rate limit bans.
I20230421 14:57:12.800810     1 typesense_server_utils.cpp:479] Starting API service...
I20230421 14:57:12.800968   261 batched_indexer.cpp:124] Starting batch indexer with 64 threads.
I0421 14:57:12.801124     1 src/http_server.cpp:178] Typesense has started listening on port 8108
F0421 14:57:12.801298   260 external/com_github_brpc_brpc/src/butil/at_exit.cc:46] Check failed: false. Tried to RegisterCallback without an AtExitManager
F0421 14:57:12.801310   260 external/com_github_brpc_brpc/src/butil/at_exit.cc:46] Check failed: false. Tried to RegisterCallback without an AtExitManager
I20230421 14:57:12.808718   261 batched_indexer.cpp:129] BatchedIndexer skip_index: -9999
I0421 14:57:12.880679   260 external/com_github_brpc_brpc/src/brpc/server.cpp:1107] Server[braft::RaftStatImpl+braft::FileServiceImpl+braft::RaftServiceImpl+braft::CliServiceImpl] is serving on port=8107.
I0421 14:57:12.880948   260 external/com_github_brpc_brpc/src/brpc/server.cpp:1110] Check out  in web browser.
I20230421 14:57:12.883850   260 raft_server.cpp:67] Nodes configuration: 192.168.130.10:8107:8108,192.168.158.21:8107:8108,192.168.138.48:8107:8108
I0421 14:57:12.884620   260 external/com_github_brpc_braft/src/braft/log.cpp:690] Use murmurhash32 as the checksum type of appending entries
I0421 14:57:12.884728   260 external/com_github_brpc_braft/src/braft/log.cpp:1172] log load_meta /usr/share/typesense/data/state/log/log_meta first_log_index: 1994 time: 52
I0421 14:57:12.884765   260 external/com_github_brpc_braft/src/braft/log.cpp:1014] restore closed segment, path: /usr/share/typesense/data/state/log first_index: 2381 last_index: 2409
I0421 14:57:12.884779   260 external/com_github_brpc_braft/src/braft/log.cpp:1014] restore closed segment, path: /usr/share/typesense/data/state/log first_index: 2557 last_index: 2587
I0421 14:57:12.884786   260 external/com_github_brpc_braft/src/braft/log.cpp:1014] restore closed segment, path: /usr/share/typesense/data/state/log first_index: 2523 last_index: 2556
I0421 14:57:12.884793   260 external/com_github_brpc_braft/src/braft/log.cpp:1014] restore closed segment, path: /usr/share/typesense/data/state/log first_index: 2190 last_index: 2220
/// cut for brevity
I0421 14:57:12.886557   260 external/com_github_brpc_braft/src/braft/log.cpp:1100] load closed segment, path: /usr/share/typesense/data/state/log first_index: 2465 last_index: 2492
I0421 14:57:12.886644   260 external/com_github_brpc_braft/src/braft/log.cpp:1100] load closed segment, path: /usr/share/typesense/data/state/log first_index: 2493 last_index: 2522
I0421 14:57:12.886728   260 external/com_github_brpc_braft/src/braft/log.cpp:1100] load closed segment, path: /usr/share/typesense/data/state/log first_index: 2523 last_index: 2556
I0421 14:57:12.886834   260 external/com_github_brpc_braft/src/braft/log.cpp:1100] load closed segment, path: /usr/share/typesense/data/state/log first_index: 2557 last_index: 2587
I0421 14:57:12.886915   260 external/com_github_brpc_braft/src/braft/log.cpp:1100] load closed segment, path: /usr/share/typesense/data/state/log first_index: 2588 last_index: 2618
I0421 14:57:12.887000   260 external/com_github_brpc_braft/src/braft/log.cpp:1112] load open segment, path: /usr/share/typesense/data/state/log first_index: 2619
I20230421 14:57:12.887864   306 raft_server.cpp:516] on_snapshot_load
I20230421 14:57:12.890551   306 store.h:299] rm /usr/share/typesense/data/db success
I20230421 14:57:12.890849   306 store.h:309] copy snapshot /usr/share/typesense/data/state/snapshot/snapshot_00000000000000002644/db_snapshot to /usr/share/typesense/data/db success
I20230421 14:57:12.890897   306 store.h:64] Initializing DB by opening state dir: /usr/share/typesense/data/db
I20230421 14:57:12.905711   306 store.h:323] DB open success!
I20230421 14:57:12.905730   306 raft_server.cpp:495] Loading collections from disk...
I0421 14:57:12.905774   306 src/collection_manager.cpp:172] CollectionManager::load()
I20230421 14:57:12.906186   306 auth_manager.cpp:34] Indexing 2 API key(s) found on disk.
I0421 14:57:12.906258   306 src/collection_manager.cpp:192] Loading upto 32 collections in parallel, 1000 documents at a time.
I0421 14:57:12.906284   306 src/collection_manager.cpp:201] Found 1 collection(s) on disk.
I0421 14:57:12.907660   351 src/collection_manager.cpp:122] Found collection testing_collection with 4 memory shards.
I0421 14:57:12.907830   351 src/collection_manager.cpp:1240] Loading collection testing_collection
I0421 14:57:39.899574   351 src/collection_manager.cpp:1357] Indexed 526482/526482 documents into collection testing_collection
I0421 14:57:39.899606   351 src/collection_manager.cpp:240] Loaded 1 collection(s) so far
I0421 14:57:39.900508   306 src/collection_manager.cpp:290] Loaded 1 collection(s).
I0421 14:57:39.901554   306 src/collection_manager.cpp:294] Initializing batched indexer from snapshot state...
I20230421 14:57:39.901597   306 batched_indexer.cpp:446] Restored 0 in-flight requests from snapshot.
I20230421 14:57:39.901636   306 raft_server.cpp:502] Finished loading collections from disk.
I20230421 14:57:39.901664   306 raft_server.h:278] Configuration of this group is 192.168.158.21:8107:8108,192.168.138.48:8107:8108,192.168.149.118:8107:8108
I0421 14:57:39.901756   306 external/com_github_brpc_braft/src/braft/snapshot_executor.cpp:264] node default_group:192.168.138.48:8107:8108 snapshot_load_done, last_included_index: 2644 last_included_term: 17 peers: "192.168.158.21:8107:8108" peers: "192.168.138.48:8107:8108" peers: "192.168.149.118:8107:8108"
I0421 14:57:39.901948   260 external/com_github_brpc_braft/src/braft/raft_meta.cpp:521] Loaded single stable meta, path /usr/share/typesense/data/state/meta term 25 votedfor 192.168.158.21:8107:8108 time: 45
I0421 14:57:39.901981   260 external/com_github_brpc_braft/src/braft/node.cpp:608] node default_group:192.168.138.48:8107:8108 init, term: 25 last_log_id: (index=2648,term=18) conf: 192.168.158.21:8107:8108,192.168.138.48:8107:8108,192.168.149.118:8107:8108 old_conf:
I20230421 14:57:39.902024   260 raft_server.cpp:133] Node last_index: 2648
I20230421 14:57:39.902048   260 typesense_server_utils.cpp:274] Typesense peering service is running on 192.168.138.48:8107
I20230421 14:57:39.902061   260 typesense_server_utils.cpp:275] Snapshot interval configured as: 3600s
I20230421 14:57:39.902067   260 typesense_server_utils.cpp:276] Snapshot max byte count configured as: 4194304
W0421 14:57:39.902084   260 external/com_github_brpc_brpc/src/brpc/controller.cpp:1487] SIGINT was installed with 1
I20230421 14:57:39.903666   260 raft_server.cpp:551] Term: 25, last_index index: 2648, committed_index: 0, known_applied_index: 2644, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 2946374
W20230421 14:57:39.903687   260 raft_server.cpp:578] Multi-node with no leader: refusing to reset peers.
I0421 14:57:43.626524   298 external/com_github_brpc_braft/src/braft/node.cpp:2120] node default_group:192.168.138.48:8107:8108 received PreVote from 192.168.158.21:8107:8108 in term 27 current_term 25 granted 1 rejected_by_lease 0
I0421 14:57:43.629311   300 external/com_github_brpc_braft/src/braft/node.cpp:2202] node default_group:192.168.138.48:8107:8108 received RequestVote from 192.168.158.21:8107:8108 in term 27 current_term 25 log_is_ok 1 votable_time 0
I0421 14:57:43.631441   300 external/com_github_brpc_braft/src/braft/raft_meta.cpp:546] Saved single stable meta, path /usr/share/typesense/data/state/meta term 27 votedfor 0.0.0.0:0:0 time: 2037
I0421 14:57:43.633334   300 external/com_github_brpc_braft/src/braft/raft_meta.cpp:546] Saved single stable meta, path /usr/share/typesense/data/state/meta term 27 votedfor 192.168.158.21:8107:8108 time: 1841
I20230421 14:57:44.194659   304 raft_server.h:283] Node starts following { leader_id=192.168.158.21:8107:8108, term=27, status=Follower receives message from new leader with the same term.}
I20230421 14:57:44.194927   304 raft_server.h:278] Configuration of this group is 192.168.158.21:8107:8108,192.168.138.48:8107:8108,192.168.149.118:8107:8108
F0421 14:57:44.317002   298 external/com_github_brpc_braft/src/braft/node.cpp:2515] Check failed: entry.type() != ENTRY_TYPE_CONFIGURATION (3 vs 3).
#0 0x0000016c15d2 logging::DestroyLogStream()
#1 0x0000016bf74f logging::LogMessage::~LogMessage()
#2 0x000001330b4c braft::NodeImpl::handle_append_entries_request()
#3 0x000001381623 braft::RaftServiceImpl::append_entries()
#4 0x0000013dc6d0 braft::RaftService::CallMethod()
#5 0x00000154af72 brpc::policy::ProcessRpcRequest()
#6 0x00000155afda brpc::ProcessInputMessage()
#7 0x00000155b0ab brpc::InputMessenger::InputMessageClosure::~InputMessageClosure()
#8 0x00000155c031 brpc::InputMessenger::OnNewMessages()
#9 0x00000143078d brpc::Socket::ProcessEvent()
#10 0x00000162dd76 bthread::TaskGroup::task_runner()
#11 0x000001653731 bthread_make_fcontext

E0421 14:57:44.317345   298 include/backward.hpp:4199] Stack trace (most recent call last) in thread 298:
E0421 14:57:44.317362   298 include/backward.hpp:4199] #13   Object "/opt/typesense-server", at 0x1653730, in
E0421 14:57:44.317366   298 include/backward.hpp:4199] #12   Object "/opt/typesense-server", at 0x162dd75, in
E0421 14:57:44.317369   298 include/backward.hpp:4199] #11   Object "/opt/typesense-server", at 0x143078c, in
E0421 14:57:44.317372   298 include/backward.hpp:4199] #10   Object "/opt/typesense-server", at 0x155c030, in
E0421 14:57:44.317375   298 include/backward.hpp:4199] #9    Object "/opt/typesense-server", at 0x155b0aa, in
E0421 14:57:44.317379   298 include/backward.hpp:4199] #8    Object "/opt/typesense-server", at 0x155afd9, in
E0421 14:57:44.317382   298 include/backward.hpp:4199] #7    Object "/opt/typesense-server", at 0x154af71, in
E0421 14:57:44.317386   298 include/backward.hpp:4199] #6    Object "/opt/typesense-server", at 0x13dc6cf, in
E0421 14:57:44.317389   298 include/backward.hpp:4199] #5    Object "/opt/typesense-server", at 0x1381622, in
E0421 14:57:44.317393   298 include/backward.hpp:4199] #4    Object "/opt/typesense-server", at 0x1330c5c, in
E0421 14:57:44.317396   298 include/backward.hpp:4199] #3    Object "/opt/typesense-server", at 0x139f143, in
E0421 14:57:44.317400   298 include/backward.hpp:4199] #2    Object "/opt/typesense-server", at 0x13a2634, in
E0421 14:57:44.317405   298 include/backward.hpp:4199] #1    Object "/opt/typesense-server", at 0x13a23aa, in
E0421 14:57:44.317409   298 include/backward.hpp:4199] #0    Object "/opt/typesense-server", at 0x1308bc4, in
Segmentation fault (Address not mapped to object [0x8])
E0421 14:57:44.317475   298 src/main/typesense_server.cpp:107] Typesense 0.25.0.rc18 is terminating abruptly.
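
One thing that can be checked in a log like this is whether the peer set resolved from the nodes file (the "Nodes configuration" line) matches the peer set restored from the Raft snapshot (the "Configuration of this group" line). The sketch below is a minimal Python comparison using the two values copied from the log above. `parse_peers` is a hypothetical helper, not part of Typesense, and whether such a mismatch is what triggers the crash is not established in this thread.

```python
def parse_peers(config: str) -> set[str]:
    """Split a comma-separated 'ip:peering_port:api_port' list into a set."""
    return {entry.strip() for entry in config.split(",") if entry.strip()}


# "Nodes configuration" line from the log above (resolved from the nodes file).
nodes_file = parse_peers(
    "192.168.130.10:8107:8108,192.168.158.21:8107:8108,192.168.138.48:8107:8108"
)
# "Configuration of this group" line from the log above (restored from the snapshot).
raft_group = parse_peers(
    "192.168.158.21:8107:8108,192.168.138.48:8107:8108,192.168.149.118:8107:8108"
)

only_in_nodes_file = nodes_file - raft_group
only_in_raft_group = raft_group - nodes_file

print("in nodes file but not in Raft group:", sorted(only_in_nodes_file))
print("in Raft group but not in nodes file:", sorted(only_in_raft_group))
```

In this log the two sets differ by one address each, which is consistent with a pod having come back under a new IP.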
Kishore Nallan
03:05 PM
Can you use a recent RC? The earlier 0.25 RC builds had source file / line numbers disabled in the stack traces.
Kishore Nallan
03:05 PM
0.25.0.rc24 is latest
Kishore Nallan
03:07 PM
How are you generating IPs in the nodes file?
Adrian
03:33 PM
yeah sure will try the new RC
Adrian
03:35 PM
apiVersion: v1
kind: ConfigMap
metadata:
  name: nodeslist
  namespace: typesense
data:
  nodes: "typesense-0.ts.typesense.svc.cluster.local:8107:8108,typesense-1.ts.typesense.svc.cluster.local:8107:8108,typesense-2.ts.typesense.svc.cluster.local:8107:8108"

using hostname resolution in k8s
Adrian
03:36 PM
I will note I have been able to kill nodes before and have the cluster recover, but those tests were before any documents were indexed, and while the node was not under load
Kishore Nallan
03:49 PM
I'm not very sure how this DNS resolution of IPs happens and what implications it has for clustering. In our own clustering stress tests we always use actual IPs, so maybe there is something there.
Kishore Nallan
03:50 PM
The Raft framework we use for clustering recommends using IPs, but so many people wanted DNS support that I added an additional resolution step.
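
To illustrate what a hostname-to-IP resolution step over a nodes string could look like, here is a minimal Python sketch. It is not Typesense's implementation; `resolve_nodes` is a hypothetical helper, and the `host:peering_port:api_port` entry format is taken from the ConfigMap and logs in this thread.

```python
import socket


def resolve_nodes(nodes: str) -> str:
    """Replace each host in 'host:peering_port:api_port,...' with an IPv4 address."""
    resolved = []
    for entry in nodes.split(","):
        # Split from the right so only the two port fields are peeled off.
        host, peering_port, api_port = entry.strip().rsplit(":", 2)
        # Raises socket.gaierror when the name cannot be resolved.
        ip = socket.gethostbyname(host)
        resolved.append(f"{ip}:{peering_port}:{api_port}")
    return ",".join(resolved)


# An IP literal passes through unchanged; a hostname would trigger a DNS lookup.
print(resolve_nodes("127.0.0.1:8107:8108"))
```

A lookup like this reflects DNS at the moment it runs, so a pod that restarts with a new IP can be seen under a different address than the one its peers recorded earlier.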
Adrian
04:09 PM
Gotcha. I tried to use stable IP addresses but ran into a blocker: Typesense requires that a node's own peering address appear in the node list. What would solve this is if Typesense could listen on one IP address X, but know that IP address Y in the node list corresponds to a proxy to itself. Is that feasible to add? I think it may require just one extra optional config parameter and a small code change. https://typesense-community.slack.com/archives/C01P749MET0/p1681225769930529
Adrian
07:32 PM
here is the log when using the latest image
+ kubectl logs -p typesense-2
I20230421 19:27:34.667822     1 typesense_server_utils.cpp:325] Starting Typesense 0.25.0.rc24
I20230421 19:27:34.667876     1 typesense_server_utils.cpp:328] Typesense is using jemalloc.
I20230421 19:27:34.682269     1 typesense_server_utils.cpp:377] Thread pool size: 64
I20230421 19:27:34.691926     1 store.h:64] Initializing DB by opening state dir: /usr/share/typesense/data/db
I20230421 19:27:34.719142     1 store.h:64] Initializing DB by opening state dir: /usr/share/typesense/data/meta
I20230421 19:27:34.783881     1 ratelimit_manager.cpp:546] Loaded 0 rate limit rules.
I20230421 19:27:34.783929     1 ratelimit_manager.cpp:547] Loaded 0 rate limit bans.
I20230421 19:27:34.784001     1 typesense_server_utils.cpp:479] Starting API service...
I20230421 19:27:34.784184   262 batched_indexer.cpp:124] Starting batch indexer with 64 threads.
I0421 19:27:34.784339     1 src/http_server.cpp:178] Typesense has started listening on port 8108
F0421 19:27:34.784630   261 external/com_github_brpc_brpc/src/butil/at_exit.cc:46] Check failed: false. Tried to RegisterCallback without an AtExitManager
F0421 19:27:34.784657   261 external/com_github_brpc_brpc/src/butil/at_exit.cc:46] Check failed: false. Tried to RegisterCallback without an AtExitManager
I20230421 19:27:34.803128   262 batched_indexer.cpp:129] BatchedIndexer skip_index: -9999
I0421 19:27:34.806750   261 external/com_github_brpc_brpc/src/brpc/server.cpp:1107] Server[braft::RaftStatImpl+braft::FileServiceImpl+braft::RaftServiceImpl+braft::CliServiceImpl] is serving on port=8107.
I0421 19:27:34.881108   261 external/com_github_brpc_brpc/src/brpc/server.cpp:1110] Check out  in web browser.
E20230421 19:27:34.899058   261 raft_server.cpp:173] Unable to resolve host: typesense-0.ts.typesense.svc.cluster.local
E0421 19:27:34.900394   261 external/com_github_brpc_braft/src/braft/configuration.cpp:43] Fail to parse typesense-0.ts.typesense.svc.cluster.local:8107:8108
E20230421 19:27:34.900417   261 raft_server.cpp:53] Failed to parse nodes configuration: `typesense-0.ts.typesense.svc.cluster.local:8107:8108,typesense-1.ts.typesense.svc.cluster.local:8107:8108,typesense-2.ts.typesense.svc.cluster.local:8107:8108` --  will retry shortly...
I20230421 19:28:04.911280   261 raft_server.cpp:67] Nodes configuration: 192.168.140.6:8107:8108,192.168.150.17:8107:8108,192.168.152.182:8107:8108
I0421 19:28:04.911982   261 external/com_github_brpc_braft/src/braft/log.cpp:690] Use murmurhash32 as the checksum type of appending entries
I0421 19:28:04.912066   261 external/com_github_brpc_braft/src/braft/log.cpp:1172] log load_meta /usr/share/typesense/data/state/log/log_meta first_log_index: 12 time: 63
I0421 19:28:04.912105   261 external/com_github_brpc_braft/src/braft/log.cpp:1014] restore closed segment, path: /usr/share/typesense/data/state/log first_index: 251 last_index: 288

... cut for brevity

I20230421 19:28:04.959708   294 store.h:299] rm /usr/share/typesense/data/db success
I20230421 19:28:04.960063   294 store.h:309] copy snapshot /usr/share/typesense/data/state/snapshot/snapshot_00000000000000000592/db_snapshot to /usr/share/typesense/data/db success
I20230421 19:28:04.960111   294 store.h:64] Initializing DB by opening state dir: /usr/share/typesense/data/db
I20230421 19:28:05.015445   294 store.h:323] DB open success!
I20230421 19:28:05.015476   294 raft_server.cpp:495] Loading collections from disk...
I0421 19:28:05.015510   294 src/collection_manager.cpp:172] CollectionManager::load()
I20230421 19:28:05.015738   294 auth_manager.cpp:34] Indexing 1 API key(s) found on disk.
I0421 19:28:05.015847   294 src/collection_manager.cpp:192] Loading upto 32 collections in parallel, 1000 documents at a time.
I0421 19:28:05.015873   294 src/collection_manager.cpp:201] Found 1 collection(s) on disk.
I0421 19:28:05.017504   352 src/collection_manager.cpp:122] Found collection testing_collection with 4 memory shards.
I0421 19:28:05.017741   352 src/collection_manager.cpp:1260] Loading collection testing_collection
I20230421 19:28:35.813575   262 batched_indexer.cpp:284] Running GC for aborted requests, req map size: 0
I0421 19:28:36.802410   352 src/collection_manager.cpp:1366] Loaded 475136 documents from testing_collection so far.
I0421 19:28:40.400891   352 src/collection_manager.cpp:1377] Indexed 524641/524641 documents into collection testing_collection
I0421 19:28:40.400958   352 src/collection_manager.cpp:240] Loaded 1 collection(s) so far
I0421 19:28:40.401614   294 src/collection_manager.cpp:290] Loaded 1 collection(s).
I0421 19:28:40.402518   294 src/collection_manager.cpp:294] Initializing batched indexer from snapshot state...
I20230421 19:28:40.402577   294 batched_indexer.cpp:446] Restored 0 in-flight requests from snapshot.
I20230421 19:28:40.402616   294 raft_server.cpp:502] Finished loading collections from disk.
I20230421 19:28:40.402642   294 raft_server.h:278] Configuration of this group is 192.168.150.9:8107:8108,192.168.138.55:8107:8108,192.168.152.182:8107:8108
I0421 19:28:40.402746   294 external/com_github_brpc_braft/src/braft/snapshot_executor.cpp:264] node default_group:192.168.152.182:8107:8108 snapshot_load_done, last_included_index: 592 last_included_term: 6 peers: "192.168.150.9:8107:8108" peers: "192.168.138.55:8107:8108" peers: "192.168.152.182:8107:8108"
I0421 19:28:40.403036   261 external/com_github_brpc_braft/src/braft/raft_meta.cpp:521] Loaded single stable meta, path /usr/share/typesense/data/state/meta term 8 votedfor 192.168.150.17:8107:8108 time: 55
I0421 19:28:40.403073   261 external/com_github_brpc_braft/src/braft/node.cpp:608] node default_group:192.168.152.182:8107:8108 init, term: 8 last_log_id: (index=596,term=8) conf: 192.168.150.17:8107:8108,192.168.138.55:8107:8108,192.168.152.182:8107:8108 old_conf:
I20230421 19:28:40.403128   261 raft_server.cpp:133] Node last_index: 596
I20230421 19:28:40.403158   261 typesense_server_utils.cpp:274] Typesense peering service is running on 192.168.152.182:8107
I20230421 19:28:40.403168   261 typesense_server_utils.cpp:275] Snapshot interval configured as: 3600s
I20230421 19:28:40.403178   261 typesense_server_utils.cpp:276] Snapshot max byte count configured as: 4194304
W0421 19:28:40.403193   261 external/com_github_brpc_brpc/src/brpc/controller.cpp:1487] SIGINT was installed with 1
I20230421 19:28:40.407452   261 raft_server.cpp:551] Term: 8, last_index index: 596, committed_index: 0, known_applied_index: 592, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 1574571
W20230421 19:28:40.407478   261 raft_server.cpp:578] Multi-node with no leader: refusing to reset peers.
I0421 19:28:43.098174   288 external/com_github_brpc_braft/src/braft/node.cpp:2120] node default_group:192.168.152.182:8107:8108 received PreVote from 192.168.150.17:8107:8108 in term 10 current_term 8 granted 1 rejected_by_lease 0
I0421 19:28:43.099168   292 external/com_github_brpc_braft/src/braft/node.cpp:2202] node default_group:192.168.152.182:8107:8108 received RequestVote from 192.168.150.17:8107:8108 in term 10 current_term 8 log_is_ok 1 votable_time 0
I0421 19:28:43.102253   292 external/com_github_brpc_braft/src/braft/raft_meta.cpp:546] Saved single stable meta, path /usr/share/typesense/data/state/meta term 10 votedfor 0.0.0.0:0:0 time: 2995
I0421 19:28:43.104696   292 external/com_github_brpc_braft/src/braft/raft_meta.cpp:546] Saved single stable meta, path /usr/share/typesense/data/state/meta term 10 votedfor 192.168.150.17:8107:8108 time: 2385
I20230421 19:28:43.592686   290 raft_server.h:283] Node starts following { leader_id=192.168.150.17:8107:8108, term=10, status=Follower receives message from new leader with the same term.}
I20230421 19:28:43.592934   290 raft_server.h:278] Configuration of this group is 192.168.150.16:8107:8108,192.168.138.55:8107:8108,192.168.152.182:8107:8108
I20230421 19:28:43.592998   290 raft_server.h:278] Configuration of this group is 192.168.150.17:8107:8108,192.168.138.55:8107:8108,192.168.152.182:8107:8108
F0421 19:28:43.618484   292 external/com_github_brpc_braft/src/braft/node.cpp:2515] Check failed: entry.type() != ENTRY_TYPE_CONFIGURATION (3 vs 3).
#0 0x0000016c7482 logging::DestroyLogStream()
#1 0x0000016c55ff logging::LogMessage::~LogMessage()
#2 0x0000013369fc braft::NodeImpl::handle_append_entries_request()
#3 0x0000013874d3 braft::RaftServiceImpl::append_entries()
#4 0x0000013e2580 braft::RaftService::CallMethod()
#5 0x000001550e22 brpc::policy::ProcessRpcRequest()
#6 0x000001560e8a brpc::ProcessInputMessage()
#7 0x000001560f5b brpc::InputMessenger::InputMessageClosure::~InputMessageClosure()
#8 0x000001561ee1 brpc::InputMessenger::OnNewMessages()
#9 0x00000143663d brpc::Socket::ProcessEvent()
#10 0x000001633c26 bthread::TaskGroup::task_runner()
#11 0x0000016595e1 bthread_make_fcontext

E0421 19:28:44.312357   292 include/backward.hpp:4200] Stack trace (most recent call last) in thread 292:
E0421 19:28:44.312395   292 include/backward.hpp:4200] #13   Object "/opt/typesense-server", at 0x16595e0, in bthread_make_fcontext
E0421 19:28:44.312399   292 include/backward.hpp:4200] #12   Object "/opt/typesense-server", at 0x1633c25, in bthread::TaskGroup::task_runner(long)
E0421 19:28:44.312402   292 include/backward.hpp:4200] #11   Object "/opt/typesense-server", at 0x143663c, in brpc::Socket::ProcessEvent(void*)
E0421 19:28:44.312405   292 include/backward.hpp:4200] #10   Object "/opt/typesense-server", at 0x1561ee0, in brpc::InputMessenger::OnNewMessages(brpc::Socket*)
E0421 19:28:44.312408   292 include/backward.hpp:4200] #9    Object "/opt/typesense-server", at 0x1560f5a, in brpc::InputMessenger::InputMessageClosure::~InputMessageClosure()
E0421 19:28:44.312410   292 include/backward.hpp:4200] #8    Object "/opt/typesense-server", at 0x1560e89, in brpc::ProcessInputMessage(void*)
E0421 19:28:44.312412   292 include/backward.hpp:4200] #7    Object "/opt/typesense-server", at 0x1550e21, in brpc::policy::ProcessRpcRequest(brpc::InputMessageBase*)
E0421 19:28:44.312415   292 include/backward.hpp:4200] #6    Object "/opt/typesense-server", at 0x13e257f, in braft::RaftService::CallMethod(google::protobuf::MethodDescriptor const*, google::protobuf::RpcController*, google::protobuf::Message const*, google::protobuf::Message*, google::protobuf::Closure*)
E0421 19:28:44.312418   292 include/backward.hpp:4200] #5    Object "/opt/typesense-server", at 0x13874d2, in braft::RaftServiceImpl::append_entries(google::protobuf::RpcController*, braft::AppendEntriesRequest const*, braft::AppendEntriesResponse*, google::protobuf::Closure*)
E0421 19:28:44.312420   292 include/backward.hpp:4200] #4    Object "/opt/typesense-server", at 0x1336b0c, in braft::NodeImpl::handle_append_entries_request(brpc::Controller*, braft::AppendEntriesRequest const*, braft::AppendEntriesResponse*, google::protobuf::Closure*, bool)
E0421 19:28:44.312422   292 include/backward.hpp:4200] #3    Object "/opt/typesense-server", at 0x13a4ff3, in braft::LogManager::append_entries(std::vector<braft::LogEntry*, std::allocator<braft::LogEntry*> >*, braft::LogManager::StableClosure*)
E0421 19:28:44.312425   292 include/backward.hpp:4200] #2    Object "/opt/typesense-server", at 0x13a84e4, in braft::ConfigurationEntry::ConfigurationEntry(braft::LogEntry const&)
E0421 19:28:44.312427   292 include/backward.hpp:4200] #1    Object "/opt/typesense-server", at 0x13a825a, in braft::Configuration::operator=(std::vector<braft::PeerId, std::allocator<braft::PeerId> > const&)
E0421 19:28:44.312429   292 include/backward.hpp:4200] #0    Object "/opt/typesense-server", at 0x130ea74, in std::vector<braft::PeerId, std::allocator<braft::PeerId> >::size() const
Segmentation fault (Address not mapped to object [0x8])
E0421 19:28:45.121225   292 src/main/typesense_server.cpp:107] Typesense 0.25.0.rc24 is terminating abruptly.
Adrian
07:34 PM
in this test the cluster did recover without manual intervention, but it was after multiple node crashes and restarts so there was significant loss of service
Adrian
07:35 PM
This seems to occur only when I kill the cluster leader. Any ideas on how to resolve this, Kishore Nallan? I think this will be a blocker for us using Typesense if we cannot figure it out.
Apr 22, 2023 (7 months ago)
Kishore Nallan
11:57 AM
One thing that's strange in the logs above is the mention of "6 peers". That points to an issue in the nodes file, or to nodes being rotated out too fast, before they catch up. A 3-node cluster can tolerate only one node failing. If you rotate two nodes at the same time and bring them back up, the state can get messed up.
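
The one-failure limit for three nodes follows from Raft's majority rule: a cluster of n nodes needs floor(n/2) + 1 of them to agree. A small Python sketch of that arithmetic (the function names are illustrative only):

```python
def quorum(n: int) -> int:
    """Smallest majority of an n-node Raft cluster."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many nodes can be down while a majority is still reachable."""
    return n - quorum(n)


for n in (3, 5):
    print(f"{n} nodes: quorum={quorum(n)}, tolerated failures={tolerated_failures(n)}")
```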

11:58
Apr 23, 2023 (7 months ago)
Adrian
07:14 PM
Yup we are using that flag. And I only rotate one node, and then that node and another node get stuck in a segfault crash loop upon restart
Adrian
07:15 PM
Do you have a response to this point? I think the addition of an extra config option would fix everything
Adrian
07:18 PM
also for your awareness, we set an internal deadline of EOD Monday to solve this before moving on to another search service solution. Which is why I am urgently trying to solve the issues we are facing
Apr 24, 2023 (7 months ago)
Kishore Nallan
12:41 AM
It's not possible to have a proxy IP address in the nodes list. It has to match the peering address.

> Yup we are using that flag. And I only rotate one node, and then that node and another node get stuck in a segfault crash loop upon restart

Can you try running without this flag? I wonder if it's actually being detrimental in other ways, like causing this crash.

Adrian
01:13 PM
sure thing I'll try that. Also related questions:
• does the order of nodes in the node list matter?
• does each node have to have the same node list values or can each have a different list?
Kishore Nallan
01:32 PM
1. Order does not matter
2. Each must eventually have the same list. Let's say a pod goes down and a new pod takes its place. That need not be reflected everywhere at the same time, so you could momentarily have some pods with the 3 old hosts, others with 2 hosts, and the last with 3 hosts, one of which is new.
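
Both answers can be expressed as a simple convergence check: compare node lists as sets, so order is ignored, and treat the cluster as converged once every node holds the same set. This is a minimal Python sketch; `node_set` and `lists_agree` are hypothetical helpers and the addresses are made up.

```python
def node_set(nodes: str) -> frozenset[str]:
    """Parse 'ip:peering_port:api_port,...' into an order-independent set."""
    return frozenset(e.strip() for e in nodes.split(",") if e.strip())


def lists_agree(*node_lists: str) -> bool:
    """True when every given node list contains the same entries, in any order."""
    return len({node_set(n) for n in node_lists}) <= 1


a = "10.0.0.1:8107:8108,10.0.0.2:8107:8108,10.0.0.3:8107:8108"
b = "10.0.0.3:8107:8108,10.0.0.1:8107:8108,10.0.0.2:8107:8108"  # same hosts, reordered
c = "10.0.0.1:8107:8108,10.0.0.2:8107:8108,10.0.0.9:8107:8108"  # one replaced pod

print(lists_agree(a, b))     # same entries, different order
print(lists_agree(a, b, c))  # one list has not caught up yet
```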

Adrian
03:51 PM
Gotcha, makes sense. On point 2, I was thinking each node could have its own IP in the node list, plus the proxy IPs for the other nodes. So with 3 nodes A, B, C, where P(X) is the proxy IP address of node X and IP(X) is its non-proxy IP, the node lists could look like this:
A.node_list = IP(A), P(B), P(C)
B.node_list = P(A), IP(B), P(C)
C.node_list = P(A), P(B), IP(C)

Adrian
03:52 PM
just want to make sure this doesn't break any assumptions in the typesense or braft code (it should be possible to implement in kubernetes). Do you foresee any issues?
Adrian
03:53 PM
If this method works, then every node can have a stable node list at startup, and restarting peers should not cause any issues
Kishore Nallan
03:56 PM
Braft uses the IP address as the identity of the node and is pretty stubborn about it. The earlier issues about rotation being unstable only happen if all 3 pods are rotated at the same time. As long as there is a rolling rotation with max-unavailable 1, this problem should not happen. I don't know why this is such a problem on Kubernetes; I thought max-unavailable 1 was a common pattern for rolling rotations.
Adrian
04:07 PM
Gotcha. I was thinking that the IP addresses would still be the identity, but could not be used as a global identity (since each node would have a different list). I don't think there is a need for global identities in theory in raft, but it may be required due to braft implementation details. Do you know for sure that braft needs the identities to be identical across all the nodes?
Adrian
04:10 PM
And I do hear your point about rotations. Appreciate all the help!
Kishore Nallan
04:11 PM
Braft implementation is that way

Apr 25, 2023 (7 months ago)
Adrian
01:54 PM
After switching to setting IP addresses directly in each node's node list, the cluster seems stable. With the reset flag on, it is stable even if quorum is lost and the leader is killed.
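
A quick way to confirm that a nodes string contains only IP literals, and no hostnames that would still need DNS resolution, is to parse each host with Python's `ipaddress` module. This is a sketch only; `non_ip_entries` is a hypothetical helper, and the sample values are taken from the logs and ConfigMap earlier in this thread.

```python
import ipaddress


def non_ip_entries(nodes: str) -> list[str]:
    """Return the entries of 'host:peering_port:api_port,...' whose host is not an IP literal."""
    bad = []
    for entry in nodes.split(","):
        entry = entry.strip()
        host = entry.rsplit(":", 2)[0]  # drop the two trailing port fields
        try:
            ipaddress.ip_address(host)
        except ValueError:
            bad.append(entry)
    return bad


mixed = (
    "192.168.140.6:8107:8108,"
    "typesense-1.ts.typesense.svc.cluster.local:8107:8108"
)
print(non_ip_entries(mixed))
```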
Kishore Nallan
01:55 PM
Good to hear. So DNS resolution probably has some edge cases, even though in theory it should behave the same.
