#community-help

Segfault in Typesense 0.25.0rc24 during Node Restart

TLDR Charlie reported a segfault while restarting a node in a k8s deployment running 0.25.0.rc24. Kishore Nallan advised rolling rotations (one node at a time) and confirmed that nodes will join as the cluster expands.

May 14, 2023 (7 months ago)
Charlie
08:58 PM
Hello, we are using version 0.25.0rc24 and hit a segfault on one node while restarting a different node. I don't see a changelog for RC versions, and am hoping to verify that the latest rc29 fixes this (or to submit a bug report 🙂). I'm also curious whether there is a timeline for a GA release of 0.25. Thank you!
Charlie
08:58 PM
I0512 23:24:54.815885   566 external/com_github_brpc_braft/src/braft/node.cpp:2202] node default_group:192.168.131.159:8107:8108 received RequestVote from 192.168.128.63:8107:8108 in term 44 current_term 43 log_is_ok 1 votable_time 0
I20230512 23:24:54.815958   550 raft_server.h:287] Node stops following { leader_id=192.168.132.42:8107:8108, term=43, status=Raft node receives higher term request_vote_request.}
I0512 23:24:54.822806   566 external/com_github_brpc_braft/src/braft/raft_meta.cpp:546] Saved single stable meta, path /usr/share/typesense/data/state/meta term 44 votedfor 0.0.0.0:0:0 time: 6865
I0512 23:24:54.824909   566 external/com_github_brpc_braft/src/braft/raft_meta.cpp:546] Saved single stable meta, path /usr/share/typesense/data/state/meta term 44 votedfor 192.168.128.63:8107:8108 time: 2058
I20230512 23:24:54.827086   566 raft_server.h:283] Node starts following { leader_id=192.168.128.63:8107:8108, term=44, status=Follower receives message from new leader with the same term.}
I20230512 23:24:55.325982   552 raft_server.h:278] Configuration of this group is 192.168.132.42:8107:8108,192.168.128.63:8107:8108,192.168.131.159:8107:8108
F0512 23:25:00.412377   552 external/com_github_brpc_braft/src/braft/node.cpp:2515] Check failed: entry.type() != ENTRY_TYPE_CONFIGURATION (3 vs 3).
#0 0x0000016c7482 logging::DestroyLogStream()
#1 0x0000016c55ff logging::LogMessage::~LogMessage()
#2 0x0000013369fc braft::NodeImpl::handle_append_entries_request()
#3 0x0000013874d3 braft::RaftServiceImpl::append_entries()
#4 0x0000013e2580 braft::RaftService::CallMethod()
#5 0x000001550e22 brpc::policy::ProcessRpcRequest()
#6 0x000001560e8a brpc::ProcessInputMessage()
#7 0x000001560f5b brpc::InputMessenger::InputMessageClosure::~InputMessageClosure()
#8 0x000001561ee1 brpc::InputMessenger::OnNewMessages()
#9 0x00000143663d brpc::Socket::ProcessEvent()
#10 0x000001633c26 bthread::TaskGroup::task_runner()
#11 0x0000016595e1 bthread_make_fcontext

E0512 23:25:01.179824   552 include/backward.hpp:4200] Stack trace (most recent call last) in thread 552:
E0512 23:25:01.179858   552 include/backward.hpp:4200] #13   Object "/opt/typesense-server", at 0x16595e0, in bthread_make_fcontext
E0512 23:25:01.179862   552 include/backward.hpp:4200] #12   Object "/opt/typesense-server", at 0x1633c25, in bthread::TaskGroup::task_runner(long)
E0512 23:25:01.179864   552 include/backward.hpp:4200] #11   Object "/opt/typesense-server", at 0x143663c, in brpc::Socket::ProcessEvent(void*)
E0512 23:25:01.179867   552 include/backward.hpp:4200] #10   Object "/opt/typesense-server", at 0x1561ee0, in brpc::InputMessenger::OnNewMessages(brpc::Socket*)
E0512 23:25:01.179870   552 include/backward.hpp:4200] #9    Object "/opt/typesense-server", at 0x1560f5a, in brpc::InputMessenger::InputMessageClosure::~InputMessageClosure()
E0512 23:25:01.179872   552 include/backward.hpp:4200] #8    Object "/opt/typesense-server", at 0x1560e89, in brpc::ProcessInputMessage(void*)
E0512 23:25:01.179874   552 include/backward.hpp:4200] #7    Object "/opt/typesense-server", at 0x1550e21, in brpc::policy::ProcessRpcRequest(brpc::InputMessageBase*)
E0512 23:25:01.179879   552 include/backward.hpp:4200] #6    Object "/opt/typesense-server", at 0x13e257f, in braft::RaftService::CallMethod(google::protobuf::MethodDescriptor const*, google::protobuf::RpcController*, google::protobuf::Message const*, google::protobuf::Message*, google::protobuf::Closure*)
E0512 23:25:01.179883   552 include/backward.hpp:4200] #5    Object "/opt/typesense-server", at 0x13874d2, in braft::RaftServiceImpl::append_entries(google::protobuf::RpcController*, braft::AppendEntriesRequest const*, braft::AppendEntriesResponse*, google::protobuf::Closure*)
E0512 23:25:01.179886   552 include/backward.hpp:4200] #4    Object "/opt/typesense-server", at 0x1336b0c, in braft::NodeImpl::handle_append_entries_request(brpc::Controller*, braft::AppendEntriesRequest const*, braft::AppendEntriesResponse*, google::protobuf::Closure*, bool)
E0512 23:25:01.179890   552 include/backward.hpp:4200] #3    Object "/opt/typesense-server", at 0x13a4ff3, in braft::LogManager::append_entries(std::vector<braft::LogEntry*, std::allocator<braft::LogEntry*> >*, braft::LogManager::StableClosure*)
E0512 23:25:01.179894   552 include/backward.hpp:4200] #2    Object "/opt/typesense-server", at 0x13a84e4, in braft::ConfigurationEntry::ConfigurationEntry(braft::LogEntry const&)
E0512 23:25:01.179897   552 include/backward.hpp:4200] #1    Object "/opt/typesense-server", at 0x13a825a, in braft::Configuration::operator=(std::vector<braft::PeerId, std::allocator<braft::PeerId> > const&)
E0512 23:25:01.179900   552 include/backward.hpp:4200] #0    Object "/opt/typesense-server", at 0x130ea74, in std::vector<braft::PeerId, std::allocator<braft::PeerId> >::size() const
Segmentation fault (Address not mapped to object [0x8])
E0512 23:25:01.796072   552 src/main/typesense_server.cpp:107] Typesense 0.25.0.rc24 is terminating abruptly.
May 15, 2023 (7 months ago)
Kishore Nallan
03:21 AM
How are you deploying Typesense? Kubernetes?
Adrian
02:56 PM
Hey, Charlie is my coworker. Yes, we are deploying in k8s. We were rotating a single node in a 3-node cluster when this occurred.
Kishore Nallan
02:59 PM
These kinds of errors typically only happen if the nodes are somehow not rotated carefully, one by one. Also, do you use DNS or IP addresses for the pods?
Charlie
03:13 PM
We are using IP addresses. The IP address list is updated every 10 seconds, and when a node goes offline its IP is replaced with a dummy IP in the list, to keep the count of IPs at 3. Looking at the logs above the error, it appears that no dummy IPs were in place at the time of the error (or they were placed without logging).
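(For context, the Typesense nodes file is a single line of comma-separated ip:peering_port:api_port entries, the same format as the "Configuration of this group" log line above. A minimal sketch of the updater behavior Charlie describes; the file path, dummy address, and pod IPs are illustrative assumptions.)

```python
# Hypothetical reconstruction of the updater Charlie describes: every
# 10 seconds, rewrite the --nodes file, padding with a dummy entry for
# any offline pod so the member count stays at 3.
PEERING_PORT, API_PORT = 8107, 8108
NODES_FILE = "/usr/share/typesense/nodes"  # assumed path
DUMMY_IP = "0.0.0.0"                       # assumed placeholder address

def write_nodes_file(live_ips, expected_count=3):
    entries = [f"{ip}:{PEERING_PORT}:{API_PORT}" for ip in live_ips]
    while len(entries) < expected_count:   # pad to a fixed member count
        entries.append(f"{DUMMY_IP}:{PEERING_PORT}:{API_PORT}")
    with open(NODES_FILE, "w") as f:
        f.write(",".join(entries))

# e.g. one pod offline -> two real peers plus one dummy entry
write_nodes_file(["192.168.132.42", "192.168.128.63"])
```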
Kishore Nallan
03:14 PM
Can you not place a dummy IP? It's okay to have just 2 IPs temporarily.
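(Applied to the sketch above, that just means dropping the padding and writing only the live peers.)

```python
# Kishore's suggestion, applied to the sketch above: list only the
# live peers; temporarily having 2 of 3 entries is fine.
def write_live_nodes(live_ips):
    entries = [f"{ip}:{PEERING_PORT}:{API_PORT}" for ip in live_ips]
    with open(NODES_FILE, "w") as f:
        f.write(",".join(entries))
```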
Charlie
03:15 PM
Yes, I will make that change. Is it OK to have 1 IP temporarily?
If we move to a 5-node cluster, is it OK to temporarily have 3 IPs?
Kishore Nallan
03:16 PM
Technically you should only ever do rolling rotations: rotate one node, then wait for it to become healthy before doing the next.
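(A sketch of that procedure, assuming the stock Typesense /health endpoint, which returns {"ok": true}; the pod names, namespace, and in-cluster DNS names are assumptions.)

```python
# Rolling rotation sketch: restart one pod at a time and wait until its
# /health endpoint reports ok before touching the next.
import json
import subprocess
import time
import urllib.request

def healthy(host, api_port=8108):
    try:
        url = f"http://{host}:{api_port}/health"
        with urllib.request.urlopen(url, timeout=2) as r:
            return json.load(r).get("ok") is True
    except OSError:
        return False

for pod in ["typesense-0", "typesense-1", "typesense-2"]:
    subprocess.run(["kubectl", "delete", "pod", pod, "-n", "typesense"],
                   check=True)
    # Wait for the replacement pod to become healthy before moving on.
    while not healthy(f"{pod}.typesense.typesense.svc.cluster.local"):
        time.sleep(5)
```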
Charlie
03:16 PM
That makes sense. I'm thinking about the initialization (or recovery) process.
Kishore Nallan
03:17 PM
A 3-node Raft cluster can have at most 1 node unavailable; 5 nodes can tolerate 2 nodes being down. But I'd just play it safe.
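(The arithmetic behind those numbers: a Raft cluster of N nodes needs a majority quorum of floor(N/2) + 1 to elect a leader and commit writes, so it tolerates the remaining N - quorum failures.)

```python
def fault_tolerance(n):
    quorum = n // 2 + 1          # majority needed to elect/commit
    return n - quorum            # nodes that can be down simultaneously

assert fault_tolerance(3) == 1   # 3-node cluster tolerates 1 down
assert fault_tolerance(5) == 2   # 5-node cluster tolerates 2 down
```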
Charlie
03:21 PM
Gotcha, thanks! Given how Kubernetes works with pod DNS resolution, when we first install the Typesense cluster, one pod will come up at a time. If the list has a single IP to start, and soon after has 3 IPs, will Typesense boot in single-node mode and be unable to transition to a multi-node cluster?
Kishore Nallan
03:22 PM
Nope, that's perfectly fine. Nodes will join as the cluster expands.
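(So a bootstrap where the nodes file grows as pods come up is safe; a hypothetical progression, reusing the earlier sketch and its illustrative IPs.)

```python
# Hypothetical bootstrap: the nodes file grows as k8s pods come up one
# at a time; per Kishore, the first node does not get stuck in
# single-node mode, and later peers join as the cluster expands.
write_live_nodes(["192.168.132.42"])
write_live_nodes(["192.168.132.42", "192.168.128.63"])
write_live_nodes(["192.168.132.42", "192.168.128.63", "192.168.131.159"])
```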

