# community-help
p
Hi, when testing high availability with raft, when I kill a node, the new elected leader crashes intermittently.
Copy code
ts3     | I20220801 06:02:29.496749   199 raft_server.cpp:534] Term: 4, last_index index: 3, committed_index: 3, known_applied_index: 3, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 0
ts3     | I20220801 06:02:29.496763   199 node.cpp:3029] node default_group:172.18.0.4:8107:8108 change_peers from 172.18.0.3:8107:8108,172.18.0.4:8107:8108,172.18.0.5:8107:8108 to , begin removing.
ts3     | F20220801 06:02:29.496856   272 replicator.cpp:607] Check failed: !entry->peers->empty() log_index=4
ts3     | *** Check failure stack trace: ***
ts3     |     @          0x109b422  google::LogMessage::Fail()
ts3     |     @          0x109b380  google::LogMessage::SendToLog()
ts3     |     @          0x109acc2  google::LogMessage::Flush()
ts3     |     @          0x109e19c  google::LogMessageFatal::~LogMessageFatal()
ts3     |     @           0x7609ad  braft::Replicator::_prepare_entry()
ts3     |     @           0x769415  braft::Replicator::_send_entries()
ts3     |     @           0x769e68  braft::Replicator::_continue_sending()
ts3     |     @           0x7b274d  braft::LogManager::run_on_new_log()
ts3     |     @           0x8085af  bthread::TaskGroup::task_runner()
ts3     |     @           0x963441  bthread_make_fcontext
ts3     | E20220801 06:02:29.858803   272 backward.hpp:4199] Stack trace (most recent call last) in thread 272:
ts3     | E20220801 06:02:29.858835   272 backward.hpp:4199] #12   Object "/opt/typesense-server", at 0x963440, in bthread_make_fcontext
ts3     | E20220801 06:02:29.858839   272 backward.hpp:4199] #11   Object "/opt/typesense-server", at 0x8085ae, in bthread::TaskGroup::task_runner(long)
ts3     | E20220801 06:02:29.858846   272 backward.hpp:4199] #10   Source "/opt/braft-80d97b2475b3c0afca79c19b64d46bb665d704f4/src/braft/log_manager.cpp", line 831, in run_on_new_log [0x7b274c]
ts3     | E20220801 06:02:29.858848   272 backward.hpp:4199] #9    Source "/opt/braft-80d97b2475b3c0afca79c19b64d46bb665d704f4/src/braft/replicator.cpp", line 723, in _continue_sending [0x769e67]
ts3     | E20220801 06:02:29.858850   272 backward.hpp:4199] #8    Source "/opt/braft-80d97b2475b3c0afca79c19b64d46bb665d704f4/src/braft/replicator.cpp", line 649, in _send_entries [0x769414]
ts3     | E20220801 06:02:29.858851   272 backward.hpp:4199] #7    Source "/opt/braft-80d97b2475b3c0afca79c19b64d46bb665d704f4/src/braft/replicator.cpp", line 617, in _prepare_entry [0x7609ac]
ts3     | E20220801 06:02:29.858853   272 backward.hpp:4199] #6    Object "/opt/typesense-server", at 0x109e19b, in google::LogMessageFatal::~LogMessageFatal()
ts3     | E20220801 06:02:29.858855   272 backward.hpp:4199] #5    Object "/opt/typesense-server", at 0x109acc1, in google::LogMessage::Flush()
ts3     | E20220801 06:02:29.858856   272 backward.hpp:4199] #4    Object "/opt/typesense-server", at 0x109b37f, in google::LogMessage::SendToLog()
ts3     | E20220801 06:02:29.858858   272 backward.hpp:4199] #3    Object "/opt/typesense-server", at 0x109b421, in google::LogMessage::Fail()
ts3     | E20220801 06:02:29.858860   272 backward.hpp:4199] #2    Object "/opt/typesense-server", at 0x10a26b9, in google::DumpStackTraceAndExit()
ts3     | E20220801 06:02:29.858861   272 backward.hpp:4199] #1    Source "/build/glibc-SzIz7B/glibc-2.31/stdlib/abort.c", line 79, in abort [0x7f72626e1858]
ts3     | E20220801 06:02:29.858863   272 backward.hpp:4199] #0    Source "../sysdeps/unix/sysv/linux/raise.c", line 51, in raise [0x7f726270200b]
ts3     | Aborted (Signal sent by tkill() 1 0)
Has anybody seen this so far?
k
Looks like the leader's peers are empty. I've seen these types of errors happen if you kill one too many nodes without giving the cluster a chance to regain quorum. Are you giving it sufficient time for the quorum to recover? We've also improved some edge cases in 0.23.1 (I saw you testing with 0.23.0 previously).
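One hedged way to confirm the quorum has recovered before killing the next node is to poll the /debug endpoint (the same one used later in this thread) until some node reports state 1, which corresponds to the leader; the ts1..ts3 hostnames and TYPESENSE_API_KEY variable are assumptions taken from this conversation:
# wait_for_leader: poll each node's /debug endpoint until one reports
# "state":1 (the leader, as observed in the output below), or give up
# after roughly 60 seconds.
wait_for_leader() {
  for attempt in $(seq 1 30); do
    for i in 1 2 3; do
      if curl -s "http://ts${i}:8108/debug" \
           -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" | grep -q '"state":1'; then
        echo "ts${i} is the leader"
        return 0
      fi
    done
    sleep 2
  done
  echo "no leader elected within timeout" >&2
  return 1
}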
p
I just killed one. I saw one of the nodes go into the leader state and then crash like this shortly after.
I will check 0.23.1
k
If it continues to happen, I'll need the full logs leading up to the crash on all nodes.
p
It happens with 0.23.1 as well. Not always, but about once every 2-3 attempts.
let me grab logs
I have 3 nodes, ts{1,2,3}
Copy code
root@1e235d517537:/# for i in $(seq 1 3) ; do echo -n "ts${i}: " ;  curl "http://ts${i}:8108/debug" -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" ; echo ; done
ts1: {"state":1,"version":"0.23.1"}
ts2: {"state":4,"version":"0.23.1"}
ts3: {"state":4,"version":"0.23.1"}
root@1e235d517537:/# for i in $(seq 1 3) ; do echo -n "ts${i}: " ;  curl "http://ts${i}:8108/debug" -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" ; echo ; done
ts1: curl: (6) Could not resolve host: ts1

ts2: {"state":4,"version":"0.23.1"}
ts3: {"state":1,"version":"0.23.1"}
root@1e235d517537:/#
root@1e235d517537:/# for i in $(seq 1 3) ; do echo -n "ts${i}: " ;  curl "http://ts${i}:8108/debug" -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" ; echo ; done
ts1: curl: (6) Could not resolve host: ts1

ts2: {"state":4,"version":"0.23.1"}
ts3: curl: (6) Could not resolve host: ts3
I killed ts1; ts3 was elected leader and crashed shortly after.
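A hedged sketch of one way to capture the full logs requested above from all nodes while reproducing this under Docker Compose (the ts1/ts2/ts3 service names are an assumption based on this thread):
# Stream logs from all three services into one file, then kill a node and
# give the survivors time to elect a leader before checking for the crash.
docker compose logs -f ts1 ts2 ts3 > cluster.log 2>&1 &
LOGS_PID=$!
docker compose kill ts1
sleep 30
grep -n "Check failed" cluster.log || echo "no crash this round"
kill ${LOGS_PID}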
k
Were these brand new nodes?
p
Yes
With no data even
k
What does the nodes file contain? Also, are you running these on localhost or remotely?
p
I am running them in docker compose, so each node is a separate container
Copy code
ts1:8107:8108,ts2:8107:8108,ts3:8107:8108
This is the nodes file
k
Okay, that looks fine. Is it easy for you to just run them from the raw binary on separate ports? Our "chaos monkey" test for clustering runs that way, and I've never seen such a trivial issue via that or during any of our rotations (we've done thousands of live rotations on Typesense Cloud). Not ruling out a bug, but I'm a bit surprised to see the log line about
change_peers from 172.18.0.3:8107:8108,172.18.0.4:8107:8108,172.18.0.5:8107:8108 to , begin removing.
-- not sure why the peer list is becoming empty.
p
You mean in one OS image? Sure I can do that.
It could be some kind of race condition because it doesn't always happen
k
Yes, just start the executables one by one on different ports.
Copy code
./typesense-server --data-dir=/tmp/node-data-1 --api-key=abcd --nodes=/tmp/nodes --api-port=6108 --peering-port=6107 --peering-address=127.0.0.1
Then you can have a nodes file like this locally:
Copy code
127.0.0.1:6107:6108,127.0.0.1:7107:7108,127.0.0.1:8107:8108
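A minimal sketch of scripting that three-node local setup, reusing only the flags shown in the command above (the ports, data directories, and api key are placeholders):
# Write the local nodes file, then launch three instances of the same binary
# on distinct API/peering ports and data dirs.
echo "127.0.0.1:6107:6108,127.0.0.1:7107:7108,127.0.0.1:8107:8108" > /tmp/nodes
for n in 1 2 3; do
  mkdir -p /tmp/node-data-${n}
  ./typesense-server --data-dir=/tmp/node-data-${n} --api-key=abcd \
    --nodes=/tmp/nodes --api-port=$((5108 + n * 1000)) \
    --peering-port=$((5107 + n * 1000)) --peering-address=127.0.0.1 &
done
wait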
p
Ok, let me try that one
k
Since I'm not familiar with Docker Compose, it will be easier for me to debug if we can reproduce it this way.
p
We saw very similar symptoms when running in k8s.
Ok, let me try. I can use the macOS binary to see if it can be reproduced that way.
k
Speaking of Kubernetes: Typesense deployment on k8s has some quirks because the Raft library we use supports only raw IPs. See this comment for details: https://github.com/typesense/typesense/issues/465#issuecomment-1173536082
p
Yes, found that
I can't reproduce the issue locally, and I think that points to the cause.
Both in k8s and Docker Compose, when a container is killed its hostname can no longer be resolved.
If a local process dies, it's just a connection reset at the TCP level.
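A quick sketch of checking that difference (the hostnames, ports, and the abcd key reuse values from earlier in this thread):
# After "docker compose kill ts1": the name itself stops resolving.
getent hosts ts1 || echo "ts1 no longer resolves"
# After killing a bare local process: the address still resolves,
# the connection is simply refused.
curl -s --max-time 2 "http://127.0.0.1:6108/debug" \
  -H "X-TYPESENSE-API-KEY: abcd" || echo "local node down, address still resolvable"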
k
Typesense does resolve hostnames to IPs periodically, and on k8s, having a sidecar that directly generates the IPs works, unless k8s decides to relocate all the nodes.
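For reference, a hedged sketch of that sidecar idea: periodically resolve the peer hostnames and rewrite the nodes file with raw IPs. The ts1..ts3 hostnames, ports, and /tmp/nodes path are assumptions from this thread; see the linked issue comment for the actual approach.
# Resolve each peer hostname with getent and rewrite the nodes file as raw IPs.
while true; do
  entries=()
  for host in ts1 ts2 ts3; do
    ip=$(getent hosts "${host}" | awk '{print $1}')
    [ -n "${ip}" ] && entries+=("${ip}:8107:8108")
  done
  (IFS=,; printf '%s\n' "${entries[*]}") > /tmp/nodes
  sleep 30
done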