Alessandro Tagliapietra
09/06/2021, 11:51 PMKishore Nallan
09/07/2021, 7:45 AMAlessandro Tagliapietra
09/07/2021, 2:31 PMAlessandro Tagliapietra
09/07/2021, 2:32 PMKishore Nallan
09/07/2021, 2:43 PMAlessandro Tagliapietra
09/07/2021, 3:31 PMAlessandro Tagliapietra
09/07/2021, 3:32 PMKishore Nallan
09/07/2021, 3:33 PMAlessandro Tagliapietra
09/07/2021, 3:48 PM can't do pre_vote as it is not in old-ip1,old-ip2,old-ip3
or that when a node tries to join, it tries to connect to the old IPs and they time out
Alessandro Tagliapietra
09/07/2021, 3:48 PMDamien Hardy
09/21/2021, 11:28 AM
I20210921 11:27:47.180541 164 node.cpp:1484] node default_group:10.91.1.147:8107:8108 term 3 start pre_vote
W20210921 11:27:47.180575 164 node.cpp:1494] node default_group:10.91.1.147:8107:8108 can't do pre_vote as it is not in 10.91.4.30:8107:8108,10.91.1.146:8107:8108,10.91.2.156:8107:8108
I20210921 11:27:49.052819 161 raft_server.cpp:544] Term: 3, last_index index: 265, committed_index: 0, known_applied_index: 263, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 518
W20210921 11:27:49.052857 161 raft_server.cpp:571] Multi-node with no leader: refusing to reset peers.
forever
Kishore Nallan
09/21/2021, 11:30 AMDamien Hardy
09/21/2021, 11:31 AMDamien Hardy
09/21/2021, 11:32 AMKishore Nallan
09/21/2021, 11:33 AM /health for readiness and /metrics.json for liveness check. In a 3-pod configuration, only 1 pod can be rotated at a time. If the second pod is rotated too early, the cluster can be toast.
Damien Hardy
09/21/2021, 11:34 AMKishore Nallan
09/21/2021, 11:34 AMKishore Nallan
09/21/2021, 11:35 AMDamien Hardy
09/21/2021, 11:38 AMDamien Hardy
09/21/2021, 11:39 AMKishore Nallan
09/21/2021, 11:41 AMDamien Hardy
09/21/2021, 11:44 AM
E20210921 11:41:13.439110 161 raft_server.cpp:182] Unable to resolve host: typesense-poc-1.typesense-poc.infra.svc.cluster.local
E20210921 11:41:13.448516 161 raft_server.cpp:182] Unable to resolve host: typesense-poc-2.typesense-poc.infra.svc.cluster.local
E20210921 11:41:13.448570 161 configuration.cpp:43] Fail to parse typesense-poc-1.typesense-poc.infra.svc.cluster.local:8107:8108
E20210921 11:41:13.448576 161 raft_server.cpp:51] Failed to parse nodes configuration: `typesense-poc-0.typesense-poc.infra.svc.cluster.local:8107:8108,typesense-poc-1.typesense-poc.infra.svc.cluster.local:8107:8108,typesense-poc-2.typesense-poc.infra.svc.cluster.local:8107:8108` -- will retry shortly...
E20210921 11:41:43.464110 161 raft_server.cpp:182] Unable to resolve host: typesense-poc-1.typesense-poc.infra.svc.cluster.local
E20210921 11:41:43.475195 161 raft_server.cpp:182] Unable to resolve host: typesense-poc-2.typesense-poc.infra.svc.cluster.local
E20210921 11:41:43.475241 161 configuration.cpp:43] Fail to parse typesense-poc-1.typesense-poc.infra.svc.cluster.local:8107:8108
E20210921 11:41:43.475247 161 raft_server.cpp:51] Failed to parse nodes configuration: `typesense-poc-0.typesense-poc.infra.svc.cluster.local:8107:8108,typesense-poc-1.typesense-poc.infra.svc.cluster.local:8107:8108,typesense-poc-2.typesense-poc.infra.svc.cluster.local:8107:8108` -- will retry shortly...
E20210921 11:42:13.489486 161 raft_server.cpp:182] Unable to resolve host: typesense-poc-1.typesense-poc.infra.svc.cluster.local
E20210921 11:42:13.500676 161 raft_server.cpp:182] Unable to resolve host: typesense-poc-2.typesense-poc.infra.svc.cluster.local
E20210921 11:42:13.500722 161 configuration.cpp:43] Fail to parse typesense-poc-1.typesense-poc.infra.svc.cluster.local:8107:8108
E20210921 11:42:13.500726 161 raft_server.cpp:47] Giving up parsing nodes configuration: `typesense-poc-0.typesense-poc.infra.svc.cluster.local:8107:8108,typesense-poc-1.typesense-poc.infra.svc.cluster.local:8107:8108,typesense-poc-2.typesense-poc.infra.svc.cluster.local:8107:8108`
E20210921 11:42:13.500731 161 typesense_server_utils.cpp:256] Failed to start peering state
Here is what happens when the readinessProbe is applied when creating the StatefulSet
Kishore Nallan
09/21/2021, 11:45 AM typesense-poc-1.typesense-poc.infra.svc.cluster.local --> did you exec into a pod to see if that DNS name resolves?
Damien Hardy
09/21/2021, 11:47 AMKishore Nallan
09/21/2021, 11:50 AMDamien Hardy
09/21/2021, 11:52 AMCarl
09/21/2021, 11:53 AMDamien Hardy
09/21/2021, 12:53 PM publishNotReadyAddresses=True
https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/
Kishore Nallan
09/21/2021, 1:14 PM publishNotReadyAddresses=true worked?
Damien Hardy
09/21/2021, 1:22 PM
I20210921 13:20:52.996469 170 node.cpp:1484] node default_group:10.91.4.53:8107:8108 term 1 start pre_vote
W20210921 13:20:52.998214 173 node.cpp:1464] node default_group:10.91.4.53:8107:8108 request PreVote from 10.91.0.52:8107:8108 error: [E2][10.91.0.52:8107][E2]peer_id not exist
W20210921 13:20:52.998258 170 node.cpp:1464] node default_group:10.91.4.53:8107:8108 request PreVote from 10.91.2.166:8107:8108 error: [E2][10.91.2.166:8107][E2]peer_id not exist
I20210921 13:20:58.671039 173 node.cpp:1484] node default_group:10.91.4.53:8107:8108 term 1 start pre_vote
I20210921 13:20:58.673763 165 node.cpp:1435] node default_group:10.91.4.53:8107:8108 received PreVoteResponse from 10.91.0.52:8107:8108 term 1 granted 1
I20210921 13:20:58.673924 165 node.cpp:1549] node default_group:10.91.4.53:8107:8108 term 1 start vote and grant vote self
I20210921 13:20:58.677614 165 raft_meta.cpp:546] Saved single stable meta, path /usr/share/typesense/data/state/meta term 2 votedfor 10.91.4.53:8107:8108 time: 3280
W20210921 13:20:58.677717 170 node.cpp:1402] node default_group:10.91.4.53:8107:8108 received invalid PreVoteResponse from 10.91.2.166:8107:8108 ctx_version 5current_ctx_version 6
I20210921 13:20:58.680274 173 node.cpp:1348] node default_group:10.91.4.53:8107:8108 received RequestVoteResponse from 10.91.0.52:8107:8108 term 2 granted 1
I20210921 13:20:58.680315 173 node.cpp:1783] node default_group:10.91.4.53:8107:8108 term 2 become leader of group 10.91.0.52:8107:8108,10.91.4.53:8107:8108,10.91.2.166:8107:8108
I20210921 13:20:58.680354 173 replicator.cpp:138] Replicator=1099511627789@10.91.0.52:8107:8108 is started, group default_group
I20210921 13:20:58.680768 173 replicator.cpp:138] Replicator=1138166333441@10.91.2.166:8107:8108 is started, group default_group
W20210921 13:20:58.681746 165 node.cpp:1315] node default_group:10.91.4.53:8107:8108 received invalid RequestVoteResponse from 10.91.2.166:8107:8108 ctx_version 1 current_ctx_version 2
I20210921 13:20:58.681857 173 log.cpp:108] Created new segment `/usr/share/typesense/data/state/log/log_inprogress_00000000000000000001' with fd=25
I20210921 13:20:58.683583 170 raft_server.h:254] Configuration of this group is 10.91.0.52:8107:8108,10.91.4.53:8107:8108,10.91.2.166:8107:8108
I20210921 13:20:58.683687 170 node.cpp:3142] node default_group:10.91.4.53:8107:8108 reset ConfigurationCtx, new_peers: 10.91.0.52:8107:8108,10.91.4.53:8107:8108,10.91.2.166:8107:8108, old_peers: 10.91.0.52:8107:8108,10.91.4.53:8107:8108,10.91.2.166:8107:8108
I20210921 13:20:58.684051 165 raft_server.h:237] Node becomes leader, term: 2
I20210921 13:21:00.731025 162 raft_server.cpp:544] Term: 2, last_index index: 1, committed_index: 1, known_applied_index: 1, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 0
I20210921 13:21:00.731228 173 raft_server.h:59] Peer refresh succeeded!
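For reference, the publishNotReadyAddresses setting mentioned above is a field on the Service that backs the StatefulSet. The following is only a sketch of what such a headless Service might look like: the name and namespace are inferred from the DNS names in the logs (typesense-poc.infra), the ports from the 8107:8108 pairs, and the selector label is a hypothetical placeholder that would need to match the real pod labels.

```yaml
# Sketch only; not the manifest actually used in this thread.
apiVersion: v1
kind: Service
metadata:
  name: typesense-poc            # inferred from typesense-poc-0.typesense-poc.infra.svc.cluster.local
  namespace: infra               # inferred from the same DNS name
spec:
  clusterIP: None                # headless: each pod gets its own DNS record
  publishNotReadyAddresses: true # publish pod DNS records even before the pod is Ready
  selector:
    app: typesense-poc           # hypothetical label, adjust to the real pod labels
  ports:
    - name: peering
      port: 8107                 # assumed peering port, from the logs
    - name: api
      port: 8108                 # assumed API port, from the logs
```

With this set, peers can resolve each other's hostnames while a readiness probe is still failing, which is what the "Unable to resolve host" errors earlier in the thread were missing.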
Let's try an update
Kishore Nallan
09/21/2021, 1:22 PMDamien Hardy
09/21/2021, 1:45 PM
I20210921 13:37:53.245733 162 batched_indexer.cpp:174] Running GC for aborted requests, req map size: 0
I20210921 13:38:00.361184 161 raft_server.cpp:544] Term: 2, last_index index: 7, committed_index: 7, known_applied_index: 7, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 0
I20210921 13:38:10.365317 161 raft_server.cpp:544] Term: 2, last_index index: 7, committed_index: 7, known_applied_index: 7, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 0
I20210921 13:38:20.368885 161 raft_server.cpp:544] Term: 2, last_index index: 7, committed_index: 7, known_applied_index: 7, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 0
I20210921 13:38:30.370760 161 raft_server.cpp:544] Term: 2, last_index index: 7, committed_index: 7, known_applied_index: 7, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 0
I20210921 13:38:31.142607 165 node.cpp:1549] node default_group:10.91.1.151:8107:8108 term 2 start vote and grant vote self
I20210921 13:38:31.142769 169 raft_server.h:263] Node stops following { leader_id=10.91.4.54:8107:8108, term=2, status=A follower's leader_id is reset to NULL as it begins to request_vote.}
W20210921 13:38:31.144012 172 node.cpp:1377] node default_group:10.91.1.151:8107:8108 received RequestVoteResponse from 10.91.4.54:8107:8108 error: [E111]Fail to connect Socket{id=384 addr=10.91.4.54:8107} (0x0x7f285a0f7c00): Connection refused
I20210921 13:38:31.146870 165 raft_meta.cpp:546] Saved single stable meta, path /usr/share/typesense/data/state/meta term 3 votedfor 10.91.1.151:8107:8108 time: 3212
I20210921 13:38:31.146903 165 node.cpp:1077] node default_group:10.91.1.151:8107:8108 received handle_timeout_now_request from 10.91.4.54:53018 at term=2
I20210921 13:38:31.151132 168 node.cpp:1348] node default_group:10.91.1.151:8107:8108 received RequestVoteResponse from 10.91.2.169:8107:8108 term 3 granted 1
I20210921 13:38:31.151156 168 node.cpp:1783] node default_group:10.91.1.151:8107:8108 term 3 become leader of group 10.91.4.54:8107:8108,10.91.1.151:8107:8108,10.91.2.169:8107:8108
I20210921 13:38:31.151181 168 replicator.cpp:138] Replicator=2216203124741@10.91.4.54:8107:8108 is started, group default_group
I20210921 13:38:31.151351 168 replicator.cpp:138] Replicator=3311419785217@10.91.2.169:8107:8108 is started, group default_group
W20210921 13:38:31.151746 165 replicator.cpp:392] Group default_group fail to issue RPC to 10.91.4.54:8107:8108 _consecutive_error_times=1, [E111]Fail to connect Socket{id=8589934977 addr=10.91.4.54:8107} (0x0x7f285a0f7e00): Connection refused [R1][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R2][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R3][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977
I20210921 13:38:31.155787 172 raft_server.h:254] Configuration of this group is 10.91.4.54:8107:8108,10.91.1.151:8107:8108,10.91.2.169:8107:8108
I20210921 13:38:31.155813 172 node.cpp:3142] node default_group:10.91.1.151:8107:8108 reset ConfigurationCtx, new_peers: 10.91.4.54:8107:8108,10.91.1.151:8107:8108,10.91.2.169:8107:8108, old_peers: 10.91.4.54:8107:8108,10.91.1.151:8107:8108,10.91.2.169:8107:8108
I20210921 13:38:31.156275 165 raft_server.h:237] Node becomes leader, term: 3
I20210921 13:38:31.251830 165 socket.cpp:2201] Checking Socket{id=8589934977 addr=10.91.4.54:8107} (0x7f285a0f7e00)
W20210921 13:38:33.652561 171 replicator.cpp:392] Group default_group fail to issue RPC to 10.91.4.54:8107:8108 _consecutive_error_times=11, [E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R1][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R2][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R3][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977
W20210921 13:38:34.752591 165 socket.cpp:1193] Fail to wait EPOLLOUT of fd=24: Connection timed out [110]
W20210921 13:38:36.153350 168 replicator.cpp:292] Group default_group fail to issue RPC to 10.91.4.54:8107:8108 _consecutive_error_times=21, [E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R1][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R2][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R3][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977
W20210921 13:38:38.252923 171 socket.cpp:1193] Fail to wait EPOLLOUT of fd=24: Connection timed out [110]
W20210921 13:38:38.654515 171 replicator.cpp:292] Group default_group fail to issue RPC to 10.91.4.54:8107:8108 _consecutive_error_times=31, [E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R1][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R2][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R3][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977
I20210921 13:38:40.373698 161 raft_server.cpp:544] Term: 3, last_index index: 8, committed_index: 8, known_applied_index: 8, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 0
I20210921 13:38:40.373783 165 raft_server.h:59] Peer refresh succeeded!
W20210921 13:38:41.155587 172 replicator.cpp:292] Group default_group fail to issue RPC to 10.91.4.54:8107:8108 _consecutive_error_times=41, [E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R1][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R2][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R3][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977
W20210921 13:38:41.753235 165 socket.cpp:1193] Fail to wait EPOLLOUT of fd=24: Connection timed out [110]
W20210921 13:38:43.656654 172 replicator.cpp:392] Group default_group fail to issue RPC to 10.91.4.54:8107:8108 _consecutive_error_times=51, [E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R1][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R2][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R3][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977
W20210921 13:38:45.253569 165 socket.cpp:1193] Fail to wait EPOLLOUT of fd=24: Connection timed out [110]
W20210921 13:38:46.157712 171 replicator.cpp:392] Group default_group fail to issue RPC to 10.91.4.54:8107:8108 _consecutive_error_times=61, [E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R1][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R2][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R3][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977
W20210921 13:38:48.658804 165 replicator.cpp:392] Group default_group fail to issue RPC to 10.91.4.54:8107:8108 _consecutive_error_times=71, [E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R1][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R2][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977 [R3][E112]Not connected to 10.91.4.54:8107 yet, server_id=8589934977
W20210921 13:38:48.753927 165 socket.cpp:1193] Fail to wait EPOLLOUT of fd=24: Connection timed out [110]
E20210921 13:38:50.383044 161 raft_server.cpp:182] Unable to resolve host: typesense-poc-0.typesense-poc.infra.svc.cluster.local
E20210921 13:38:50.383810 161 configuration.cpp:43] Fail to parse typesense-poc-0.typesense-poc.infra.svc.cluster.local:8107:8108
I20210921 13:38:50.383844 161 raft_server.cpp:544] Term: 3, last_index index: 8, committed_index: 8, known_applied_index: 8, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 0
I20210921 13:38:50.383863 161 node.cpp:3029] node default_group:10.91.1.151:8107:8108 change_peers from 10.91.4.54:8107:8108,10.91.1.151:8107:8108,10.91.2.169:8107:8108 to , begin removing.
F20210921 13:38:50.383977 172 replicator.cpp:607] Check failed: !entry->peers->empty() log_index=9
*** Check failure stack trace: ***
@ 0x1592de2 google::LogMessage::Fail()
@ 0x1592d40 google::LogMessage::SendToLog()
@ 0x1592682 google::LogMessage::Flush()
@ 0x1595b5c google::LogMessageFatal::~LogMessageFatal()
@ 0xc5754d braft::Replicator::_prepare_entry()
@ 0xc5ffb5 braft::Replicator::_send_entries()
@ 0xc60a08 braft::Replicator::_continue_sending()
@ 0xca92ed braft::LogManager::run_on_new_log()
@ 0xcff76f bthread::TaskGroup::task_runner()
@ 0xe5a8c1 bthread_make_fcontext
E20210921 13:38:50.651902 172 backward.hpp:4199] Stack trace (most recent call last) in thread 172:
E20210921 13:38:50.651933 172 backward.hpp:4199] #12 Object "/opt/typesense-server", at 0xe5a8c0, in bthread_make_fcontext
E20210921 13:38:50.651943 172 backward.hpp:4199] #11 Object "/opt/typesense-server", at 0xcff76e, in bthread::TaskGroup::task_runner(long)
E20210921 13:38:50.651948 172 backward.hpp:4199] #10 Source "/opt/braft-c649789133566dc06e39ebd0c69a824f8e98993a/src/braft/log_manager.cpp", line 831, in run_on_new_log [0xca92ec]
E20210921 13:38:50.651952 172 backward.hpp:4199] #9 Source "/opt/braft-c649789133566dc06e39ebd0c69a824f8e98993a/src/braft/replicator.cpp", line 723, in _continue_sending [0xc60a07]
E20210921 13:38:50.651955 172 backward.hpp:4199] #8 Source "/opt/braft-c649789133566dc06e39ebd0c69a824f8e98993a/src/braft/replicator.cpp", line 649, in _send_entries [0xc5ffb4]
E20210921 13:38:50.651959 172 backward.hpp:4199] #7 Source "/opt/braft-c649789133566dc06e39ebd0c69a824f8e98993a/src/braft/replicator.cpp", line 617, in _prepare_entry [0xc5754c]
E20210921 13:38:50.651962 172 backward.hpp:4199] #6 Object "/opt/typesense-server", at 0x1595b5b, in google::LogMessageFatal::~LogMessageFatal()
E20210921 13:38:50.651965 172 backward.hpp:4199] #5 Object "/opt/typesense-server", at 0x1592681, in google::LogMessage::Flush()
E20210921 13:38:50.651969 172 backward.hpp:4199] #4 Object "/opt/typesense-server", at 0x1592d3f, in google::LogMessage::SendToLog()
E20210921 13:38:50.651975 172 backward.hpp:4199] #3 Object "/opt/typesense-server", at 0x1592de1, in google::LogMessage::Fail()
E20210921 13:38:50.651979 172 backward.hpp:4199] #2 Object "/opt/typesense-server", at 0x159a079, in google::DumpStackTraceAndExit()
E20210921 13:38:50.651983 172 backward.hpp:4199] #1 Object "/lib/x86_64-linux-gnu/libc-2.23.so", at 0x7f288a992039, in abort
E20210921 13:38:50.651988 172 backward.hpp:4199] #0 Object "/lib/x86_64-linux-gnu/libc-2.23.so", at 0x7f288a990438, in raise
Aborted (Signal sent by tkill() 1 0)
E20210921 13:38:50.939128 172 typesense_server.cpp:88] Typesense is terminating abruptly.
Kishore Nallan
09/21/2021, 1:46 PMDamien Hardy
09/21/2021, 1:46 PMDamien Hardy
09/21/2021, 1:47 PMKishore Nallan
09/21/2021, 1:47 PM Unable to resolve host: typesense-poc-0.typesense-poc.infra.svc.cluster.local
so it seems like even typesense-0 is not getting resolved?
Damien Hardy
09/21/2021, 1:47 PMKishore Nallan
09/21/2021, 1:47 PMKishore Nallan
09/21/2021, 1:48 PMDamien Hardy
09/21/2021, 1:50 PMKishore Nallan
09/21/2021, 1:54 PMKishore Nallan
09/21/2021, 1:55 PMKishore Nallan
09/21/2021, 1:57 PMDamien Hardy
09/21/2021, 1:58 PMDamien Hardy
09/21/2021, 1:59 PMKishore Nallan
09/21/2021, 2:00 PMKishore Nallan
09/21/2021, 2:00 PMDamien Hardy
09/21/2021, 2:00 PMKishore Nallan
09/21/2021, 2:01 PMKishore Nallan
09/21/2021, 2:04 PM F20210921 13:38:50.383977 172 replicator.cpp:607] Check failed: !entry->peers->empty() log_index=9
Manually controlling rotation is probably what we can try first, to see whether the health readiness check is somehow not enough.
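As a point of comparison for the "health readiness" setup being discussed, here is a rough sketch of the probes suggested earlier in the thread (/health for readiness, /metrics.json for liveness) as they might appear in the StatefulSet pod template. The container name, port 8108, the timing values, and the API key header are all assumptions rather than values taken from this thread; /metrics.json is normally an authenticated endpoint, so some way of passing a key is likely needed.

```yaml
# Sketch only; a fragment of spec.template.spec, with illustrative values.
containers:
  - name: typesense              # hypothetical container name
    readinessProbe:
      httpGet:
        path: /health
        port: 8108               # assumed API port, from the 8107:8108 pairs in the logs
      periodSeconds: 5           # illustrative timing
      failureThreshold: 3
    livenessProbe:
      httpGet:
        path: /metrics.json
        port: 8108
        httpHeaders:
          - name: X-TYPESENSE-API-KEY
            value: "<api-key>"   # placeholder; supply the real key
      periodSeconds: 10          # illustrative timing
```

The readiness probe is what makes a rolling update wait for a rotated pod before moving on to the next one, so only one of the three pods should be out of the cluster at a time.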
Kishore Nallan
09/21/2021, 2:11 PMAljosa Asanovic
11/11/2021, 3:50 AMAljosa Asanovic
11/11/2021, 3:51 AMKishore Nallan
11/11/2021, 3:57 AMKishore Nallan
11/11/2021, 3:59 AMAljosa Asanovic
11/23/2021, 4:11 AM partition setting on the rollingUpdate. By setting it to 2 and applying an update, only the third ordinal typesense node in a statefulset gets updated. Once all three nodes are in sync, I lower the setting to 1 and apply my change again, which only affects the second node. Lastly, 0 to finish up. It ensures it goes one at a time.
spec:
  serviceName: ts
  podManagementPolicy: Parallel
  replicas: 3
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2
Kishore Nallan
11/23/2021, 4:28 AM