# community-help
t
Hi everyone! We're running a self-hosted Typesense cluster with three nodes (0.25.1). Due to a hardware error we had to turn off two nodes; the remaining node `typesense-b` became the sole "cluster" (we removed the other two from its nodes file) and is the leader. After a couple of days the error is finally solved, but the two other nodes are now out of sync, so I wanted to spin them up again.

At first I wanted to start `typesense-c` (with an empty nodes file) -> I get an error:

`Error while refreshing peer configuration: File containing nodes configuration is empty.`

and `curl -H "X-TYPESENSE-API-KEY: xxx" "http://typesense-c:8108/status"` says `not ready`. When I add the node itself into the nodes file of `typesense-c` and the peer refresh is done -> `typesense-c` also becomes a leader.

Finally, my question: how can `typesense-b` remain the leader? My fear is that when I add both nodes to the nodes files of both servers, `typesense-c` will become the leader and I'll have a data consistency problem (because there is no data on `typesense-c`). Do I need to keep these things in an exact sequence (start typesense-c without a nodes file, add typesense-c to the nodes file of typesense-b, ...)?
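(For reference, the nodes file discussed throughout this thread is a single line of comma-separated `ip:peering_port:api_port` entries, passed to Typesense via `--nodes`. A minimal sketch of the single-node state described above; the IP is taken from the log lines later in the thread, the file path is an assumption:)

```bash
# Single-node "cluster": only typesense-b is listed.
# Format: ip:peering_port:api_port (peering on 8107, API on 8108).
echo '192.168.0.1:8107:8108' > /etc/typesense/nodes

# Status check from the thread, with the Slack link-wrapping removed:
curl -H "X-TYPESENSE-API-KEY: xxx" "http://typesense-c:8108/status"
```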
j
You want to first add typesense-c into the nodes file of typesense-b, while typesense-c is still off. Then update the nodes file in typesense-c to now be typesense-b and typesense-c and then start the process on typesense-c. (Make sure you clear the data dir on typesense-c, so it can resync the latest snapshot from typesense-b)
Then once typesense-c is fully synced and reindexed, you want to add typesense-a to the nodes file of all the nodes and then start the process on typesense-a back up
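A sketch of that sequence as shell steps. The paths (`/etc/typesense/nodes`, `/var/lib/typesense`) and the systemd unit name are assumptions from a typical package install; the IPs are the ones that show up in the log lines later in this thread:

```bash
# Step 1 -- on typesense-b, while typesense-c is still OFF:
# add typesense-c to the nodes file (single line, ip:peering_port:api_port).
echo '192.168.0.1:8107:8108,192.168.0.2:8107:8108' > /etc/typesense/nodes

# Step 2 -- on typesense-c: clear the stale data dir so it pulls a fresh
# snapshot from the leader, write the same two-node file, then start it.
rm -rf /var/lib/typesense/*
echo '192.168.0.1:8107:8108,192.168.0.2:8107:8108' > /etc/typesense/nodes
systemctl start typesense-server
```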
t
But then typesense-b would be blocking because that violates `(N-1)/2`. Maybe that's the missing link -> so I could not activate typesense-c without having a short outage of Typesense until both nodes find each other?
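For concreteness, the arithmetic behind this worry (this is the standard Raft majority rule; nothing here is Typesense-specific):

```latex
% A Raft cluster of N configured nodes needs a majority to elect/keep a leader:
\mathrm{quorum}(N) = \left\lfloor \frac{N}{2} \right\rfloor + 1,
\qquad
\text{tolerated failures} = \left\lfloor \frac{N-1}{2} \right\rfloor
% For N = 2: quorum = 2 and tolerated failures = 0, so with typesense-c listed
% in the nodes file but still down, typesense-b alone would seem unable to
% hold quorum. The next reply explains the exception Typesense makes here.
```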
j
> but then typesense-b would be blocking because that violates (N-1)/2

Great observation! The nuance here is that if you have a single-node leader (where the nodes file only contains that single node's IP) and that node is healthy and serving traffic, you can add a 2nd node to the mix and have it sync data from the leader without any issues. We detect this state and let the 2nd node sync data from the leader while the first node is still healthy. If we didn't have this, there would be no way to add new nodes into a cluster when enabling a clustered environment, for example!
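One way to watch this from the outside while the second node catches up (a sketch: `/health` is a documented Typesense endpoint, and `/debug` reports a `state` field, 1 for leader and 4 for follower as far as we recall; verify those values against your version):

```bash
# Poll the leader and the joining node while the snapshot sync runs.
# API key is redacted ("xxx") as elsewhere in this thread.
curl -H "X-TYPESENSE-API-KEY: xxx" "http://typesense-b:8108/debug"   # expect {"state": 1, ...} on the leader
curl -H "X-TYPESENSE-API-KEY: xxx" "http://typesense-c:8108/health"  # flips to {"ok": true} once caught up
```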
t
Okay, so summing up: (1) I need to start typesense-c with a nodes file which contains typesense-b & typesense-c, (2) let typesense-c sync the data, (3) when both are in sync, add typesense-c to the nodes file in typesense-b, and repeat the procedure for typesense-a
j
In 1) you also need to update the nodes file in typesense-b to include typesense-b & typesense-c. Only then will typesense-c sync data from typesense-b
And when repeating for typesense-a, you'd first need to add typesense-a to the nodes file of all the nodes, and then start up typesense-a
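And the typesense-a repetition as a sketch, under the same path assumptions as above (192.168.0.3 is a hypothetical IP for typesense-a; it never appears in the thread):

```bash
# Step 3 -- while typesense-a is still OFF: list all three nodes in the
# nodes file on BOTH typesense-b and typesense-c.
echo '192.168.0.1:8107:8108,192.168.0.2:8107:8108,192.168.0.3:8107:8108' > /etc/typesense/nodes

# Step 4 -- on typesense-a: clear the stale data dir, write the same
# three-node file, then start the process so it resyncs from the leader.
rm -rf /var/lib/typesense/*
echo '192.168.0.1:8107:8108,192.168.0.2:8107:8108,192.168.0.3:8107:8108' > /etc/typesense/nodes
systemctl start typesense-server
```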
t
Hmmm 🤔 - I will give it a try... But which node needs to get both nodes first? Do I need to start typesense-c with both nodes in its file, and then add typesense-c to the nodes file of typesense-b (once typesense-c is up & running)? Because typesense-c must be up, otherwise typesense-b will be blocking - but from a logical perspective, typesense-c must have both nodes and be up & running first, and only afterwards should typesense-b get the new node
j
I didn't fully understand your last message. But I just updated the docs with a more detailed set of steps:
These steps definitely work, and will let you bring the cluster back up in a multi-node setup without having to bring down the whole cluster. (You can ignore the quorum equation during this recovery, since we've accounted for this specifically)
Let me know if I can clarify anything in those steps
t
Okay, thanks for the documentation change, I tried it -> and it is currently catching up 💪
```
W20241213 07:24:18.176734 1702532 node.cpp:843] [default_group:192.168.0.1:8107:8108 ] Refusing concurrent configuration changing
E20241213 07:24:18.176801 1702583 raft_server.h:62] Peer refresh failed, error: Doing another configuration change
I20241213 07:24:23.172806 1702575 node.cpp:754] node default_group:192.168.0.1:8107:8108 waits peer 192.168.0.2:8107:8108 to catch up
```
and after 4-5 minutes it was fine and I got `Peer refresh succeeded!`
Thank you @Jason Bosco for your help!!