Hi everyone I had question s about backups and restoring bac typesense #community-help

Anirudh Atodaria

03/20/2025, 4:22 PM

Hi everyone, I had question(s) about backups and restoring backups for an HA cluster.

## TL;DR
When restoring an (old) backup on a node, does it take longer for the node to get back online because it has to sync with the other nodes? When restoring only one node, should I restore from an old backup (10-12h old), or take a new backup from another node and restore that on the failed node?

## Context
We're playing around with self-hosting Typesense (HA). We have a 3-node cluster (3 VMs, FYI), and everything works fine thanks to the docs being on point 🙌 We use cron on the VMs to run a script that uses the snapshot API and creates a backup tar. We store it on a separate disk attached to the VM, and we back that disk up every 24h. Now my question is more about restoring this backup: imagine 1 node goes down (maybe the leader) while the other 2 nodes are alive and working. What is the best way to restore it?

## What I did
• I stopped one node; the other 2 were alive.
• I removed the data dir from that node.
• I used the backup (which was probably 10-12h old) and restored it.
• Started the Typesense server again.
• It loaded all documents for the collection we have.
• However, the health endpoint returned false.

Why is that? My guess is that because the backup was old, the node is behind the other 2 nodes. FYI, the node came back up (alive, /health returned true) after a couple of hours, so I suppose my guess is correct?
• Also, if that is the case, creating a backup from one of the 2 alive nodes and restoring it on the failed node should work instantly?
• Last question: restoring the same backup on all nodes should work instantly as well?
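For reference, a minimal sketch of the two HTTP calls this message relies on: triggering a snapshot and reading the health endpoint. It assumes the documented `POST /operations/snapshot?snapshot_path=...` endpoint, the `X-TYPESENSE-API-KEY` header, and a `/health` body of the form `{"ok": true}`. The helper names, host, and paths are hypothetical, and this is not the poster's actual cron script.

```python
import json
import urllib.error
import urllib.parse
import urllib.request


def build_snapshot_request(base_url, api_key, snapshot_path):
    """Build (but do not send) a POST request asking a node to write a snapshot."""
    query = urllib.parse.urlencode({"snapshot_path": snapshot_path})
    url = f"{base_url.rstrip('/')}/operations/snapshot?{query}"
    return urllib.request.Request(
        url, method="POST", headers={"X-TYPESENSE-API-KEY": api_key}
    )


def build_health_request(base_url):
    """Build (but do not send) a GET request for the node's health endpoint."""
    return urllib.request.Request(f"{base_url.rstrip('/')}/health", method="GET")


def fetch_body(request, timeout=10):
    """Send a request and return the body, including bodies of error responses."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # An unhealthy node may answer with a non-2xx status but still send a body.
        return err.read().decode("utf-8")


def is_healthy(body):
    """Interpret a /health response body such as '{"ok": true}'."""
    try:
        return json.loads(body).get("ok") is True
    except (ValueError, AttributeError):
        return False
```

A cron-driven script could call `fetch_body(build_snapshot_request(...))`, then tar the snapshot directory, and a monitoring check could call `is_healthy(fetch_body(build_health_request(...)))`.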

Anirudh Atodaria

03/20/2025, 4:23 PM

These were some interesting logs from the failed node after I restored an old backup:

E20250320 15:53:38.892776 210970 raft_server.cpp:762] 15785 queued writes > healthy read lag of 1000
E20250320 15:53:38.892841 210970 raft_server.cpp:774] 15785 queued writes > healthy write lag of 500

I can also share the entire log file
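The two log lines above suggest the node stays unhealthy while its queued writes exceed a read-lag limit of 1000 and a write-lag limit of 500. These limits likely correspond to the `--healthy-read-lag` and `--healthy-write-lag` server parameters, though that mapping is an assumption here. A small sketch for pulling the numbers out of such lines, so the backlog can be watched as it drains (the regex and function name are hypothetical and based only on the two lines shown):

```python
import re

# Matches lines like:
#   "... 15785 queued writes > healthy read lag of 1000"
LAG_LINE = re.compile(
    r"(?P<queued>\d+) queued writes > healthy (?P<kind>read|write) lag of (?P<limit>\d+)"
)


def parse_lag_line(line):
    """Return queued-write count, lag kind, and limit from a log line, or None."""
    match = LAG_LINE.search(line)
    if match is None:
        return None
    return {
        "queued": int(match["queued"]),
        "kind": match["kind"],
        "limit": int(match["limit"]),
    }
```

Running this over the log periodically and plotting `queued` would show roughly how fast the restored node is catching up.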

Anirudh Atodaria

03/20/2025, 4:23 PM

Another thing I noticed is that, when I restored, it loaded all documents multiple times throughout the 2 hours it was down.

Jason Bosco

03/21/2025, 3:55 AM

When you have an HA cluster, and one node goes down and you want to restore it, it's best to just clear the data dir on that node and then start it back up. It will automatically sync the data from the other two nodes. There's no need to do any extra snapshot / restore etc.
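A minimal sketch of the "clear the data dir" step described above, assuming the Typesense process on that node has already been stopped and will be started again afterwards by whatever manages it. The function name and the example path are hypothetical; the directory itself is kept so the server can start into an empty data dir.

```python
import shutil
from pathlib import Path


def clear_data_dir(data_dir):
    """Delete everything inside data_dir but keep the directory itself."""
    path = Path(data_dir)
    if not path.is_dir():
        raise FileNotFoundError(f"data dir not found: {path}")
    for entry in path.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove sub-directories recursively
        else:
            entry.unlink()  # remove regular files and symlinks
    return path


# Hypothetical usage on the failed node (stop the server first, start it after):
# clear_data_dir("/var/lib/typesense")
```

This only touches the local disk; the resync itself is done by Typesense once the node rejoins the other two.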

👍 1

Jason Bosco

03/21/2025, 3:56 AM

Snapshot restore is only helpful during major disaster recovery, where, say, for some reason all 3 nodes are gone and the data is also gone on all 3 nodes (hopefully never happens!). When that happens, you want to take a previous snapshot, start a single node with it, then add a 2nd node and let it sync from the 1st node, and then start a 3rd and let it sync from the other two nodes to establish a cluster.
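The staged bring-up described above can be sketched as a loop that starts one node at a time and waits for it to report healthy before moving on. This is only an illustration of the ordering: `start_node` and `is_healthy` are hypothetical callables that the operator would supply (for example, a service start command and a `/health` check), and the polling interval and timeout are arbitrary assumptions.

```python
import time


def rebuild_cluster(nodes, start_node, is_healthy, poll_seconds=5.0,
                    timeout_seconds=3600.0, sleep=time.sleep, clock=time.monotonic):
    """Start nodes one by one, waiting for each to become healthy first.

    nodes[0] is expected to be the node that was seeded with the snapshot;
    later nodes start empty and sync from the ones already running.
    """
    started = []
    for node in nodes:
        start_node(node)
        deadline = clock() + timeout_seconds
        while not is_healthy(node):
            if clock() >= deadline:
                raise TimeoutError(f"node {node!r} did not become healthy in time")
            sleep(poll_seconds)
        started.append(node)
    return started
```

Waiting on health between steps means each new node has finished syncing before the next one joins.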

👍 1

Jason Bosco

03/21/2025, 3:58 AM

It's not advisable to load a really old snapshot onto a node in a cluster that has had a lot of ongoing writes since that snapshot, since all those writes will have to be replayed on this new node, and that might take a long time. Instead, you want to let Typesense resync the data for you from another node internally, like I mentioned above.

👍 1

Anirudh Atodaria

03/21/2025, 1:36 PM

Ah, that makes sense. That's very helpful. Thank you so much, Jason! Appreciate such a detailed response 🙏 🙌

Jason Bosco

03/21/2025, 1:42 PM

Thank you for the question! We’ll probably create an article in the docs based on this

🙏 1

🙌 1

Anirudh Atodaria

03/21/2025, 1:51 PM

Thanks! That'd be plenty helpful for others 🫡
