Node Boot Errors in Typesense Cluster

TLDR Sergio experienced boot errors on a Typesense cluster node. Kishore Nallan suggested deleting the data directory and restarting, which resolved the issue.

Feb 27, 2023 (7 months ago)
Sergio
02:58 PM
Hey :typesense: Team, this morning one of our nodes hit errors while booting, and I'm looking for some ideas.
We have a cluster of 3 nodes. While starting the third node (the other 2 were healthy) we got the following errors.
Any thoughts?
02:59 PM
We were rolling out OS updates on the hosts.

On the reboot of the Typesense node we got this error:

I20230227 13:38:14.356796    94 store.h:307] DB open success!
I20230227 13:38:14.356813    94 raft_server.cpp:479] Loading collections from disk...
I20230227 13:38:14.356819    94 collection_manager.cpp:132] CollectionManager::load()
I20230227 13:38:14.356945    94 auth_manager.cpp:32] Indexing 9 API key(s) found on disk.
I20230227 13:38:14.357082    94 collection_manager.cpp:152] Loading upto 8 collections in parallel, 1000 documents at a time.
I20230227 13:38:14.357206    94 collection_manager.cpp:159] Found 2 collection(s) on disk.
I20230227 13:38:14.362864   131 collection_manager.cpp:83] Found collection products-8 with 4 memory shards.
I20230227 13:38:14.365340   130 collection_manager.cpp:83] Found collection products-7 with 4 memory shards.
I20230227 13:38:14.368235   130 collection_manager.cpp:1114] Loading collection products-7
I20230227 13:38:14.371126   131 collection_manager.cpp:1114] Loading collection products-8
E20230227 13:38:14.691952    85 backward.hpp:4199] Stack trace (most recent call last) in thread 85:
E20230227 13:38:14.701339    85 backward.hpp:4199] #5    Object "", at 0xffffffffffffffff, in
E20230227 13:38:14.702204    85 backward.hpp:4199] #4    Object "/lib/x86_64-linux-gnu/libc-2.23.so", at 0x7f84dcbb251c, in __clone
E20230227 13:38:14.702880    85 backward.hpp:4199] #3    Object "/lib/x86_64-linux-gnu/libpthread-2.23.so", at 0x7f84dd5916b9, in start_thread
E20230227 13:38:14.703495    85 backward.hpp:4199] #2    Source "../../../../../libstdc++-v3/src/c++11/thread.cc", line 80, in execute_native_thread_routine [0x14ae6cf]
E20230227 13:38:14.704129    85 backward.hpp:4199] #1    Source "/typesense/src/batched_indexer.cpp", line 123, in run [0x50cd91]
E20230227 13:38:14.704944    85 backward.hpp:4199] #0    Source "/typesense/include/store.h", line 169, in scan [0x512925]
Segmentation fault (Address not mapped to object [(nil)])
E20230227 13:38:15.616197    85 typesense_server.cpp:95] Typesense 0.23.1 is terminating abruptly.
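
For reading a log like the one above, here is a minimal Python sketch that parses lines in this glog-style format and returns the first error entry. The names `GLOG_LINE`, `LogEntry`, `parse_line`, and `first_error` are made up for illustration and are not part of Typesense. In this log the first error is the stack trace from thread 85 (the `batched_indexer.cpp` frame), while the collections were being loaded on threads 130 and 131.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Matches lines such as:
#   E20230227 13:38:14.691952    85 backward.hpp:4199] Stack trace ...
# Assumption: severity letter (I/W/E/F), date, time, thread id, file:line, message.
GLOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{8}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<source>[^:\s]+):(?P<line>\d+)\] (?P<message>.*)$"
)


class LogEntry(NamedTuple):
    level: str
    thread: int
    source: str
    line: int
    message: str


def parse_line(text: str) -> Optional[LogEntry]:
    """Parse one log line; return None for lines that are not in this format."""
    m = GLOG_LINE.match(text)
    if m is None:
        return None
    return LogEntry(m["level"], int(m["thread"]), m["source"], int(m["line"]), m["message"])


def first_error(lines: Iterable[str]) -> Optional[LogEntry]:
    """Return the first entry logged at error (E) or fatal (F) severity, if any."""
    for text in lines:
        entry = parse_line(text)
        if entry is not None and entry.level in ("E", "F"):
            return entry
    return None
```

Lines without the prefix, such as the bare `Segmentation fault` line, are skipped by `parse_line` rather than treated as errors, so they would need separate handling.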
03:00 PM
Then on the second reboot the node froze on:
I20230227 14:21:17.644379     1 http_server.cpp:177] Typesense has started listening on port 8108
I20230227 14:21:17.644613   102 batched_indexer.cpp:120] Starting batch indexer with 16 threads.
I20230227 14:21:17.645495   102 batched_indexer.cpp:126] BatchedIndexer skip_index: -9999

And it never gets "healthy".
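
One way to watch for a node becoming healthy is to poll Typesense's `GET /health` endpoint, which responds with `{"ok": true}` once the node is ready. The sketch below is only an illustration: the helper names `fetch_health` and `wait_until_healthy`, the retry counts, and the host in the usage note are assumptions. Only port 8108 comes from the log above.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(base_url: str, timeout: float = 2.0) -> bool:
    """Return True only if GET {base_url}/health answers with {"ok": true}."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            body = json.load(resp)
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or a non-JSON body.
        return False
    return isinstance(body, dict) and body.get("ok") is True


def wait_until_healthy(
    check: Callable[[], bool],
    attempts: int = 30,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() up to `attempts` times, pausing `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Usage would look like `wait_until_healthy(lambda: fetch_health("http://localhost:8108"))`, where `localhost` is a placeholder for the node's address. A `False` result after all attempts corresponds to the stuck state described here.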
Kishore Nallan
04:07 PM
Have you tried just deleting the data directory and restarting it? The node will be able to get the snapshot from the leader and resume.
Kishore Nallan
04:07 PM
I suspect data corruption on disk. It's rare, but it can happen.
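
A rough Python sketch of the "clear the data directory" step follows. Note that it deliberately swaps in a more cautious variation: it renames the directory instead of deleting it, so the suspect files stay available for inspection. The function name `set_aside_data_dir` and the suffix are hypothetical. Stopping and restarting the node is left out because the thread does not say how the nodes are run, and this must only be done while the node is stopped.

```python
from pathlib import Path


def set_aside_data_dir(data_dir: Path, suffix: str) -> Path:
    """Rename data_dir to '<name>.<suffix>' and recreate it empty.

    Assumptions: the Typesense process on this node is stopped, and the other
    nodes are healthy so the restarted node can fetch a snapshot from the leader.
    Ownership and permissions of the new directory may need to be restored by hand.
    """
    data_dir = Path(data_dir)
    backup = data_dir.with_name(f"{data_dir.name}.{suffix}")
    if backup.exists():
        raise FileExistsError(f"backup location already exists: {backup}")
    data_dir.rename(backup)  # keep the possibly corrupted files for later inspection
    data_dir.mkdir()         # empty directory for the node to repopulate on restart
    return backup
```

Once the node has caught up and reports healthy, the set-aside copy can be deleted to reclaim disk space.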
Sergio
10:00 PM
That's what we actually ended up doing. Super weird.