#community-help

Node Boot Errors in Typesense Cluster

TLDR Sergio experienced boot errors on a Typesense cluster node. Kishore Nallan suggested deleting the data directory and restarting, which resolved the issue.

Feb 27, 2023 (7 months ago)
Sergio
02:58 PM
Hey :typesense: Team, this morning we had a node that hit errors while booting, and I was looking for some ideas.
We have a cluster of 3 nodes. While starting the third node (the other 2 nodes were healthy), we got the following errors.
Any thoughts?
Sergio
02:59 PM
We were rolling out OS updates on the hosts.

On the reboot of the Typesense node we got this error:

I20230227 13:38:14.356796    94 store.h:307] DB open success!
I20230227 13:38:14.356813    94 raft_server.cpp:479] Loading collections from disk...
I20230227 13:38:14.356819    94 collection_manager.cpp:132] CollectionManager::load()
I20230227 13:38:14.356945    94 auth_manager.cpp:32] Indexing 9 API key(s) found on disk.
I20230227 13:38:14.357082    94 collection_manager.cpp:152] Loading upto 8 collections in parallel, 1000 documents at a time.
I20230227 13:38:14.357206    94 collection_manager.cpp:159] Found 2 collection(s) on disk.
I20230227 13:38:14.362864   131 collection_manager.cpp:83] Found collection products-8 with 4 memory shards.
I20230227 13:38:14.365340   130 collection_manager.cpp:83] Found collection products-7 with 4 memory shards.
I20230227 13:38:14.368235   130 collection_manager.cpp:1114] Loading collection products-7
I20230227 13:38:14.371126   131 collection_manager.cpp:1114] Loading collection products-8
E20230227 13:38:14.691952    85 backward.hpp:4199] Stack trace (most recent call last) in thread 85:
E20230227 13:38:14.701339    85 backward.hpp:4199] #5    Object "", at 0xffffffffffffffff, in
E20230227 13:38:14.702204    85 backward.hpp:4199] #4    Object "/lib/x86_64-linux-gnu/libc-2.23.so", at 0x7f84dcbb251c, in __clone
E20230227 13:38:14.702880    85 backward.hpp:4199] #3    Object "/lib/x86_64-linux-gnu/libpthread-2.23.so", at 0x7f84dd5916b9, in start_thread
E20230227 13:38:14.703495    85 backward.hpp:4199] #2    Source "../../../../../libstdc++-v3/src/c++11/thread.cc", line 80, in execute_native_thread_routine [0x14ae6cf]
E20230227 13:38:14.704129    85 backward.hpp:4199] #1    Source "/typesense/src/batched_indexer.cpp", line 123, in run [0x50cd91]
E20230227 13:38:14.704944    85 backward.hpp:4199] #0    Source "/typesense/include/store.h", line 169, in scan [0x512925]
Segmentation fault (Address not mapped to object [(nil)])
E20230227 13:38:15.616197    85 typesense_server.cpp:95] Typesense 0.23.1 is terminating abruptly.
Sergio
03:00 PM
Then on the second reboot the node froze at:
I20230227 14:21:17.644379     1 http_server.cpp:177] Typesense has started listening on port 8108
I20230227 14:21:17.644613   102 batched_indexer.cpp:120] Starting batch indexer with 16 threads.
I20230227 14:21:17.645495   102 batched_indexer.cpp:126] BatchedIndexer skip_index: -9999

And it never became "healthy".
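
To make "never gets healthy" something you can measure, the node's `GET /health` endpoint can be polled; Typesense answers with `{"ok": true}` once the node is ready. The following is only a minimal sketch: the base URL (port 8108 is taken from the log above), the retry counts, and the function names are assumptions for illustration, not anything the thread prescribes.

```python
import json
import time
import urllib.error
import urllib.request


def is_healthy(base_url, timeout=2.0, opener=urllib.request.urlopen):
    """Return True only if GET {base_url}/health answers with {"ok": true}."""
    try:
        with opener(f"{base_url}/health", timeout=timeout) as resp:
            body = json.loads(resp.read())
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, non-2xx status, or a non-JSON body.
        return False
    return isinstance(body, dict) and body.get("ok") is True


def wait_until_healthy(base_url, attempts=30, delay=2.0,
                       check=is_healthy, sleep=time.sleep):
    """Poll the health check up to `attempts` times, pausing `delay` seconds."""
    for attempt in range(attempts):
        if check(base_url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Hypothetical usage against a local node:
# wait_until_healthy("http://localhost:8108", attempts=30, delay=2.0)
```

A node that still returns `False` after a generous number of attempts is in the state described here: listening on the port but never rejoining the cluster.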
Kishore Nallan
04:07 PM
Have you tried just deleting the data directory and restarting it? The node will be able to get the snapshot from the leader and resume.
Kishore Nallan
04:07 PM
I suspect data corruption on disk, which is rare but can happen.
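
The reset suggested above could be sketched roughly as follows. This assumes the Typesense process on the affected node is already stopped and that the other nodes are healthy, so the leader can supply a snapshot. The function name and the `keep_backup` option are hypothetical additions: moving the old directory aside instead of deleting it is a precaution that is not part of the original advice.

```python
import shutil
import time
from pathlib import Path


def reset_data_dir(data_dir, keep_backup=True):
    """Empty a stopped node's data directory so it can resync from the leader.

    Returns the backup path, or None when the old data was deleted outright.
    """
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        raise FileNotFoundError(f"data directory not found: {data_dir}")

    backup = None
    if keep_backup:
        # Assumption: keep the suspect data next to the original for inspection.
        backup = data_dir.with_name(f"{data_dir.name}.corrupt-{int(time.time())}")
        data_dir.rename(backup)
    else:
        shutil.rmtree(data_dir)

    # Recreate an empty directory for the node to start from.
    data_dir.mkdir()
    return backup
```

After this, the node is restarted however it is normally run (systemd, Docker, and so on), and the recreated directory needs the same ownership and permissions the server expects.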
Sergio
10:00 PM
That's what we actually ended up doing. Super weird.