#community-help

Issues with Typesense and k8s Snapshot Restoration

TL;DR: Arnob experienced data loss and snapshot-restore errors with Typesense on k8s. Kishore Nallan explained the corruption was likely caused by premature pod termination, and suggested deleting the data directory on the malfunctioning pod so it is automatically restored from the leader.

Solved
Aug 06, 2023 (4 months ago)
Arnob
08:08 AM
Hello,

I'm using Typesense in k8s. When I increased the k8s resources, it deleted all the collections and also did not restore the snapshot.

Here's the error log:
E20230806 07:06:25.193410     1 store.h:68] Error while initializing store: Corruption: file is too short (1158 bytes) to be an sstable/data/db/000320.sst in file /data/db/MANIFEST-000095
E20230806 07:06:25.288295   128 store.h:68] Error while initializing store: Corruption: file is too short (1158 bytes) to be an sstable/data/db/000320.sst in file /data/db/MANIFEST-000095
E20230806 07:06:25.289043   128 raft_server.h:279] Met peering error {type=StateMachineError, error_code=-1, error_text=`StateMachine on_snapshot_load failed'}
E20230806 07:06:25.289099   115 snapshot_executor.cpp:393] Fail to load snapshot from local:///data/state/snapshot
E20230806 07:06:25.289158   115 node.cpp:557] node default_group:10.4.11.74:8107:8108 init_snapshot_storage failed
E20230806 07:06:25.289209   115 raft_server.cpp:126] Fail to init peering node
E20230806 07:06:25.289883   115 typesense_server_utils.cpp:276] Failed to start peering state
E20230806 07:12:41.508708     1 auth_manager.cpp:263] Scoped API keys can only be used for searches.
E20230806 07:12:41.919620     1 auth_manager.cpp:263] Scoped API keys can only be used for searches.
E20230806 07:12:44.110736     1 auth_manager.cpp:263] Scoped API keys can only be used for searches.
I20230806 07:06:25.151355     1 typesense_server_utils.cpp:331] Starting Typesense 0.25.0.rc45
I20230806 07:06:25.151386     1 typesense_server_utils.cpp:334] Typesense is using jemalloc.
I20230806 07:06:25.151721     1 typesense_server_utils.cpp:384] Thread pool size: 16
I20230806 07:06:25.166436     1 store.h:64] Initializing DB by opening state dir: /data/db
I20230806 07:06:25.193804     1 store.h:64] Initializing DB by opening state dir: /data/meta
I20230806 07:06:25.228266     1 ratelimit_manager.cpp:546] Loaded 0 rate limit rules.
I20230806 07:06:25.228302     1 ratelimit_manager.cpp:547] Loaded 0 rate limit bans.
I20230806 07:06:25.228466     1 typesense_server_utils.cpp:495] Starting API service...
I20230806 07:06:25.228753   115 typesense_server_utils.cpp:232] Since no --nodes argument is provided, starting a single node Typesense cluster.
I20230806 07:06:25.229040     1 http_server.cpp:178] Typesense has started listening on port 8108
I20230806 07:06:25.229212   116 batched_indexer.cpp:124] Starting batch indexer with 16 threads.
I20230806 07:06:25.248844   115 server.cpp:1107] Server[braft::RaftStatImpl+braft::FileServiceImpl+braft::RaftServiceImpl+braft::CliServiceImpl] is serving on port=8107.
I20230806 07:06:25.248876   115 server.cpp:1110] Check out  in web browser.
I20230806 07:06:25.249287   115 raft_server.cpp:68] Nodes configuration: 10.4.11.74:8107:8108
I20230806 07:06:25.256830   115 log.cpp:690] Use murmurhash32 as the checksum type of appending entries
I20230806 07:06:25.257787   116 batched_indexer.cpp:129] BatchedIndexer skip_index: -9999
I20230806 07:06:25.263262   115 log.cpp:1172] log load_meta /data/state/log/log_meta first_log_index: 1 time: 6369
I20230806 07:06:25.263456   115 log.cpp:1112] load open segment, path: /data/state/log first_index: 1
I20230806 07:06:25.281968   128 raft_server.cpp:529] on_snapshot_load
I20230806 07:06:25.282536   128 store.h:299] rm /data/db success
I20230806 07:06:25.282799   128 store.h:309] copy snapshot /data/state/snapshot/snapshot_00000000000000000173/db_snapshot to /data/db success
I20230806 07:06:25.282831   128 store.h:64] Initializing DB by opening state dir: /data/db
W20230806 07:06:25.288667   128 store.h:319] Open DB /data/db failed, msg: Corruption: file is too short (1158 bytes) to be an sstable/data/db/000320.sst in file /data/db/MANIFEST-000095
I20230806 07:06:25.288950   128 snapshot_executor.cpp:264] node default_group:10.4.11.74:8107:8108 snapshot_load_done, last_included_index: 173 last_included_term: 16 peers: "10.4.8.106:8107:8108"
I20230806 07:06:25.289275   115 node.cpp:961] node default_group:10.4.11.74:8107:8108 shutdown, current_term 0 state UNINITIALIZED
W20230806 07:06:25.289113   128 node.cpp:1311] node default_group:10.4.11.74:8107:8108 got error={type=StateMachineError, error_code=-1, error_text=`StateMachine on_snapshot_load failed'}
I20230806 07:06:25.289458   128 raft_server.h:275] This node is down
I20230806 07:07:26.264640   116 batched_indexer.cpp:284] Running GC for aborted requests, req map size: 0
I20230806 07:08:27.271401   116 batched_indexer.cpp:284] Running GC for aborted requests, req map size: 0
I20230806 07:09:28.278858   116 batched_indexer.cpp:284] Running GC for aborted requests, req map size: 0
I20230806 07:10:29.286155   116 batched_indexer.cpp:284] Running GC for aborted requests, req map size: 0
I20230806 07:11:30.293216   116 batched_indexer.cpp:284] Running GC for aborted requests, req map size: 0
I20230806 07:12:31.301210   116 batched_indexer.cpp:284] Running GC for aborted requests, req map size: 0
I20230806 07:13:32.308427   116 batched_indexer.cpp:284] Running GC for aborted requests, req map size: 0
I20230806 07:14:33.315384   116 batched_indexer.cpp:284] Running GC for aborted requests, req map size: 0
I20230806 07:15:34.322649   116 batched_indexer.cpp:284] Running GC for aborted requests, req map size: 0
I20230806 07:16:35.329157   116 batched_indexer.cpp:284] Running GC for aborted requests, req map size: 0

Attn: Kishore Nallan, Sai
Kishore Nallan
02:09 PM
Looks like there has been a corruption of data on disk. Since I don't have much familiarity with k8s, I won't be able to help much with k8s issues.
Sai
03:18 PM
Any particular scenario where data gets corrupted?
Kishore Nallan
03:19 PM
I've never seen a single instance of corruption on any of the clusters we run on TS cloud. This is likely because of pods being prematurely killed etc.
Sai
03:29 PM
Like getting killed during a snapshot?
Kishore Nallan
03:29 PM
Yes
Sai
03:50 PM
Is there a curl command to restore a snapshot? That would solve the problem in Kubernetes.
Kishore Nallan
04:13 PM
A snapshot is just a directory that gets produced. If you restart Typesense from that directory, it will start. This is handled automatically in clustering: just delete the data directory on the malfunctioning pod and it will fetch and restore the data from the leader.
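The delete-and-resync recovery described above can be sketched for a Kubernetes setup. The namespace, pod name, and data path below are illustrative assumptions, not from the thread; the destructive commands are left commented out:

```shell
# Sketch of the leader-resync recovery Kishore describes.
# Namespace, pod name, and data path are illustrative assumptions.
NAMESPACE="search"
POD="typesense-2"                         # the pod whose store is corrupted
DATA_DIR="/usr/share/typesense/data"      # wherever the PVC is mounted

# Wipe the corrupted on-disk state, then delete the pod so the
# StatefulSet restarts it; on boot it rejoins the cluster empty and
# pulls a fresh snapshot from the current leader:
# kubectl -n "$NAMESPACE" exec "$POD" -- sh -c "rm -rf $DATA_DIR/*"
# kubectl -n "$NAMESPACE" delete pod "$POD"
echo "would reset $POD in namespace $NAMESPACE"
```

Only the malfunctioning pod's data directory should be cleared; the leader's copy is what the node re-syncs from.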
Aug 08, 2023 (4 months ago)
Arnob
09:29 AM
So is there any good way of taking a snapshot and restoring it in k8s?
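For the snapshot-taking half of this question, Typesense does expose an on-demand snapshot endpoint, `POST /operations/snapshot`, which writes a point-in-time backup to a path on the server. A minimal sketch, where the host, API key, and target path are placeholder assumptions (the curl call is commented out since it needs a running server):

```shell
# Trigger an on-demand snapshot via the Typesense operations API.
# Host, API key, and target path are placeholder assumptions.
TYPESENSE_HOST="http://localhost:8108"
TYPESENSE_API_KEY="xyz"
SNAPSHOT_PATH="/tmp/typesense-data-snapshot"

# curl -X POST \
#   "$TYPESENSE_HOST/operations/snapshot?snapshot_path=$SNAPSHOT_PATH" \
#   -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY"
echo "snapshot would be written to $SNAPSHOT_PATH"
```

The resulting directory can be archived off the pod (e.g. to object storage), and later used as the data directory when starting a fresh Typesense node to restore from it.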
