Hello I have an issue using Typesense in productio...
# community-help
a
Hello, I have an issue using Typesense in production. I don't know why, but Typesense seems to periodically fail during snapshotting with error_code=5 (this has happened 3 times in the last month). Since I don't run Typesense as a cluster, the service appears to be down after such a snapshot. (I am using Docker and docker-compose to run it on an AWS instance.) Do you have any idea where that could come from? (Do you want me to describe this issue extensively in a GitHub issue?)
k
👋 Can you please tell me what version you are using?
Also, did you verify that the container did not hit memory limits? By default, Docker containers can have strict memory limits.
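For reference, a minimal sketch of how a memory cap can be declared for this kind of setup, assuming Compose v2 file syntax, a hypothetical service name `typesense`, and the port/data paths seen in the logs below; live usage can be checked on the host with `docker stats`:

```yaml
# docker-compose.yml -- hypothetical sketch, Compose file format 2.4
version: "2.4"
services:
  typesense:
    image: typesense/typesense:0.21.0
    command: "--data-dir /data --api-key=xyz --api-port 443"
    mem_limit: 2g          # hard cap; the kernel OOM-kills the process if it is exceeded
    ports:
      - "443:443"
    volumes:
      - ./typesense-data:/data
# On the host, `docker stats` shows live memory usage per container, and
# `docker inspect <container>` shows the configured limit (Memory=0 means no limit).
```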
a
I use 0.21.0. I actually did not!
k
Happy to troubleshoot further if you can rule out the memory part.
a
Sorry, I forgot to mention that I have a pretty small dataset (20K elements). How do I check the memory usage of my Docker container? (I have actually never done it.)
By the way, here are the logs of the two types of errors I received:
W20211017 09:51:23.784459    86 store.h:329] Checkpoint CreateCheckpoint failed at snapshot path: /data/state/snapshot/temp/db_snapshot, msg:IO error: While opening a file for sequentially reading: /data/db/MANIFEST-000007: No such file or directory
E20211017 09:51:23.784471    86 raft_server.cpp:476] Failure during checkpoint creation, msg:IO error: While opening a file for sequentially reading: /data/db/MANIFEST-000007: No such file or directory
I20211017 09:51:23.784518    86 raft_server.cpp:412] save_snapshot called
I20211017 09:51:23.784536    86 snapshot.cpp:516] Deleting /data/state/snapshot/temp
W20211017 09:51:23.784584    86 snapshot_executor.cpp:220] node default_group:172.20.0.2:8107:443 fail to close writer
E20211017 09:51:23.787825    84 raft_server.h:249] Met peering error {type=SnapshotError, error_code=5, error_text=`Fail to save snapshot'}
W20211017 09:51:23.789753    84 node.cpp:1264] node default_group:172.20.0.2:8107:443 got error={type=SnapshotError, error_code=5, error_text=`Fail to save snapshot'}
I20211017 09:51:23.789777    84 replicator.cpp:1475] Group default_group Fail to find the next candidate
I20211017 09:51:23.789790    84 raft_server.h:241] Node stepped down : Raft node(leader or candidate) is in error.
E20211017 09:51:23.794023     1 raft_server.cpp:242] Rejecting write: could not find a leader.
I20211017 09:51:23.794960    86 raft_server.cpp:717] Dummy write to https://172.20.0.2:443/health, status = 500, response = {"message": "Could not find a leader."}
I20211017 09:51:23.794975    86 raft_server.cpp:461] save_snapshot done
====
I20211018 09:56:43.576392    91 raft_server.h:58] Peer refresh succeeded!
I20211018 09:56:53.577672    81 raft_server.cpp:565] Term: 2, last_index index: 543, committed_index: 543, known_applied_index: 541, applying_index: 0, pending_index: 0, disk_index: 542, pending_queue_size: 0, local_sequence: 9473
I20211018 09:56:53.577786    88 raft_server.h:58] Peer refresh succeeded!
I20211018 09:57:03.517000    86 node.cpp:911] node default_group:172.20.0.2:8107:443 starts to do snapshot
I20211018 09:57:03.517168    86 raft_server.cpp:468] on_snapshot_save
I20211018 09:57:03.521546    86 raft_server.cpp:412] save_snapshot called
I20211018 09:57:03.527626    86 snapshot.cpp:638] Deleting /data/state/snapshot/snapshot_00000000000000000543
I20211018 09:57:03.527647    86 snapshot.cpp:644] Renaming /data/state/snapshot/temp to /data/state/snapshot/snapshot_00000000000000000543
I20211018 09:57:03.527663    86 snapshot.cpp:516] Deleting /data/state/snapshot/snapshot_00000000000000000541
I20211018 09:57:03.527858    86 snapshot_executor.cpp:234] node default_group:172.20.0.2:8107:443 snapshot_save_done, last_included_index=543 last_included_term=2
F20211018 09:57:03.527879    86 configuration_manager.cpp:48] Check failed: entry.id >= _snapshot.id ((index=543,term=2) vs. (index=541,term=3)) 
E20211018 09:57:04.462241    86 backward.hpp:4203] Stack trace (most recent call last) in thread 86:
E20211018 09:57:04.462963    86 backward.hpp:4203] #12   Object "/opt/typesense-server", at 0xe6f480, in bthread_make_fcontext
E20211018 09:57:04.462975    86 backward.hpp:4203] #11   Object "/opt/typesense-server", at 0xd1432e, in bthread::TaskGroup::task_runner(long)
E20211018 09:57:04.462987    86 backward.hpp:4203] #10   Source "/opt/braft-c649789133566dc06e39ebd0c69a824f8e98993a/src/braft/snapshot_executor.cpp", line 312, in continue_run [0xc863a6]
E20211018 09:57:04.462994    86 backward.hpp:4203] #9    Source "/opt/braft-c649789133566dc06e39ebd0c69a824f8e98993a/src/braft/snapshot_executor.cpp", line 235, in on_snapshot_save_done [0xc838f8]
E20211018 09:57:04.463001    86 backward.hpp:4203] #8    Source "/opt/braft-c649789133566dc06e39ebd0c69a824f8e98993a/src/braft/log_manager.cpp", line 642, in set_snapshot [0xcbc5ba]
E20211018 09:57:04.463007    86 backward.hpp:4203] #7    Source "/opt/braft-c649789133566dc06e39ebd0c69a824f8e98993a/src/braft/configuration_manager.cpp", line 48, in set_snapshot [0xcd1271]
E20211018 09:57:04.463013    86 backward.hpp:4203] #6    Object "/opt/typesense-server", at 0x1456a61, in google::LogMessageFatal::~LogMessageFatal()
E20211018 09:57:04.463021    86 backward.hpp:4203] #5    Object "/opt/typesense-server", at 0x1453587, in google::LogMessage::Flush()
E20211018 09:57:04.463027    86 backward.hpp:4203] #4    Object "/opt/typesense-server", at 0x1453c45, in google::LogMessage::SendToLog()
E20211018 09:57:04.463033    86 backward.hpp:4203] #3    Object "/opt/typesense-server", at 0x1453ce7, in google::LogMessage::Fail()
E20211018 09:57:04.463039    86 backward.hpp:4203] #2    Object "/opt/typesense-server", at 0x145af7f, in google::DumpStackTraceAndExit()
E20211018 09:57:04.463050    86 backward.hpp:4203] #1    Object "/lib/x86_64-linux-gnu/libc-2.23.so", at 0x7f4199a2d039, in abort
E20211018 09:57:04.463057    86 backward.hpp:4203] #0    Object "/lib/x86_64-linux-gnu/libc-2.23.so", at 0x7f4199a2b438, in raise
k
Oh, this is a frustrating glog (Google logging) issue. I have a fix for this in the upcoming release.
The crash must have been because of this.
a
The second one, yes. For the first one, there is an error opening the file to do the snapshot:
msg:IO error: While opening a file for sequentially reading: /data/db/MANIFEST-000007: No such file or directory
E20211017 09:51:23.784471    86 raft_server.cpp:476] Failure during checkpoint creation, msg:IO error: While opening a file for
k
Hmm I have never seen that error before.
Seems like an I/O error. Not much information there apart from that.
a
Yep, I know. My issue is that it puts the node into an error state, so it stops responding, even though this error doesn't mean my service should go down.
Should I create 2 new issues on GitHub so you can track them directly over there?
k
Yes, please. The first one involving glog should already be fixed in the next version. As for the second one, it will be good to track; maybe it can be a reference for others to chime in with other data points. Btw, I think you can set your Docker container to restart automatically on failure.
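For example, a minimal docker-compose sketch of such a restart policy, assuming Compose v2 syntax and a hypothetical service name `typesense`; note that `on-failure` only fires when the container's main process actually exits with a non-zero code:

```yaml
# Hypothetical sketch of a restart policy in docker-compose
services:
  typesense:
    image: typesense/typesense:0.21.0
    restart: on-failure   # Docker restarts the container only after its process exits non-zero
```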
a
So only the second one, or the first one too?
For that to work, the container has to stop, no? Because the service did not stop, it is just in an error state, serving error 500.
k
Got it. Maybe the service should just stop and be resurrected. Please create issues for both, mentioning this detail as well. ty
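Since in this case the process stays alive while /health returns 500, a restart policy alone would not kick in. A hedged sketch of a container healthcheck against the /health endpoint seen in the logs above, assuming curl is available inside the image and the API is served over HTTPS on port 443 as in those logs; Docker by itself only marks the container as unhealthy, so an external watcher or orchestrator would still have to act on that status:

```yaml
# Hypothetical sketch: flag the container as unhealthy when /health starts failing
services:
  typesense:
    image: typesense/typesense:0.21.0
    healthcheck:
      # /health returns HTTP 500 once the node is in the error state shown in the logs,
      # so `curl -f` exits non-zero and the check fails (-k skips TLS verification).
      test: ["CMD-SHELL", "curl -sfk https://localhost:443/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
    # Docker only reports the unhealthy status (visible in `docker ps`); restarting on it
    # requires an orchestrator or a separate watcher process.
```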
a
Yep, great, thank you!
Done: #412 and #413. Don't hesitate if you need anything else. Thanks!