Hello I have an issue using Typesense in productio...
# community-help
a
Hello, I have an issue using Typesense in production. I don't know why, but Typesense seems to periodically fail during snapshotting with error_code=5 (this has happened 3 times in the last month). Since I don't run Typesense as a cluster, the service appears to be down after such a snapshot. (I am using Docker and docker-compose to run it on an AWS instance.) Do you have any idea where that could come from? (Do you want me to describe this issue extensively in a GitHub issue?)
k
👋 Can you please tell me what version you are using?
Also, did you verify that the container did not hit memory limits? By default, Docker containers can have strict memory limits.
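For reference, a minimal sketch of how a memory cap can be declared for this kind of setup, assuming Compose v2 file syntax, a hypothetical service name `typesense`, and the port/data paths seen in the logs below; live usage can be checked on the host with `docker stats`:

```yaml
# docker-compose.yml -- hypothetical sketch, Compose file format 2.4
version: "2.4"
services:
  typesense:
    image: typesense/typesense:0.21.0
    command: "--data-dir /data --api-key=xyz --api-port 443"
    mem_limit: 2g          # hard cap; the kernel OOM-kills the process if it is exceeded
    ports:
      - "443:443"
    volumes:
      - ./typesense-data:/data
# On the host, `docker stats` shows live memory usage per container, and
# `docker inspect <container>` shows the configured limit (Memory=0 means no limit).
```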
a
I use 0.21.0. I actually did not!
k
Happy to troubleshoot further if you can rule out the memory part.
a
Sorry, I forgot to mention that I have a pretty small dataset (20K elements). How do I check the memory usage of my Docker container? (I have actually never done it.)
By the way, here are the logs of the two types of errors I received:
W20211017 09:51:23.784459    86 store.h:329] Checkpoint CreateCheckpoint failed at snapshot path: /data/state/snapshot/temp/db_snapshot, msg:IO error: While opening a file for sequentially reading: /data/db/MANIFEST-000007: No such file or directory
E20211017 09:51:23.784471    86 raft_server.cpp:476] Failure during checkpoint creation, msg:IO error: While opening a file for sequentially reading: /data/db/MANIFEST-000007: No such file or directory
I20211017 09:51:23.784518    86 raft_server.cpp:412] save_snapshot called
I20211017 09:51:23.784536    86 snapshot.cpp:516] Deleting /data/state/snapshot/temp
W20211017 09:51:23.784584    86 snapshot_executor.cpp:220] node default_group:172.20.0.2:8107:443 fail to close writer
E20211017 09:51:23.787825    84 raft_server.h:249] Met peering error {type=SnapshotError, error_code=5, error_text=`Fail to save snapshot'}
W20211017 09:51:23.789753    84 node.cpp:1264] node default_group:172.20.0.2:8107:443 got error={type=SnapshotError, error_code=5, error_text=`Fail to save snapshot'}
I20211017 09:51:23.789777    84 replicator.cpp:1475] Group default_group Fail to find the next candidate
I20211017 09:51:23.789790    84 raft_server.h:241] Node stepped down : Raft node(leader or candidate) is in error.
E20211017 09:51:23.794023     1 raft_server.cpp:242] Rejecting write: could not find a leader.
I20211017 09:51:23.794960    86 raft_server.cpp:717] Dummy write to https://172.20.0.2:443/health, status = 500, response = {"message": "Could not find a leader."}
I20211017 09:51:23.794975    86 raft_server.cpp:461] save_snapshot done
====
I20211018 09:56:43.576392    91 raft_server.h:58] Peer refresh succeeded!
I20211018 09:56:53.577672    81 raft_server.cpp:565] Term: 2, last_index index: 543, committed_index: 543, known_applied_index: 541, applying_index: 0, pending_index: 0, disk_index: 542, pending_queue_size: 0, local_sequence: 9473
I20211018 09:56:53.577786    88 raft_server.h:58] Peer refresh succeeded!
I20211018 09:57:03.517000    86 node.cpp:911] node default_group:172.20.0.2:8107:443 starts to do snapshot
I20211018 09:57:03.517168    86 raft_server.cpp:468] on_snapshot_save
I20211018 09:57:03.521546    86 raft_server.cpp:412] save_snapshot called
I20211018 09:57:03.527626    86 snapshot.cpp:638] Deleting /data/state/snapshot/snapshot_00000000000000000543
I20211018 09:57:03.527647    86 snapshot.cpp:644] Renaming /data/state/snapshot/temp to /data/state/snapshot/snapshot_00000000000000000543
I20211018 09:57:03.527663    86 snapshot.cpp:516] Deleting /data/state/snapshot/snapshot_00000000000000000541
I20211018 09:57:03.527858    86 snapshot_executor.cpp:234] node default_group:172.20.0.2:8107:443 snapshot_save_done, last_included_index=543 last_included_term=2
F20211018 09:57:03.527879    86 configuration_manager.cpp:48] Check failed: entry.id >= _snapshot.id ((index=543,term=2) vs. (index=541,term=3)) 
E20211018 09:57:04.462241    86 backward.hpp:4203] Stack trace (most recent call last) in thread 86:
E20211018 09:57:04.462963    86 backward.hpp:4203] #12   Object "/opt/typesense-server", at 0xe6f480, in bthread_make_fcontext
E20211018 09:57:04.462975    86 backward.hpp:4203] #11   Object "/opt/typesense-server", at 0xd1432e, in bthread::TaskGroup::task_runner(long)
E20211018 09:57:04.462987    86 backward.hpp:4203] #10   Source "/opt/braft-c649789133566dc06e39ebd0c69a824f8e98993a/src/braft/snapshot_executor.cpp", line 312, in continue_run [0xc863a6]
E20211018 09:57:04.462994    86 backward.hpp:4203] #9    Source "/opt/braft-c649789133566dc06e39ebd0c69a824f8e98993a/src/braft/snapshot_executor.cpp", line 235, in on_snapshot_save_done [0xc838f8]
E20211018 09:57:04.463001    86 backward.hpp:4203] #8    Source "/opt/braft-c649789133566dc06e39ebd0c69a824f8e98993a/src/braft/log_manager.cpp", line 642, in set_snapshot [0xcbc5ba]
E20211018 09:57:04.463007    86 backward.hpp:4203] #7    Source "/opt/braft-c649789133566dc06e39ebd0c69a824f8e98993a/src/braft/configuration_manager.cpp", line 48, in set_snapshot [0xcd1271]
E20211018 09:57:04.463013    86 backward.hpp:4203] #6    Object "/opt/typesense-server", at 0x1456a61, in google::LogMessageFatal::~LogMessageFatal()
E20211018 09:57:04.463021    86 backward.hpp:4203] #5    Object "/opt/typesense-server", at 0x1453587, in google::LogMessage::Flush()
E20211018 09:57:04.463027    86 backward.hpp:4203] #4    Object "/opt/typesense-server", at 0x1453c45, in google::LogMessage::SendToLog()
E20211018 09:57:04.463033    86 backward.hpp:4203] #3    Object "/opt/typesense-server", at 0x1453ce7, in google::LogMessage::Fail()
E20211018 09:57:04.463039    86 backward.hpp:4203] #2    Object "/opt/typesense-server", at 0x145af7f, in google::DumpStackTraceAndExit()
E20211018 09:57:04.463050    86 backward.hpp:4203] #1    Object "/lib/x86_64-linux-gnu/libc-2.23.so", at 0x7f4199a2d039, in abort
E20211018 09:57:04.463057    86 backward.hpp:4203] #0    Object "/lib/x86_64-linux-gnu/libc-2.23.so", at 0x7f4199a2b438, in raise
k
Oh, this is a frustrating glog (Google logging) issue. I have a fix for this in the upcoming release.
The crash must have been because of this.
a
The second one, yes. For the first one, there is an error opening the file to do the snapshot:
msg:IO error: While opening a file for sequentially reading: /data/db/MANIFEST-000007: No such file or directory
E20211017 09:51:23.784471    86 raft_server.cpp:476] Failure during checkpoint creation, msg:IO error: While opening a file for
k
Hmm I have never seen that error before.
Seems like an I/O error. Not much information there apart from that.
a
Yep, I know. My issue is that it puts the node into an error state, so it stops responding, even though this error doesn't mean my service should go down.
Should I create 2 new issues on GitHub so you can track them directly over there?
k
Yes, please. The first one involving glog should already be fixed in the next version. As for the second one, it will be good to track; maybe it can be a reference for others to chime in with other data points. Btw, I think you can set your Docker container to restart automatically on failure.
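For example, a minimal docker-compose sketch of such a restart policy, assuming Compose v2 syntax and a hypothetical service name `typesense`; note that `on-failure` only fires when the container's main process actually exits with a non-zero code:

```yaml
# Hypothetical sketch of a restart policy in docker-compose
services:
  typesense:
    image: typesense/typesense:0.21.0
    restart: on-failure   # Docker restarts the container only after its process exits non-zero
```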
a
So only the second one, or the first one too?
For that to work, the container has to stop, no? Because the service did not stop, it is just in an error state, serving error 500.
k
Got it. Maybe the service should just stop and be resurrected. Please create issues for both, mentioning this detail as well. ty
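Since in this case the process stays alive while /health returns 500, a restart policy alone would not kick in. A hedged sketch of a container healthcheck against the /health endpoint seen in the logs above, assuming curl is available inside the image and the API is served over HTTPS on port 443 as in those logs; Docker by itself only marks the container as unhealthy, so an external watcher or orchestrator would still have to act on that status:

```yaml
# Hypothetical sketch: flag the container as unhealthy when /health starts failing
services:
  typesense:
    image: typesense/typesense:0.21.0
    healthcheck:
      # /health returns HTTP 500 once the node is in the error state shown in the logs,
      # so `curl -f` exits non-zero and the check fails (-k skips TLS verification).
      test: ["CMD-SHELL", "curl -sfk https://localhost:443/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
    # Docker only reports the unhealthy status (visible in `docker ps`); restarting on it
    # requires an orchestrator or a separate watcher process.
```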
a
Yep, great, thank you!
Done: #412 and #413. Don't hesitate if you need anything else. Thanks!