# community-help
w
Hey there everyone! I had a quick question: we have a 3-node Typesense cluster and we've had issues with the nodes not always staying in sync. Sometimes one node in the cluster ends up with, say, half the records per collection that the other two nodes have. To my understanding Typesense should handle this discrepancy on its own; right now, terminating the node with the messed-up collections and letting the cluster scale back to 3 nodes fixes the issue. But this is pretty concerning, given that our API would return different results depending on which node the load balancer routes the query to. Is there a way to force a resync after we do a full reindex of our cluster? Any help would be appreciated, thanks!
k
Are you using Kubernetes?
w
Hey there, no, we're using an EC2 Auto Scaling group in AWS.
Just some more info: the node that currently has fewer items in its collections is a follower node.
And you can see the result here when I make requests
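For reference, a minimal sketch of how the divergence can be spotted by querying each node directly rather than through the load balancer and comparing document counts; the node addresses and API key below are placeholders, not the real config:

```python
import typesense

# Placeholder node addresses (API port 8108, as in the logs) and admin key.
NODES = ["10.62.58.21", "10.62.58.76", "10.62.58.91"]
API_KEY = "xyz"

for host in NODES:
    client = typesense.Client({
        "nodes": [{"host": host, "port": "8108", "protocol": "http"}],
        "api_key": API_KEY,
        "connection_timeout_seconds": 5,
    })
    # GET /collections reports num_documents per collection as seen by that node.
    for coll in client.collections.retrieve():
        print(host, coll["name"], coll["num_documents"])
```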
k
Did you check the logs to see if there are any additional errors logged in the nodes with missing data? We have not seen any such issue on our cloud environment recently. So this could be a data related or environment specific issue.
w
I haven't seen any errors like that, but I can take another look. I'm not at work right now, so I'd have to check in the morning; would it be possible for someone to take a look at it then? Again, I appreciate any help.
k
We offer only limited guidance for self-hosting because there are so many unknowns across different deployment setups, so it's not possible for us to dig deep.
w
Okay, I just went through the logs quickly, and here are some things that might be relevant: 1. queued writes > read lag, 2. queued writes > write lag, 3. running GC for aborted requests, 4. some unlinking after a snapshot, though I think that last one may just be normal behaviour.
Was able to access off network 😏😂
anywho, do any of these messages point to the phenomenon occurring?
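For reference, a rough sketch of polling each node's /health endpoint to catch this state, assuming the node reports itself unhealthy once queued writes exceed the healthy read/write lag thresholds (the node addresses are placeholders):

```python
import requests

# Placeholder node addresses; 8108 is the API port from the log output.
NODES = ["10.62.58.21", "10.62.58.76", "10.62.58.91"]

for host in NODES:
    # GET /health returns {"ok": true} on a healthy node; a lagging node is
    # assumed here to report not-ok once the lag thresholds are exceeded.
    resp = requests.get(f"http://{host}:8108/health", timeout=5)
    print(host, resp.status_code, resp.json())
```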
k
Looks like your machine is unable to catch up with the writes. Do you have incoming writes?
w
Yeah, we have real-time writes as well as full reindexes; when the issue does happen, it's usually after a reindex. I also see it when adding a new node to the cluster and having it sync with the existing nodes, so I've seen this in both cases.
I was wondering, how many records should we be ingesting per batch? We can have 20 or so parallel processes writing to the Typesense instance at a time.
k
Do you use import api?
w
yeah we use the typesense python client
k
Are you using import, or is each process sending individual writes?
The best way to improve throughput is to batch writes every few seconds and use the import API.
w
We use the import API like so, sending 40 items per batch per process.
k
That's fine. Maybe your instances are under-provisioned to handle the load.
You can send more items per batch, easily a thousand docs.
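A minimal sketch of that pattern with the Python client, buffering documents and flushing them in larger batches through the import API; the collection name, flush size, and interval are illustrative, not prescribed values:

```python
import time
import typesense

client = typesense.Client({
    "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
    "api_key": "xyz",  # placeholder admin key
    "connection_timeout_seconds": 10,
})

BATCH_SIZE = 1000      # illustrative; much larger than 40 docs per call
FLUSH_INTERVAL = 5.0   # seconds between flushes when the buffer is small

buffer, last_flush = [], time.time()

def flush():
    global buffer, last_flush
    if buffer:
        # Single import API call for the whole buffer; upsert keeps reindexes idempotent.
        client.collections["products"].documents.import_(buffer, {"action": "upsert"})
        buffer = []
    last_flush = time.time()

def add_document(doc):
    # Called by the producing process for each document; flushes on size or time.
    buffer.append(doc)
    if len(buffer) >= BATCH_SIZE or time.time() - last_flush >= FLUSH_INTERVAL:
        flush()
```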
w
What configuration should we be looking at? We're currently running 4 vCPUs and 16 GB of RAM per instance.
k
Try using more CPU
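One quick way to sanity-check whether CPU really is the bottleneck before resizing is the /metrics.json endpoint; a small sketch, where the host and API key are placeholders and the exact field names may vary by version:

```python
import requests

# Placeholder host and admin API key; /metrics.json requires the admin key.
HOST, API_KEY = "10.62.58.21", "xyz"

resp = requests.get(
    f"http://{HOST}:8108/metrics.json",
    headers={"X-TYPESENSE-API-KEY": API_KEY},
    timeout=5,
)

# Print only the CPU- and memory-related fields to see whether the node is saturated.
for key, value in resp.json().items():
    if "cpu" in key or "memory" in key:
        print(key, value)
```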
w
Yep, sounds good, I'll give that a try
W20240827 14:47:31.324640 4182 replicator.cpp:883] Group default_group Fail to install snapshot at peer=10.62.58.76:8107:8108, [E116][10.62.58.76:8107][E116]Loading a stale snapshot last_applied_index=5 last_applied_term=8 snapshot_index=46366 snapshot_term=7
W20240827 14:47:31.446879 4183 socket.cpp:1340] Fail to wait EPOLLOUT of fd=29: Connection timed out [110]
I20240827 14:47:31.825054 4183 replicator.cpp:834] node default_group:10.62.58.21:8107:8108 send InstallSnapshotRequest to 10.62.58.76:8107:8108 term 9 last_included_term 7 last_included_index 46366 uri <remote://10.62.58.21:8107/140140020985305903>
I20240827 14:47:31.961494 4183 node.cpp:754] node default_group:10.62.58.21:8107:8108 waits peer 10.62.58.76:8107:8108 to catch up
W20240827 14:47:33.344660 4183 replicator.cpp:297] Group default_group fail to issue RPC to 10.62.58.91:8107:8108 _consecutive_error_times=14061, [E112]Not connected to 10.62.58.91:8107 yet, server_id=281 [R1][E112]Not connected to 10.62.58.91:8107 yet, server_id=281 [R2][E112]Not connected to 10.62.58.91:8107 yet, server_id=281 [R3][E112]Not connected to 10.62.58.91:8107 yet, server_id=281
W20240827 14:47:34.947259 4183 socket.cpp:1340] Fail to wait EPOLLOUT of fd=29: Connection timed out [110]
W20240827 14:47:35.845861 4183 replicator.cpp:297] Group default_group fail to issue RPC to 10.62.58.91:8107:8108 _consecutive_error_times=14071, [E112]Not connected to 10.62.58.91:8107 yet, server_id=281 [R1][E112]Not connected to 10.62.58.91:8107 yet, server_id=281 [R2][E112]Not connected to 10.62.58.91:8107 yet, server_id=281 [R3][E112]Not connected to 10.62.58.91:8107 yet, server_id=281
E20240827 14:47:36.450232 4142 raft_server.cpp:772] 1791 queued writes > healthy read lag of 1000
E20240827 14:47:36.450322 4142 raft_server.cpp:784] 1791 queued writes > healthy write lag of 500
I20240827 14:47:36.961716 4183 node.cpp:754] node default_group:10.62.58.21:8107:8108 waits peer 10.62.58.76:8107:8108 to catch up
I20240827 14:47:36.984329 4186 replicator.cpp:881] received InstallSnapshotResponse from default_group:10.62.58.76:8107:8108 last_included_index 46366 last_included_term 7 error: [E116][10.62.58.76:8107][E116]Loading a stale snapshot last_applied_index=5 last_applied_term=8 snapshot_index=46366 snapshot_term=7
W20240827 14:47:36.984382 4186 replicator.cpp:883] Group default_group Fail to install snapshot at peer=10.62.58.76:8107:8108, [E116][10.62.58.76:8107][E116]Loading a stale snapshot last_applied_index=5 last_applied_term=8 snapshot_index=46366 snapshot_term=7
I20240827 14:47:37.450559 4142 raft_server.cpp:693] Term: 9, pending_queue: 0, last_index: 46519, committed: 46519, known_applied: 46519, applying: 0, pending_writes: 0, queued_writes: 1791, local_sequence: 501266
W20240827 14:47:37.450608 4142 node.cpp:843] [default_group:10.62.58.21:8107:8108 ] Refusing concurrent configuration changing
E20240827 14:47:37.450700 4184 raft_server.h:62] Peer refresh failed, error: Doing another configuration change
This may also be a related issue, where the queued writes seem to be stuck indefinitely.
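For what it's worth, a small sketch of how the queued_writes figure from the raft status line above could be tracked over time; the log path is a placeholder, and the regex only matches the status line format shown in the paste:

```python
import re

# Placeholder log path; the pattern matches the raft_server.cpp status line above,
# e.g. "... raft_server.cpp:693] Term: 9, ... queued_writes: 1791, ..."
LOG_PATH = "/var/log/typesense/typesense.log"
STATUS = re.compile(r"raft_server\.cpp:\d+\] Term: (\d+).*queued_writes: (\d+)")

with open(LOG_PATH) as f:
    for line in f:
        match = STATUS.search(line)
        if match:
            term, queued = match.groups()
            print(f"term={term} queued_writes={queued}")
```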
k
It says connection failure.
Unfortunately we don't have enough bandwidth to dedicate to debugging self-hosting issues in detail.
w
Yeah that makes sense, thanks for the help either way!