# community-help
w
Hey there everyone! I had a quick question: we have a 3-node Typesense cluster and we've had issues with the nodes not always staying in sync. Sometimes one node in the cluster ends up with, say, half the records per collection that the other two nodes have. To my understanding Typesense should handle this discrepancy on its own; right now, terminating the node with the messed-up collections and letting the cluster scale back to 3 nodes fixes the issue. But this is pretty concerning, given that our API would return different results depending on which node the load balancer routes the query to. Is there a way to force a resync after we do a full reindex of our cluster? Any help would be appreciated, thanks!
k
Are you using Kubernetes?
w
Hey there, no, we're using an EC2 Auto Scaling group in AWS.
Just some more info: the node that currently has fewer items in its collections is a follower node.
And you can see the result here when I make requests
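For reference, a minimal sketch of how the divergence can be spotted by querying each node directly rather than through the load balancer and comparing document counts; the node addresses and API key below are placeholders, not the real config:

```python
import typesense

# Placeholder node addresses (API port 8108, as in the logs) and admin key.
NODES = ["10.62.58.21", "10.62.58.76", "10.62.58.91"]
API_KEY = "xyz"

for host in NODES:
    client = typesense.Client({
        "nodes": [{"host": host, "port": "8108", "protocol": "http"}],
        "api_key": API_KEY,
        "connection_timeout_seconds": 5,
    })
    # GET /collections reports num_documents per collection as seen by that node.
    for coll in client.collections.retrieve():
        print(host, coll["name"], coll["num_documents"])
```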
k
Did you check the logs to see if there are any additional errors logged in the nodes with missing data? We have not seen any such issue on our cloud environment recently. So this could be a data related or environment specific issue.
w
I haven't seen any errors like that, but I can take another look. I'm not at work right now, so I'd have to check in the morning; would it be possible for someone to take a look at it then? Again, I appreciate any help.
k
We offer only limited guidance for self-hosting because there are so many unknowns across different deployment setups, so it's not possible for us to dig deep.
w
Okay, I just went through the logs quickly, and here are some things that might be relevant: 1. queued writes > read lag, 2. queued writes > write lag, 3. running GC for aborted requests, 4. some unlinking after a snapshot, though I think that last one may just be normal behaviour.
Was able to access off network 😏😂
anywho, do any of these messages point to the phenomenon occurring?
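For reference, a rough sketch of polling each node's /health endpoint to catch this state, assuming the node reports itself unhealthy once queued writes exceed the healthy read/write lag thresholds (the node addresses are placeholders):

```python
import requests

# Placeholder node addresses; 8108 is the API port from the log output.
NODES = ["10.62.58.21", "10.62.58.76", "10.62.58.91"]

for host in NODES:
    # GET /health returns {"ok": true} on a healthy node; a lagging node is
    # assumed here to report not-ok once the lag thresholds are exceeded.
    resp = requests.get(f"http://{host}:8108/health", timeout=5)
    print(host, resp.status_code, resp.json())
```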
k
Looks like your machine is unable to catch up with the writes. Do you have incoming writes?
w
Yeah, we have real-time writes as well as full reindexes; when the issue does happen, it's usually after a reindex. I also see it when adding a new node to the cluster and having it sync with the existing nodes, so I've seen this in both cases.
I was wondering, how many records should we be ingesting per batch? We can have 20 or so parallel processes writing to the Typesense instance at a time.
k
Do you use import api?
w
yeah we use the typesense python client
k
Are you using import, or is each process sending individual writes?
The best way to improve throughput is to batch writes every few seconds and use the import API.
w
We use the import API like so, sending 40 items per batch per process.
k
That's fine. Maybe your instances are under-provisioned to handle the load.
You can send more items per batch, easily a thousand docs.
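A minimal sketch of that pattern with the Python client, buffering documents and flushing them in larger batches through the import API; the collection name, flush size, and interval are illustrative, not prescribed values:

```python
import time
import typesense

client = typesense.Client({
    "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
    "api_key": "xyz",  # placeholder admin key
    "connection_timeout_seconds": 10,
})

BATCH_SIZE = 1000      # illustrative; much larger than 40 docs per call
FLUSH_INTERVAL = 5.0   # seconds between flushes when the buffer is small

buffer, last_flush = [], time.time()

def flush():
    global buffer, last_flush
    if buffer:
        # Single import API call for the whole buffer; upsert keeps reindexes idempotent.
        client.collections["products"].documents.import_(buffer, {"action": "upsert"})
        buffer = []
    last_flush = time.time()

def add_document(doc):
    # Called by the producing process for each document; flushes on size or time.
    buffer.append(doc)
    if len(buffer) >= BATCH_SIZE or time.time() - last_flush >= FLUSH_INTERVAL:
        flush()
```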
w
What configuration should we be looking at? We're currently running 4 vCPUs and 16 GB of RAM per instance.
k
Try using more CPU
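One quick way to sanity-check whether CPU really is the bottleneck before resizing is the /metrics.json endpoint; a small sketch, where the host and API key are placeholders and the exact field names may vary by version:

```python
import requests

# Placeholder host and admin API key; /metrics.json requires the admin key.
HOST, API_KEY = "10.62.58.21", "xyz"

resp = requests.get(
    f"http://{HOST}:8108/metrics.json",
    headers={"X-TYPESENSE-API-KEY": API_KEY},
    timeout=5,
)

# Print only the CPU- and memory-related fields to see whether the node is saturated.
for key, value in resp.json().items():
    if "cpu" in key or "memory" in key:
        print(key, value)
```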
w
Yep, sounds good, I'll give that a try
W20240827 14:47:31.324640 4182 replicator.cpp:883] Group default_group Fail to install snapshot at peer=10.62.58.76:8107:8108, [E116][10.62.58.76:8107][E116]Loading a stale snapshot last_applied_index=5 last_applied_term=8 snapshot_index=46366 snapshot_term=7
W20240827 14:47:31.446879 4183 socket.cpp:1340] Fail to wait EPOLLOUT of fd=29: Connection timed out [110]
I20240827 14:47:31.825054 4183 replicator.cpp:834] node default_group:10.62.58.21:8107:8108 send InstallSnapshotRequest to 10.62.58.76:8107:8108 term 9 last_included_term 7 last_included_index 46366 uri <remote://10.62.58.21:8107/140140020985305903>
I20240827 14:47:31.961494 4183 node.cpp:754] node default_group:10.62.58.21:8107:8108 waits peer 10.62.58.76:8107:8108 to catch up
W20240827 14:47:33.344660 4183 replicator.cpp:297] Group default_group fail to issue RPC to 10.62.58.91:8107:8108 _consecutive_error_times=14061, [E112]Not connected to 10.62.58.91:8107 yet, server_id=281 [R1][E112]Not connected to 10.62.58.91:8107 yet, server_id=281 [R2][E112]Not connected to 10.62.58.91:8107 yet, server_id=281 [R3][E112]Not connected to 10.62.58.91:8107 yet, server_id=281
W20240827 14:47:34.947259 4183 socket.cpp:1340] Fail to wait EPOLLOUT of fd=29: Connection timed out [110]
W20240827 14:47:35.845861 4183 replicator.cpp:297] Group default_group fail to issue RPC to 10.62.58.91:8107:8108 _consecutive_error_times=14071, [E112]Not connected to 10.62.58.91:8107 yet, server_id=281 [R1][E112]Not connected to 10.62.58.91:8107 yet, server_id=281 [R2][E112]Not connected to 10.62.58.91:8107 yet, server_id=281 [R3][E112]Not connected to 10.62.58.91:8107 yet, server_id=281
E20240827 14:47:36.450232 4142 raft_server.cpp:772] 1791 queued writes > healthy read lag of 1000
E20240827 14:47:36.450322 4142 raft_server.cpp:784] 1791 queued writes > healthy write lag of 500
I20240827 14:47:36.961716 4183 node.cpp:754] node default_group:10.62.58.21:8107:8108 waits peer 10.62.58.76:8107:8108 to catch up
I20240827 14:47:36.984329 4186 replicator.cpp:881] received InstallSnapshotResponse from default_group:10.62.58.76:8107:8108 last_included_index 46366 last_included_term 7 error: [E116][10.62.58.76:8107][E116]Loading a stale snapshot last_applied_index=5 last_applied_term=8 snapshot_index=46366 snapshot_term=7
W20240827 14:47:36.984382 4186 replicator.cpp:883] Group default_group Fail to install snapshot at peer=10.62.58.76:8107:8108, [E116][10.62.58.76:8107][E116]Loading a stale snapshot last_applied_index=5 last_applied_term=8 snapshot_index=46366 snapshot_term=7
I20240827 14:47:37.450559 4142 raft_server.cpp:693] Term: 9, pending_queue: 0, last_index: 46519, committed: 46519, known_applied: 46519, applying: 0, pending_writes: 0, queued_writes: 1791, local_sequence: 501266
W20240827 14:47:37.450608 4142 node.cpp:843] [default_group:10.62.58.21:8107:8108 ] Refusing concurrent configuration changing
E20240827 14:47:37.450700 4184 raft_server.h:62] Peer refresh failed, error: Doing another configuration change
This may also be a related issue, where the queued writes seem to be stuck indefinitely.
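For what it's worth, a small sketch of how the queued_writes figure from the raft status line above could be tracked over time; the log path is a placeholder, and the regex only matches the status line format shown in the paste:

```python
import re

# Placeholder log path; the pattern matches the raft_server.cpp status line above,
# e.g. "... raft_server.cpp:693] Term: 9, ... queued_writes: 1791, ..."
LOG_PATH = "/var/log/typesense/typesense.log"
STATUS = re.compile(r"raft_server\.cpp:\d+\] Term: (\d+).*queued_writes: (\d+)")

with open(LOG_PATH) as f:
    for line in f:
        match = STATUS.search(line)
        if match:
            term, queued = match.groups()
            print(f"term={term} queued_writes={queued}")
```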
k
It says connection failure.
Unfortunately we don't have enough bandwidth to dedicate to debugging self-hosting issues in detail.
w
Yeah that makes sense, thanks for the help either way!