# community-help
e
hey, is it possible (expected) to get different results from each node in a multi-node cluster setup when I don't use any sort_by clause? I am getting different results from the master node than from the 2 slave nodes.
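A rough way to confirm the divergence is to run the same unsorted query against each node's API port directly and compare which documents come back. This is only a sketch: the node URLs, the API key and the `products` collection below are placeholders, and since there is no sort_by, only the set of returned IDs is compared rather than their order.
```python
# Rough sketch: run the same unsorted search against each node directly and
# compare the returned document IDs. Node URLs, the API key and the
# 'products' collection are placeholders for illustration.
import requests

NODES = ["http://ts-node-1:8108", "http://ts-node-2:8108", "http://ts-node-3:8108"]
API_KEY = "xyz"  # placeholder search/admin key
PARAMS = {"q": "*", "query_by": "title", "per_page": 50}  # deliberately no sort_by

def result_ids(node_url: str) -> list[str]:
    r = requests.get(
        f"{node_url}/collections/products/documents/search",
        headers={"X-TYPESENSE-API-KEY": API_KEY},
        params=PARAMS,
        timeout=5,
    )
    r.raise_for_status()
    return [hit["document"]["id"] for hit in r.json()["hits"]]

ids_per_node = {node: result_ids(node) for node in NODES}
reference = set(ids_per_node[NODES[0]])
for node, ids in ids_per_node.items():
    status = "matches node 1" if set(ids) == reference else "DIVERGED"
    print(f"{node}: {len(ids)} hits, {status}")
```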
k
You should not be getting that. The only reason that can happen is if there was some failure in clustering or the management plane that caused the nodes to go out of sync.
e
Do you mean permanently out of sync? How would I detect that? When I ingest new items, they are correctly replicated to all nodes.
f
You can perform a rotation by killing a node and restarting it.
k
Are you hosting Typesense yourself?
e
yes, we are
k
Then you have probably made an error while recovering from an outage or a misconfiguration. Raft requires a quorum of 2 out of 3 nodes to be available. If this is not carefully managed, you can end up with inconsistent views.
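For reference, the "quorum of 2 out of 3 nodes" above is just the standard Raft majority rule; a minimal sketch of the arithmetic makes explicit why only one node of a 3-node cluster can be down at a time:
```python
# Minimal sketch of the Raft majority rule behind the "quorum of 2/3 nodes"
# statement: a cluster of N nodes needs floor(N/2) + 1 nodes up to elect a
# leader and commit writes.
def quorum(n_nodes: int) -> int:
    return n_nodes // 2 + 1

for n in (1, 3, 5):
    q = quorum(n)
    print(f"{n}-node cluster: quorum = {q}, tolerates {n - q} node(s) down")
# A 3-node cluster therefore tolerates exactly one node being down, which is
# why rolling upgrades should only ever take out one node at a time.
```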
t
hm, can you please clarify the statement above? TL;DR: the expectation is that during e.g. an upgrade (which we did at v28) some inconsistency is expected -- that's no problem... but once we have all nodes up at the next version, the cluster should become consistent again? We are pretty sure we followed these instructions correctly, most notably:
1. "Repeat steps 2 and 3 for the other _followers_, leaving the leader node uninterrupted for now."
2. We only restarted the next node AFTER the previously restarted node had rejoined the cluster, was healthy, and had started serving requests again, i.e. there was at most one node out of the cluster and syncing at any time (see the health-check sketch after this message).
3. We also made sure the master was the last node to be restarted (and upgraded).

Our expectation would be that, since there is some kind of WAL log/checkpointing:
• some inconsistency will be created when the master, as the last node, goes down (there may be some ingest going on at that moment)
• but then one of the former follower nodes will become the new master, and the state from the new master will get replicated to the new followers (which probably means some state gets rolled back on the node that was the original master before the upgrade)

Can you please comment on the expectation above and/or correct it? We can definitely try to force a re-election (e.g. make the original master the master again), but that does not seem to explain what actually happened.
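The health-check sketch referenced in point 2 above: after restarting a follower, poll that node's /health endpoint and only move on once it reports ok. This is a rough illustration under assumptions, not a definitive procedure: the node URLs are placeholders, the actual restart/upgrade command is left out, and a healthy /health response is a coarse signal, so confirming the node has also caught up on the Raft log is still advisable.
```python
# Hedged sketch of the "wait before restarting the next follower" step: poll
# the restarted node's /health endpoint until it reports ok. Node URLs are
# placeholders and the actual restart step is left out.
import time
import requests

def wait_until_healthy(node_url: str, timeout_s: int = 600, interval_s: int = 5) -> bool:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            r = requests.get(f"{node_url}/health", timeout=3)
            if r.ok and r.json().get("ok") is True:
                return True
        except requests.RequestException:
            pass  # node still restarting or catching up
        time.sleep(interval_s)
    return False

# Followers first, leader last, never more than one node down at a time.
followers = ["http://ts-node-2:8108", "http://ts-node-3:8108"]  # placeholders
for node in followers:
    # restart_and_upgrade(node)  # placeholder for your actual restart step
    assert wait_until_healthy(node), f"{node} did not come back healthy"
```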
Just to be sure: by "WAL log" we probably mean the committed number in these logs, which seems to be consistent across all our nodes:
```
I20250221 14:14:13.597611   184 raft_server.cpp:683] Term: 13, pending_queue: 0, last_index: 6148621, committed: 6148621, known_applied: 6148621, applying: 0, pending_writes: 0, queued_writes: 0, local_sequence: 36410614
I20250221 14:14:11.064772   184 raft_server.cpp:683] Term: 13, pending_queue: 0, last_index: 6148621, committed: 6148621, known_applied: 6148621, applying: 0, pending_writes: 0, queued_writes: 0, local_sequence: 36405023
I20250221 14:13:57.792254   185 raft_server.cpp:683] Term: 13, pending_queue: 0, last_index: 6148621, committed: 6148621, known_applied: 6148621, applying: 0, pending_writes: 0, queued_writes: 0, local_sequence: 36410627
```
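One way to compare those progress fields across nodes is to parse the raft_server.cpp lines directly; a small sketch keyed to the exact format pasted above (the node labels are only guesses at which line came from which node):
```python
# Sketch for comparing the progress fields in the raft_server.cpp log lines
# pasted above. The regex is keyed to the exact log format shown here.
import re

LINE_RE = re.compile(
    r"Term: (?P<term>\d+), pending_queue: \d+, last_index: (?P<last_index>\d+), "
    r"committed: (?P<committed>\d+), known_applied: (?P<known_applied>\d+)"
)

lines = {
    "node-1": "I20250221 14:14:13.597611   184 raft_server.cpp:683] Term: 13, pending_queue: 0, last_index: 6148621, committed: 6148621, known_applied: 6148621, applying: 0, pending_writes: 0, queued_writes: 0, local_sequence: 36410614",
    "node-2": "I20250221 14:14:11.064772   184 raft_server.cpp:683] Term: 13, pending_queue: 0, last_index: 6148621, committed: 6148621, known_applied: 6148621, applying: 0, pending_writes: 0, queued_writes: 0, local_sequence: 36405023",
    "node-3": "I20250221 14:13:57.792254   185 raft_server.cpp:683] Term: 13, pending_queue: 0, last_index: 6148621, committed: 6148621, known_applied: 6148621, applying: 0, pending_writes: 0, queued_writes: 0, local_sequence: 36410627",
}

parsed = {node: LINE_RE.search(line).groupdict() for node, line in lines.items()}
for field in ("term", "last_index", "committed", "known_applied"):
    values = {p[field] for p in parsed.values()}
    print(f"{field}: {'consistent' if len(values) == 1 else 'DIVERGED'} ({values})")
```
With the values pasted above, all four fields agree across the three nodes, which matches the observation that the Raft log itself looks in sync.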
k
It's difficult to describe all the corner cases that can happen, but typically whenever we have seen this type of issue, it's because one of the nodes was restarted before another node had fully caught up, etc. Since the commit numbers match, at least the Raft log is up to date, so you can try restarting a node to see if the numbers stabilize after that. For example, another failure that can happen is if one of the nodes is overwhelmed during the upgrade, it could end up rejecting a write into the in-memory index.
t
Hm, can you please clarify this a bit more (or maybe point to relevant documentation / discussion about this topic)?
1. If a write can in theory be rejected from the in-memory index, wouldn't it be better to fail fast somehow and disconnect the out-of-sync node (maybe after some retries from the replication log)? Is it a design decision to allow this inconsistency rather than fail fast (as would probably happen with databases like Postgres/Kafka)?
2. Is there some reliable way to monitor this particular out-of-sync state (so that we can react and fix it as it occurs, and not only after we spot inconsistent results across instances)? If the Raft log shown above looks correct (i.e. in sync), is there another log/endpoint where this could be checked? (A rough monitoring idea is sketched after this message.)
3. To be clear, for a search use case, inconsistencies are not that big of a deal from our point of view (they can always be remedied by re-ingesting the data, or maybe even a restart as mentioned above), but the cluster being silent about the issue (e.g. reporting a healthy state while being inconsistent) seems a bit troubling. If the answer to point 2 is negative, would you consider this kind of monitoring a good feature to add, and is it technically possible to provide it?

We will try to restart nodes / perform a leader re-election tomorrow and check whether this helps.
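The monitoring idea referenced in point 2: periodically query each node's API port directly and compare per-collection document counts. This is not an official consistency check, just a coarse divergence signal sketched under assumptions; the node URLs and API key are placeholders.
```python
# Hedged monitoring sketch: compare per-collection document counts across
# nodes by querying each node directly. Node URLs and the API key are
# placeholders.
import requests

NODES = ["http://ts-node-1:8108", "http://ts-node-2:8108", "http://ts-node-3:8108"]
API_KEY = "xyz"  # placeholder admin key

def doc_counts(node_url: str) -> dict:
    r = requests.get(
        f"{node_url}/collections",
        headers={"X-TYPESENSE-API-KEY": API_KEY},
        timeout=5,
    )
    r.raise_for_status()
    return {c["name"]: c["num_documents"] for c in r.json()}

counts = {node: doc_counts(node) for node in NODES}
reference = counts[NODES[0]]
for node, per_collection in counts.items():
    diffs = {
        name: (n_docs, reference.get(name))
        for name, n_docs in per_collection.items()
        if reference.get(name) != n_docs
    }
    print(node, "DIVERGED:" if diffs else "consistent", diffs or "")
```
Note that counts can still match while individual documents differ, so this only catches the grosser divergence cases.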
k
The search system is built to favor availability, since search requires strong uptime guarantees for a lot of our customers. Unlike an ACID-compliant primary database, our design choices involve some relaxed constraints. Some of these design choices mean that if you are not careful about managing the clustering, the nodes can diverge. However, we have seen maybe only 1-2 such instances in the last few years on Typesense Cloud. We do have safeguards in place to detect this during upgrades that run on Typesense Cloud. Unfortunately, we can only offer limited support for self-hosted clusters, since a lot of things can go wrong and it's difficult for us to reason about failure cases without instrumentation and logs.