#community-help

Discussion About Typesense Nodes Not Synchronizing Correctly

TLDR Erick experienced an issue where documents weren't being updated properly in a Typesense instance running on 3 nodes. Upon requesting debug logs and configs, Jason identified that the nodes weren't part of the same cluster. The connection issue between the nodes couldn't be fully resolved, so a fresh installation was recommended.

Feb 03, 2022 (21 months ago)
Erick
10:20 PM
Hi all. When updating a document, how fast will this change take effect?
Jason
10:20 PM
When the API call returns, the update has already been processed and will show up in search
Erick
10:21 PM
Even if I'm using the Firebase extension?
10:21
Erick
10:21 PM
I did a query and the value didn't change
Jason
10:21 PM
Oh the firebase extension delay - that I'm not sure. It depends on how fast Firestore calls the extension code after a document change
10:22
Jason
10:22 PM
I was talking about the time between the Typesense API call being made and when it is processed
Erick
10:22 PM
even though I checked in the dashboard that the document was changed
10:22
Erick
10:22 PM
But when I perform a query using shell
10:22
Erick
10:22 PM
the value is the old one.
Jason
10:23 PM
I've not heard of any delays here though, unless it errored out. Could you check the Firebase function logs for the extension and see if any errors show up?
Erick
10:25 PM
The logs only show that the document was being upserted with the right values.
Jason
10:27 PM
Hmm, if no errors show up there, then it should have made its way to Typesense. Could you make sure you're looking at the right collection, etc?
10:27
Jason
10:27 PM
Also, could you refresh the Typesense Cloud dashboard and try just in case
Erick
10:29 PM
Interesting. When I modify the document the first time
10:30
Erick
10:30 PM
The value changes in Firebase and I can see that the extension is working as it should
10:30
Erick
10:30 PM
But the change doesn't take effect
10:31
Erick
10:31 PM
When I change the document a second time, everything goes as usual but this time I can see the change in typesense too.
10:32
Erick
10:32 PM
In short, I have to write the value 2 times before I see the change in typesense.
Jason
10:33 PM
I suspect it's the search cache in Typesense at play then. Could you try hitting the Typesense API directly via curl, do a GET on the document with its ID directly to confirm this?
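(As a sketch, such a direct lookup could look like the following, assuming a collection named products and a document ID of 123; the host, collection, ID and API key are placeholders for your own deployment:)

curl -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  "https://your-typesense-host:443/collections/products/documents/123"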
Erick
10:44 PM
I changed the value 5 times.

First time
• Changed the value in Firebase, Typesense extension executed, Typesense document result (via curl) = old value.
Second time
• Same as the first time.
Third time
• Waited before changing the document, performed the same steps, Typesense document result (via curl) = new value.
Jason
10:48 PM
Hmmm, hang on. Let me try to replicate this...
Erick
10:51 PM
Ok
Jason
11:08 PM
Hmmm, I can't seem to replicate it
11:08
Jason
11:08 PM
Could you share your cluster ID?
Erick
11:13 PM
We're using our own test servers for testing. We don't have a cluster ID yet.
Jason
11:16 PM
I can't think of a reason why it would be flaky though... It should either work or not work fully. Unless there's a network connection issue between Firestore and your server
11:16
Jason
11:16 PM
Could you double check that there's enough RAM as well on the server?
Erick
11:21 PM
We're running typesense on 3 nodes with 1GB RAM + 32GB storage.
Jason
11:23 PM
I also tested on a 3 node cluster. Could you do a GET /debug on all three nodes and post the output?
11:24
Jason
11:24 PM
I wonder if one of the nodes is not part of the cluster
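(For reference, the debug check is just an authenticated GET on /debug against each node's API port; the hostnames and key below are placeholders:)

curl -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" "https://node-1-host:443/debug"
curl -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" "https://node-2-host:443/debug"
curl -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" "https://node-3-host:443/debug"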
Erick
11:28 PM
Server 1

Debug output:
{
  "state": 1,
  "version": "0.22.1"
}

Server 2

Debug output:

{
  "state": 1,
  "version": "0.22.1"
}

Server 3

Debug output:

{
  "state": 1,
  "version": "0.22.1"
}

Jason
11:30 PM
Yup, that is indeed the issue. If a cluster was successfully established between all 3 nodes, you'll see state: 1 on one node and state: 4 on the other two nodes
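(In other words, a healthy 3-node cluster would report roughly the following from /debug, with state 1 on the leader and state 4 on the two followers:)

Leader:
{
  "state": 1,
  "version": "0.22.1"
}

Followers (x2):
{
  "state": 4,
  "version": "0.22.1"
}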
Erick
11:30 PM
Oh.
11:31
Erick
11:31 PM
Ok.
11:31
Erick
11:31 PM
How would I fix this?
Jason
11:31 PM
Could you share your Typesense configs and the nodes file content from all 3 nodes?
Erick
11:31 PM
Yes
11:48
Erick
11:48 PM
Server 1

; Typesense Configuration

[server]

api-address = company-ip-1
api-port = 443
data-dir = /var/lib/typesense
log-dir = /var/log/typesense
api-key = typesense-api-key
peering-address = 192.168.199.212
peering-port = 8107
nodes = /etc/typesense/nodes
ssl-certificate = /etc/letsencrypt/live/test1.example.com/fullchain.pem
ssl-certificate-key = /etc/letsencrypt/live/test2.example.com/privkey.pem

Server 2

; Typesense Configuration

[server]

api-address = company-ip-2
api-port = 443
data-dir = /var/lib/typesense
log-dir = /var/log/typesense
api-key = typesense-api-key
peering-address = 192.168.199.3
peering-port = 8107
nodes = /etc/typesense/nodes
ssl-certificate = /etc/letsencrypt/live/test2.example.com/fullchain.pem
ssl-certificate-key = /etc/letsencrypt/live/test2.example.com/privkey.pem

Server 3

; Typesense Configuration

[server]

api-address = company-ip-3
api-port = 443
data-dir = /var/lib/typesense
log-dir = /var/log/typesense
api-key = typesense-api-key
peering-address = 192.168.199.25
peering-port = 8107
nodes = /etc/typesense/nodes
ssl-certificate = /etc/letsencrypt/live/test3.example.com/fullchain.pem
ssl-certificate-key = /etc/letsencrypt/live/test3.example.com/privkey.pem
11:54
Erick
11:54 PM
/etc/typesense/nodes

192.168.199.212:8107:443,192.168.199.3:8107:443,192.168.199.25:8107:443
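(For reference, each comma-separated entry in the nodes file follows the pattern below, matching the peering-address, peering-port and api-port values from the three configs above:)

<peering-address>:<peering-port>:<api-port>,<peering-address>:<peering-port>:<api-port>,...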
Jason
11:57 PM
The configs seem fine... Could you use, say, telnet to ensure that port 8107 on each node is accessible from the other nodes?
11:58
Jason
11:58 PM
Eg: telnet 192.168.199.3 8107 should show you a prompt when you run it from the other two hosts
Feb 04, 2022 (21 months ago)
Erick
12:06 AM
Servers are seeing each other at layer 3.
12:07
Erick
12:07 AM
but not telnet.
Jason
12:11 AM
Yeah, something's up there then. Could you check firewall rules to make sure that port 8107 is allowed on all the nodes at least for the 192.168.x.x subnet
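(As one example, if the servers happen to use ufw, a rule like the following would open the peering port to the private subnet; the subnet mask is an assumption, and the equivalent rule would differ for iptables or firewalld:)

sudo ufw allow from 192.168.199.0/24 to any port 8107 proto tcp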
Erick
12:32 AM
Yes
12:32
Erick
12:32 AM
You were right. We configured the firewall to allow port 8107
12:33
Erick
12:33 AM
Now all servers can telnet to port 8107
Jason
12:33 AM
Great, if you now restart the typesense processes on all the servers, they should start forming a cluster
12:33
Jason
12:33 AM
You want to double-check by hitting the /debug endpoint
Erick
12:33 AM
so, reboot all servers?
Jason
12:34 AM
sudo systemctl restart typesense-server.service
12:34
Jason
12:34 AM
should be sufficient ^
Erick
12:39 AM
Nice. Now we have 1, 4, 4
Jason
12:39 AM
🙌

Erick
12:40 AM
What happens if the servers had data?
Jason
12:41 AM
They would not have been able to reconcile with each other, since they most likely would have had different data; it sounded like writes were going to different nodes
12:41
Jason
12:41 AM
So you would have had to delete the data dir on two of the nodes and then start the cluster so the third node can sync the data to the other two nodes
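(A rough sketch of that recovery sequence, assuming the systemd unit name and data dir from the configs in this thread; which node keeps its data is up to you:)

# on all three nodes: stop Typesense
sudo systemctl stop typesense-server.service

# on the two nodes being reset: clear the contents of the data dir
sudo rm -rf /var/lib/typesense/*

# start the node that kept its data first, then the other two
sudo systemctl start typesense-server.service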
Erick
12:43 AM
So, shut down all nodes, erase the data dir on 2 nodes, then restart all nodes again?
12:46
Erick
12:46 AM
Do we need to delete the data dir itself or just the content inside the dir?
Jason
12:57 AM
The content inside the data dir.
12:57
Jason
12:57 AM
Could you first check the logs though
12:58
Jason
12:58 AM
If it says "Peer Refresh Succeeded" on the node with state 1, then you're good. You don't have to delete the data dir
Erick
01:02 AM
Where can I check these logs? Using systemctl? /debug only shows version and state.
Jason
01:02 AM
/var/log/typesense/typesense.log
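(One way to watch for the peer refresh messages, assuming that log path:)

sudo tail -f /var/log/typesense/typesense.log
# or search what has already been logged:
sudo grep -i "peer refresh" /var/log/typesense/typesense.log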
Erick
01:12 AM
It seems it failed. We're getting the following:

I20211215 05:19:19.205374   771 node.cpp:722] node default_group:192.168.199.3:8107:443 waits peer 192.168.199.212:8107:443 to catch up
I20211215 05:19:19.205453   771 node.cpp:722] node default_group:192.168.199.3:8107:443 waits peer 192.168.199.25:8107:443 to catch up
W20211215 05:19:19.206537 774 replicator.cpp:392] Group default_group fail to issue RPC to 192.168.199.212:8107:443 _consecutive_error_times=11, [E112]Not connected to 192.168.199.212:8107 yet, server_id=206158430322 [R1][E112]Not connected to 192.168.199.212:8107 yet, server_id=206158430322 [R2][E112]Not connected to 192.168.199.212:8107 yet, server_id=206158430322 [R3][E112]Not connected to 192.168.199.212:8107 yet, server_id=206158430322


W20211215 05:19:19.206666 774 replicator.cpp:292] Group default_group fail to issue RPC to 192.168.199.25:8107:443 _consecutive_error_times=11, [E112]Not connected to 192.168.199.25:8107 yet, server_id=163208757684 [R1][E112]Not connected to 192.168.199.25:8107 yet, server_id=163208757684 [R2][E112]Not connected to 192.168.199.25:8107 yet, server_id=163208757684 [R3][E112]Not connected to 192.168.199.25:8107 yet, server_id=163208757684

W20211215 05:19:20.806144 771 socket.cpp:1193] Fail to wait EPOLLOUT of fd=28: Connection timed out [110]
W20211215 05:19:20.806241 771 socket.cpp:1193] Fail to wait EPOLLOUT of fd=26: Connection timed out [110]
Jason
01:13 AM
Hmm, "Connection timed out" is a different issue - sounds like one of the nodes might still have trouble connecting
01:13
Jason
01:13 AM
In any case, I think it's good to clear the data dir from two of the nodes, and then restart them so they can catch up with the 3rd node
Erick
01:15 AM
Ok. Let me check.
01:40
Erick
01:40 AM
I'm getting the same result plus this: Peer refresh failed, error: Peer 192.168.199.212:8107:443 failed to catch up
01:43
Erick
01:43 AM
We've erased all the folders from the data dir on each node.
01:43
Erick
01:43 AM
All 3 nodes are showing the same logs
01:45
Erick
01:45 AM
We're also seeing these at the beginning:

Running GC for aborted requests, req map size: 0
I20211214 17:47:19.990471   673 raft_server.cpp:524] Term: 5, last_index index: 5, committed_index: 5, known_applied_index: 5, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 1
Jason
01:45 AM
Ok let's try this. Could you stop all three processes, clear the data dir on all nodes, then on one of the nodes edit the nodes file to just have its own IP and start the Typesense process up. It should log "peer refresh succeeded". Then add the 2nd node's IP to the nodes file and start the Typesense process on the 2nd node, and the same for the 3rd node.
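(A sketch of one way to carry that out, using the IPs and paths from earlier in the thread; the exact nodes-file contents at each step are an interpretation of the instructions above, not an official procedure:)

# on all three nodes: stop Typesense and clear the data dir
sudo systemctl stop typesense-server.service
sudo rm -rf /var/lib/typesense/*

# node 1: nodes file with only its own entry, then start
echo "192.168.199.212:8107:443" | sudo tee /etc/typesense/nodes
sudo systemctl start typesense-server.service
# wait for "Peer refresh succeeded" in /var/log/typesense/typesense.log

# node 2: nodes file with the first two entries, then start
echo "192.168.199.212:8107:443,192.168.199.3:8107:443" | sudo tee /etc/typesense/nodes
sudo systemctl start typesense-server.service

# node 3: nodes file with all three entries, then start
echo "192.168.199.212:8107:443,192.168.199.3:8107:443,192.168.199.25:8107:443" | sudo tee /etc/typesense/nodes
sudo systemctl start typesense-server.service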
01:46
Jason
01:46 AM
The last log lines you shared are normal
Erick
01:46 AM
Ok. Let me try it.
01:47
Erick
01:47 AM
Should we edit /etc/typesense/nodes too?
Jason
01:47 AM
Yes, by “nodes file” I meant edit /etc/typesense/nodes
Erick
01:54 AM
Running systemctl status, we get the following:

Started Typesense Server.
Log directory is configured as: /var/log/typesense
Peer refresh failed, error: Doing another configuration change
Jason
01:55 AM
You want to look at the var logs
01:55
Jason
01:55 AM
It should eventually log peer refresh succeeded
Erick
01:57 AM
It's interesting. The log shows the server trying to connect to the other 2 even though they are off and /etc/typesense/nodes has the IP with ports (8107:443).
Jason
01:59 AM
That means the data dir from the previous run is still intact
01:59
Jason
01:59 AM
You want to stop the Typesense process, clear the data dir, make sure it’s fully empty and start the Typesense process again
Erick
02:03 AM
That's weird. We stopped the service using systemctl stop typesense-server.service. Then we erased the db, meta and state folders from /var/lib/typesense, and then we restarted the service.
02:08
Erick
02:08 AM
Do you think we need to wait some time before restarting the service? Because it seems the service is using a cached config when restarting
Kishore Nallan
02:11 AM
Can you try stopping all 3 nodes, deleting the contents of data directory (rm -rf /var/lib/typesense/*) and starting them one by one again?
Erick
02:14 AM
We'll do it. Let us check.
02:22
Erick
02:22 AM
It's still trying to connect to the other servers. We're thinking of doing a fresh installation and going from there.
Kishore Nallan
02:50 AM
Okay, there is no other place where Typesense stores the config. So if it is still using some old configuration, it means that somehow the data directory is not being cleared correctly.
Erick
03:04 AM
Thanks. We'll start the testing over and post the results in this thread.
03:04
Erick
03:04 AM
Thanks for all the help Jason.
