#community-help

Discussing Dataset Indexing and Instance Reboots

TLDR Thomas asked about dataset indexing and instance reboots. Kishore Nallan clarified that indexing endpoints are synchronous, that the index is rebuilt in memory on every instance restart, and that upgrades shouldn't cause issues. CPU speed is the bottleneck during re-indexing. Kishore suggested trying CRIU for periodic RAM dumps to avoid re-indexing on reboot.

Solved
Mar 02, 2022 (20 months ago)
Thomas
11:44 AM
How can I know that indexing completed on a dataset?
Kishore Nallan
11:45 AM
When the endpoint response arrives, indexing is done. All endpoints are synchronous.
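Because the call is synchronous, the response body itself is the completion signal. A minimal sketch, assuming the bulk import endpoint returns one JSON object per line with a `success` flag (and an `error` string on failure); the helper name `summarize_import_response` is hypothetical:

```python
import json


def summarize_import_response(body):
    """Count per-document results in a JSON Lines import response.

    Assumes each non-empty line looks like {"success": true} or
    {"success": false, "error": "..."}.
    """
    results = [json.loads(line) for line in body.splitlines() if line.strip()]
    failed = [r for r in results if not r.get("success")]
    return {
        "total": len(results),
        "failed": len(failed),
        "errors": [r.get("error") for r in failed],
    }
```

Once this summary can be computed, the documents reported as successful are already indexed; there is no separate job to poll.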
Thomas
11:46 AM
If we restart the instance, it takes a long time before it's usable, like 30 minutes. Does it re-index on reboot?
Kishore Nallan
11:47 AM
Yes, only raw documents are stored on disk and indexing happens in memory on restart.
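Since a restarted node is unusable until the in-memory index is rebuilt, a deploy script can poll until the node reports ready. A rough sketch, assuming the node exposes `GET /health` returning `{"ok": true}` once it can serve requests; the function names, the default timeout, and the `http://localhost:8108` address in the usage note are illustrative assumptions:

```python
import json
import time
import urllib.request


def fetch_health(base_url):
    """Return True if GET {base_url}/health answers with {"ok": true}."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return bool(json.load(resp).get("ok"))
    except (OSError, ValueError):
        # Connection refused, HTTP error status, or a non-JSON body.
        return False


def wait_until_ready(check, timeout_s=3600.0, interval_s=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call check() repeatedly until it returns True or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Usage would look like `wait_until_ready(lambda: fetch_health("http://localhost:8108"))`. The generous default timeout reflects the 30-minute restart mentioned above.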
Thomas
11:48 AM
Alright, what's the bottleneck on that? CPU or Disk speed?
Kishore Nallan
11:48 AM
CPU. The latest 0.23 RC builds are faster in this respect.
Thomas
11:51 AM
How much faster is 0.23 RC?
Kishore Nallan
11:51 AM
Depends on dataset. Primary work is around numerical fields.
Thomas
11:52 AM
Dataset is majority text
Kishore Nallan
11:52 AM
We recommend running a 3 node configuration so rotations can be done without a single point of failure.
Kishore Nallan
11:53 AM
There might be a few other things we can still do to optimize text fields.
Thomas
11:59 AM
Yeah we're starting with 3 nodes in a cluster
Thomas
11:59 AM
What's the optimal size in terms of keys?
Thomas
12:00 PM
We have 60 datapoints that we need to filter on
Thomas
12:11 PM
Kishore Nallan How often does the API break? Could rolling upgrades on a cluster be a problem?
Kishore Nallan
12:11 PM
Every node stores all the data, so additional nodes help increase throughput across many users.
Kishore Nallan
12:12 PM
We've successfully rolled out 5 version upgrades so far on Typesense Cloud across hundreds of deployments. We take care to maintain backward compatibility.
Kishore Nallan
12:14 PM
We store nothing but documents on disk so not much problem with upgrades.
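The rolling upgrade discussed here can be sketched as a loop that touches one node at a time and waits for it to finish re-indexing before moving on, so the other two nodes keep serving traffic. This is only an illustration of the idea, not a documented Typesense procedure; `upgrade` and `is_ready` are hypothetical callables supplied by the operator:

```python
def rolling_upgrade(nodes, upgrade, is_ready):
    """Upgrade nodes sequentially, stopping at the first one that fails to return.

    upgrade(node)  -- hypothetical callable that restarts the node on the new version
    is_ready(node) -- hypothetical callable that blocks until the node has rebuilt
                      its index, returning False if it gives up
    """
    done = []
    for node in nodes:
        upgrade(node)
        if not is_ready(node):
            # Leave the remaining nodes on the old version so the cluster stays up.
            raise RuntimeError(f"{node} did not become ready; upgraded so far: {done}")
        done.append(node)
    return done
```

Stopping at the first failure keeps a majority of nodes untouched, which is the point of running three nodes in the first place.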
Thomas
12:18 PM
Superb 🙂
Thomas
12:18 PM
Thank you
Kishore Nallan
12:18 PM
👍
Thomas
12:47 PM
Kishore Nallan Isn't it possible to periodically dump the RAM that holds the index to disk, so it doesn't need to be re-indexed on reboot?
Kishore Nallan
02:37 PM
You could try doing this via CRIU: https://criu.org/Main_Page
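For anyone wanting to try this, the sketch below only builds the `criu` command lines for a checkpoint and a restore. It is untested against Typesense; the helper names and the choice of flags (`--tcp-established` to carry open TCP connections, `--leave-running` to keep the server up after the dump) are assumptions about what such a process would need:

```python
def criu_dump_cmd(pid, images_dir, leave_running=True):
    """Build a `criu dump` command that checkpoints the process tree rooted at pid."""
    cmd = [
        "criu", "dump",
        "--tree", str(pid),          # root of the process tree to checkpoint
        "--images-dir", images_dir,  # where the memory images are written
        "--tcp-established",         # assumption: the server holds open TCP connections
    ]
    if leave_running:
        cmd.append("--leave-running")  # keep the process alive after the dump
    return cmd


def criu_restore_cmd(images_dir):
    """Build the matching `criu restore` command for a previous dump."""
    return [
        "criu", "restore",
        "--images-dir", images_dir,
        "--tcp-established",
        "--restore-detached",  # return once the restored process is running
    ]

# To execute (requires root and CRIU installed):
#   import subprocess; subprocess.run(criu_dump_cmd(pid, "/path/to/images"), check=True)
```

One caveat with a dump taken while the server keeps running: documents written after the checkpoint exist on disk but not in the restored memory image, so the restored index may be stale.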

Thomas
02:48 PM
Yeah that's how we currently do it with KVM, but this doesn't help if there's a hardware issue