# community-help
Boubacar
Hello Typesense team, I'm considering using Typesense for a large-scale document management project, and I have some questions regarding performance, scalability, and disaster recovery. Our current setup with Solr has the following stats:
- 178,173,453 documents indexed
- Average of 18,000 documents indexed daily in 2023
- Current Solr index size: 115.77 GB
- 1.0 TB of text and metadata for articles (excluding images and binary documents)
- 100 TB of binary data (images, thumbnails, original XMLs, etc.)

Given this scale, I have several questions and concerns:
1. Snapshot Restoration Time:
   - How long would it typically take to restore from a snapshot, particularly for rebuilding in-memory indexes?
   - Are there any recommendations for speeding up this process?
2. Snapshot Format and Size Limitations:
   - Is the snapshot a single file or multiple files?
   - Are there any size limitations for snapshots? (Our current Solr index is 100 GB and causes issues)
3. Multi-Instance Setup for Tiered Access: We're considering setting up multiple Typesense instances to structure collections based on access patterns:
   - Hot data (articles less than 1 year old)
   - Semi-hot data (2-5 year old articles)
   - Cold data (5+ years old)
   Do you have any recommendations or best practices for this approach?
4. Disaster Recovery Strategies:
   - Beyond snapshots, what other disaster recovery strategies would you recommend for a dataset of our size?
Óscar Vicente
You wouldn't be able to hold that much data in memory unless you can split it across machines of 32 GB to 64 GB. That's the hard limit we reached: rebuild time (which also means downtime) is about an hour for a collection with 3M+ records on 32 GB of RAM (28 GB of which is indexes). I suggest you use the tiered-access approach, but only use Typesense for the hot tier and a disk-based solution for the rest.
Boubacar
@Óscar Vicente Thanks for your answer. Could you detail the limit that you reached? And what do you mean by tiered access?
Óscar Vicente
The limit is that it takes an hour to rebuild the indexes, and that also affects restarts, etc.
So if there's any issue, you risk a very long downtime.
Or if you change or add any field...
It's too impractical.
Boubacar
My goal is to reduce the downtime of reconstructing the index
Óscar Vicente
Well, each indexed field is only handled by one core. The only way to "parallelize it" is by splitting it into multiple smaller collections. Then you'll have to handle issues like relevance ordering, merging the results of searches across multiple collections, etc.
So it really depends on how you want to query your data.
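To make that split-and-merge approach concrete, here is a minimal sketch using the Typesense Python client's multi_search endpoint to fan one query out across hypothetical per-year article collections; the collection names, fields, host, and API key are illustrative assumptions, and merging or re-ranking across collections is left to the application.

```python
import typesense

# Assumed connection details; adjust to your deployment.
client = typesense.Client({
    "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
    "api_key": "xyz",
    "connection_timeout_seconds": 5,
})

# One federated request that fans out to several per-year collections.
# Collection and field names are hypothetical.
search_requests = {
    "searches": [
        {"collection": "articles_2023", "q": "budget reform", "query_by": "title,body"},
        {"collection": "articles_2022", "q": "budget reform", "query_by": "title,body"},
        {"collection": "articles_2021", "q": "budget reform", "query_by": "title,body"},
    ]
}

# Parameters applied to every search in the request.
common_params = {"per_page": 10}

results = client.multi_search.perform(search_requests, common_params)

# Each entry in results["results"] corresponds to one collection; combining and
# re-ordering hits across collections is application-level work, as noted above.
for res in results["results"]:
    for hit in res.get("hits", []):
        print(hit["document"].get("title"))
```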
Boubacar
We are planning to have this kind of collection layout:
Collection Article {}
Collection Image {}
Collection Vidéo {}
Collection Audio {}
Collection Metadata {}
We were also thinking of splitting the Article collection per year. Typesense will mostly be used as a DB on top of which we build a query engine for the business. We will then store the query so we can retrieve the data later, for example to construct our web pages.
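As an illustration of one per-year article collection (a sketch only; the schema, field names, and connection details are assumptions, not a confirmed design), creating it with the Typesense Python client could look like this. Only fields that are actually searched, faceted, or sorted on need to be indexed, which keeps memory usage down.

```python
import typesense

client = typesense.Client({
    "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
    "api_key": "xyz",
    "connection_timeout_seconds": 5,
})

# Hypothetical per-year article collection with a handful of indexed fields.
articles_2023_schema = {
    "name": "articles_2023",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "body", "type": "string"},
        {"name": "brand", "type": "string", "facet": True},
        {"name": "published_at", "type": "int64"},
    ],
    "default_sorting_field": "published_at",
}

client.collections.create(articles_2023_schema)
```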
Óscar Vicente
How many articles do you have per year, how many fields will you index, and what's the average size in bytes of each? Will you also use embeddings? And what's the budget?
Will you need HA?
Boubacar
How many articles do you have per year? => 50k
How many fields will you index? => 200
What's the average size in bytes of each? => the total is 10.2 KB
Will you also use embeddings? => we will do that outside
What's the budget? => we're not counting that at the moment
Will you need HA? => Yes
Óscar Vicente
50k articles with 200 indexed fields?? Will you search in all those fields? Or do you just want to store 200 fields and search through 5-10?
Is 10.2 KB the size of a record? The sum of the averages of each field?
What embedding size will you use? For multilanguage, I use 1024 dimensions, which are floats (32 bits per dimension).
So, multiplying the number of records by the average size of an indexed field, by the number of indexed fields, and then by 2-3x will give you an approximate in-memory size for the collection. Then, in my experience, the sweet spot for a big Typesense cluster is between 4-8 cores and 32-64 GB of RAM. Multiply that by 3 for HA and by the number of tiers and you get the estimated cost. You can also use that memory estimate to work out the number of tiers or clusters you need.
But if you want to index a lot of small fields, you can probably use bigger machines. In my use case, I have to index 2 very big fields, so they use one core each and take about an hour at every restart or problem (and cause sync problems in the cluster because the collection is too big).
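As a rough illustration of that estimate (a back-of-the-envelope sketch, not an official sizing formula), here it is in Python; every input number below is a placeholder loosely based on the figures discussed in this thread.

```python
# Rule of thumb from above: indexed bytes per record x number of records x 2-3x overhead.
# All numbers are illustrative assumptions, not measurements.

num_records = 50_000 * 10          # e.g. ~50k articles/year over ~10 years
searched_fields = 10               # fields actually indexed for search
avg_indexed_field_bytes = 1_000    # assumed average size of each indexed field
overhead_factor = 3                # 2-3x in-memory overhead over raw indexed data

indexed_bytes_per_record = searched_fields * avg_indexed_field_bytes
estimated_ram_gb = num_records * indexed_bytes_per_record * overhead_factor / 1024**3

# Embeddings stored as float32 vectors add 4 bytes per dimension per record.
embedding_dims = 1024
embedding_ram_gb = num_records * embedding_dims * 4 / 1024**3

ha_nodes = 3                       # a highly available cluster runs 3 nodes
total_ram_gb = (estimated_ram_gb + embedding_ram_gb) * ha_nodes

print(f"Per-node RAM estimate: {estimated_ram_gb + embedding_ram_gb:.1f} GB")
print(f"Total RAM across {ha_nodes} HA nodes: {total_ram_gb:.1f} GB")
```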
Boubacar
50k articles with 200 indexed fields?? Will you search in all those fields? Or do you just want to store 200 fields and search through 5-10? => Yes, I will only search 5-10 fields
Is 10.2 KB the size of a record? The sum of the averages of each field? => it is the size of the record
What embedding size will you use? For multilanguage, I use 1024 dimensions, which are floats (32 bits per dimension) => the 50k is only for one brand; we have 7-10 brands
The biggest field will be the content of an article
10.2 KB is a small article
Óscar Vicente
But is 10.2 KB the size of the fields to be indexed?
And how many dimensions will your embeddings have?
Jason Bosco
Thank you for sharing your experience, Oscar.
> You wouldn't be able to hold that much data in memory unless you can split it across machines of 32 GB to 64 GB. That's the hard limit we reached
Quick note to clarify this - we have users running Typesense with several hundred GBs of RAM, so this is not a hard limit within Typesense. As long as you have sufficient RAM to hold your data in memory and CPU to handle the indexing / searching, Typesense can handle more data.
> The limit is that it takes an hour to rebuild the indexes, and that also affects restarts, etc.
Typesense does rebuild indices on restarts, and the amount of time that takes depends on the number of CPU cores you have and the configuration of `num-collections-parallel-load` and `num-documents-parallel-load`. So for 100s of millions of rows, it could take a few hours to rebuild the indices. This is a conscious design decision we made in order to keep version upgrades seamless - it's just a restart of the process, and Typesense will reindex using any new data structures that might have changed internally.
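For reference, a hedged sketch of how those two server options might be passed when starting Typesense; the paths and values below are placeholders, not recommendations, and should be tuned to your CPU count and dataset.

```sh
typesense-server \
  --data-dir=/var/lib/typesense \
  --api-key=$TYPESENSE_API_KEY \
  --num-collections-parallel-load=4 \
  --num-documents-parallel-load=1000
```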
> So if there's any issue, you risk a very long downtime
To avoid this downtime in production, you'd want to run a clustered, highly available setup with multiple nodes, and only rotate one node at a time, waiting for it to come back before rotating the other nodes. This way the cluster can still accept reads / writes on the other two nodes while the 3rd node is being rotated and is rebuilding indices.
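For context, a minimal sketch of what such a 3-node setup might look like; the addresses, ports, and file paths are illustrative assumptions based on Typesense's clustering configuration, not values from this thread.

```sh
# Shared nodes file listing all 3 nodes as "address:peering_port:api_port".
echo "10.0.0.1:8107:8108,10.0.0.2:8107:8108,10.0.0.3:8107:8108" > /etc/typesense/nodes

# Each node is started pointing at the same nodes file; during upgrades, rotate
# one node at a time so the other two keep serving traffic.
typesense-server \
  --data-dir=/var/lib/typesense \
  --api-key=$TYPESENSE_API_KEY \
  --nodes=/etc/typesense/nodes
```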
Boubacar - I've answered 1) above.
2) The snapshot is multiple files. There are no limitations on snapshot size, but I've only seen users with up to 500 GB of disk data, so there might be bottlenecks at larger sizes that we haven't tested for yet.
3) Sounds reasonable. In general, the smaller the number of documents, the more performant queries are.
4) Typesense is not meant to be used as your primary datastore, so for disaster recovery you'd typically want to do regular backups of your primary datastore. That said, you could also snapshot the copy of the data you've sent to Typesense using the snapshot API and start a new cluster with that snapshot. The key thing to consider for a dataset of your size is the startup time, so it's critical to run multiple nodes in an HA configuration.
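For completeness, a minimal sketch of triggering that snapshot with the Typesense Python client; the snapshot path and connection details are assumptions.

```python
import typesense

client = typesense.Client({
    "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
    "api_key": "xyz",
    "connection_timeout_seconds": 5,
})

# Ask the node to write a point-in-time snapshot of its data directory to disk.
# That directory can then be backed up and used to seed a new cluster.
client.operations.perform(
    "snapshot", {"snapshot_path": "/tmp/typesense-data-snapshot"}
)
```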
Óscar Vicente
Thanks @Jason Bosco! The experience was about having big fields indexed in one collection, and it wasn't about performance but about operational pain, since having a node down for multiple hours for updates or issues means:
• You won't have HA for several hours while upgrading or dealing with issues, which is high risk. Or you'll have to pay for another node while the process happens, which means adding several hours and the cost of spinning it up. In the hundreds of GBs, that is costly.
• The resync between nodes takes time.
• If you need to restore from the primary source's backup for any catastrophic reason, that means at the very least several hours to several days of work and downtime.
• Any change that adds or modifies fields will be a painful and very long process. If you want to do it following a parallel-changes approach (also known as expand and contract), it will take a long time: you first add the new field and update all the records, then deploy the change to start using that field, and then drop the old one, which is several hours per step (see the sketch below).
With 32-64 GB for 3 very big indexed fields and millions of records, that's the sweet spot, as it's one hour of downtime and you can make changes within the work day, given the one-core-per-index limit (3 cores at most in use in this case). In his case, with 5-10 indexed fields, he would benefit from 8-16 cores, so he can probably go to 128 GB or even 256 GB configurations. But it needs testing. I just wanted to share the whole experience and the learnings we already have.
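As an illustration of that expand-and-contract flow on a Typesense collection (a sketch only: the collection and field names are hypothetical, and it uses Typesense's in-place schema alteration rather than whatever exact process was followed here):

```python
import typesense

client = typesense.Client({
    "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
    "api_key": "xyz",
    "connection_timeout_seconds": 5,
})

# Step 1 (expand): add the new field to the existing schema. Adding an indexed
# field triggers a reindex of that field, which is the slow part described above.
client.collections["articles"].update({
    "fields": [{"name": "summary_v2", "type": "string", "optional": True}]
})

# Step 2: backfill the new field on existing documents ("emplace" leaves the
# other fields of each document untouched).
client.collections["articles"].documents.import_(
    [{"id": "123", "summary_v2": "Backfilled summary text"}],
    {"action": "emplace"},
)

# Step 3 (contract): once the application reads summary_v2, drop the old field.
client.collections["articles"].update({
    "fields": [{"name": "summary", "drop": True}]
})
```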
The more, smaller collections the better
Jason Bosco
> The more, smaller collections the better
In general yes - this is better.
> You won't have HA while upgrading... Or you'll have to pay for another node while the process happens
When you set up HA, you'll be spinning up 3 nodes and running them 24x7. Then when you upgrade, you'll rotate one node at a time. So even if that single node takes a few hours for a large dataset, the other two nodes will still continue serving traffic while the 3rd one is re-indexing, so there won't be any downtime when upgrading in an HA setup. Once all 3 nodes are stable again, you'd rotate the 2nd node, wait for it to reindex, and then rotate the 3rd node. During this whole operation the cluster will still be healthy and serving requests with the other two nodes. Besides that point, yes - larger collections will take more time for schema changes, reindexing, syncing between nodes during rotations, etc. So it's best to keep collections smaller when possible.
Óscar Vicente
No, what I meant is that while you are upgrading a node, if another goes down for whatever reason you are going to have a bad time. You will be handling the load of 3 nodes on 2, making it more likely that you end up with only one. And yes, that's something that affects any kind of clustering, and that's why you try to avoid expensive or long operations as much as you can. But you need to be aware and ready in case it happens.