# community-help
g
I’m really impressed with the Typesense search engine so far. Until now, I’ve been working with fewer than 100,000 items, but I’m planning to scale up to several million. I’d love to hear from anyone who has successfully indexed around 60,000,000 items; any insights or lessons learned would be greatly appreciated. Thanks a lot!
a
Hi Gabriel, glad to hear you're impressed! There’s virtually no limit to the number of items you can index, as long as your cluster has enough RAM to hold them. In general, though, the larger the collection, the more CPU cycles are required for search and indexing, so some queries might slow down by, say, a few hundred milliseconds once a single collection holds over 50M docs. At that stage you want to consider sharding your documents across multiple collections to maintain fast performance. A couple of things to keep in mind:
• If you have display-only fields (e.g. image URLs), you can further improve write performance by leaving those fields out of the collection schema and simply sending them in the documents when importing into Typesense. Any fields present in the documents but not mentioned in the collection schema are just stored on disk and won't take up RAM or CPU cycles building an index. When a document is a hit for a search query, we'll fetch its display-only fields from disk and include them in the API response.
• Use the bulk import API to efficiently load data into your newly created cluster. I'd recommend starting with a batch size of 1,000 documents per import API call and a concurrency of, say, N-1 parallel import API calls, where N is the number of CPU cores in your cluster (see the sketch after this message).
• You might see 503s when importing into Typesense, which is the built-in back-pressure mechanism. Make sure you handle those in your indexing pipeline as described here
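For reference, a minimal sketch of that batched import pattern using the typesense Python client. The collection name ("items"), host, API key, and retry/backoff numbers are placeholders, and the 503 exception name may differ between client versions, so treat this as an illustration rather than a drop-in script.
```python
import time
import typesense

# Placeholder connection details for a Typesense Cloud cluster
client = typesense.Client({
    "nodes": [{"host": "xxx.a1.typesense.net", "port": "443", "protocol": "https"}],
    "api_key": "YOUR_ADMIN_API_KEY",
    "connection_timeout_seconds": 60,
})

BATCH_SIZE = 1000  # documents per import API call, as suggested above


def import_batch(docs, max_retries=5):
    """Import one batch of documents, backing off when Typesense
    returns 503 (its built-in back-pressure signal)."""
    for attempt in range(max_retries):
        try:
            results = client.collections["items"].documents.import_(
                docs, {"action": "upsert"}
            )
            # import_ returns one result per document; surface any per-doc failures
            failed = [r for r in results if not r.get("success")]
            if failed:
                print(f"{len(failed)} documents failed in this batch")
            return
        except typesense.exceptions.ServiceUnavailable:
            time.sleep(2 ** attempt)  # wait and retry on back-pressure
    raise RuntimeError("Batch still failing after retries")

# Feed your data to import_batch() in chunks of BATCH_SIZE, and run N-1 such
# workers in parallel, where N is the number of CPU cores in the cluster.
```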
g
Hello! Thanks for all these details
s
@Gabriel Delattre Consider checking the limits; for instance, for Groups it is 250 and for Hits per Group it is 100. These are soft limits, though. Also, there are no aggregation operations like Sum, Average, etc. for Grouped Hits individually, which might require extra processing on the client side.
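To make those two limits concrete, here is a hedged sketch of a grouped, faceted search using the standard group_by / group_limit / facet_by search parameters. The collection name ("items") and field names ("name", "category", "product_id") are made up for illustration, and the client config mirrors the import sketch above.
```python
import typesense

# Placeholder client config (same shape as in the import sketch above)
client = typesense.Client({
    "nodes": [{"host": "xxx.a1.typesense.net", "port": "443", "protocol": "https"}],
    "api_key": "YOUR_SEARCH_API_KEY",
    "connection_timeout_seconds": 10,
})

# group_limit controls how many hits come back per group; the 100 hits-per-group
# and 250 groups figures mentioned above are the soft ceilings being discussed.
search_params = {
    "q": "*",
    "query_by": "name",
    "facet_by": "category",
    "group_by": "product_id",
    "group_limit": 3,
}
results = client.collections["items"].documents.search(search_params)

for group in results["grouped_hits"]:
    # group_key holds the value(s) of the group_by field(s) for this group
    print(group["group_key"], "->", len(group["hits"]), "hits returned")
```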
g
hmmm
I have a lot of groups
Hits per group?
Per instance or per collection?
s
Hits per Group, i.e. the “group_by” parameter.
g
Sorry, I don't get it.
Do you mean the returned results will be cut off?
s
Yes, they will be cut off
g
Why?
s
The founders can explain better.
It's a soft limit, though.
g
Hmmm, this worries me a bit. I need to facet and group by with multiple items that belong to a category.
For instance, I've got 100,000 items in this category; I will group_by id and then use the faceted results.
hmm
s
Yeah, we had a similar use case and are now stuck, not due to the limit, since it can be increased, but due to the overhead of doing a few things on the client side, like aggregating the results of a particular group, which we assumed would be trivial and part of the product, as it is in MongoDB.
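For what it's worth, that client-side aggregation is typically a short loop over the grouped_hits in the search response. A minimal sketch, assuming the results object from the grouped search above and a numeric "price" field on each document (both are assumptions for illustration):
```python
from statistics import mean

# Aggregate per group on the client, since Typesense returns grouped hits
# but no per-group Sum/Average. Note this only sees the hits actually
# returned (up to group_limit), so it's a partial aggregate; that is why
# the hits-per-group limit matters for this use case.
per_group_avg_price = {}
for group in results["grouped_hits"]:
    prices = [hit["document"]["price"] for hit in group["hits"]]
    # group_key is a list (one value per group_by field); use a tuple as the key
    per_group_avg_price[tuple(group["group_key"])] = mean(prices)

print(per_group_avg_price)
```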
g
That worries me even more 🙂
s
We asked for the feature too, but it doesn't seem to be on their priority list either!
k
> For instance, I've got 100,000 items in this category; I will group_by id and then use the faceted results
By id, I presume it's some type of product ID?
g
yes
k
Product ID tends to be high cardinality, right? Do you expect to have 100s of records in a group when grouped by this id?
g
yes
even more
k
You can then increase the soft limit. We parameterized it in the v29 RC build and will soon be adding it as a configuration option on the Typesense Cloud cluster configuration page as well.
We also heavily optimized group-by in the v29 RC build, so it's quite efficient now.
g
I’m running on your cloud :)
Could you tell me how to test?
f
You can update your cluster's Typesense version by going to the "Cluster Configuration" page and hitting the "Modify" button. There will be a select box for it. The latest one is v29.0.rc23.
k
We still need to support modifying this group limit on Typesense Cloud. We will be adding this support today.
g
Ok, great, will test.
Circling back on this project: we are trying to run the ingestion of our 17 million items, but one of our clusters is running out of space.
Should we increase the vCPUs temporarily to ingest and then scale down, since we have little traffic for now?
a
Hi Gabriel, You can temporarily scale up vCPUs to speed up ingestion, but that alone won’t solve the disk space issue you're hitting. Disk space in Typesense is tied to RAM — each cluster comes with disk equal to 5x the RAM. So if your current cluster doesn’t have enough disk to hold all 17 million items, you’ll need to scale up the RAM, which in turn increases the disk quota. After ingestion, if your traffic is low and you don’t need as much compute, you can scale the cluster back down.