#community-help

Discussion on Scaling Performance with Large Datasets

TLDR: Shouvik inquired about the performance of Typesense with a large dataset, 150GB in size. Masahiro noted that sufficient RAM is crucial, while Kishore Nallan suggested removing common stop words from the dataset to reduce memory use.

May 02, 2021 (33 months ago)
Shouvik
01:40 PM
Also wondering about scalability. Saw your demo in Haystack Live, which was awesome btw! The demo where you searched 28 million book title records was pretty fast, but a little slower (200ms vs 20ms) than the 2-3 million recipe records. For our use case we're looking to move digitized textbook data; we have over 5,000 textbooks. These translate to over 250 million records in our DB (a record for each paragraph), about 150GB of data. Do you guys have any metrics on how performant Typesense will be on such a large dataset?
May 03, 2021 (33 months ago)
Masahiro
06:33 AM
In my case, about 3M documents use 1.5GB of RAM. Performance is not a problem; it's fast enough.
I think the problem will be your RAM capacity. Even a simple calculation suggests around 120GB of RAM is needed.
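For context, a minimal sketch of the back-of-the-envelope extrapolation behind that figure, assuming RAM usage scales roughly linearly with document count (a simplification; actual usage depends on which fields are indexed and how long the documents are):

```python
# Rough extrapolation from Masahiro's observation (3M docs ~ 1.5GB RAM)
# to Shouvik's 250M-record corpus, assuming linear scaling.
ram_per_doc_gb = 1.5 / 3_000_000            # roughly 0.5 KB of RAM per document
estimated_ram_gb = ram_per_doc_gb * 250_000_000
print(f"Estimated RAM: ~{estimated_ram_gb:.0f} GB")  # ~125 GB, in line with the ~120GB figure above
```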

Kishore Nallan
06:40 AM
Yes, the trade-off is speed vs memory. While we have certain items on our backlog to decrease memory consumption, ultimately you cannot search instantly on a large number of records without adequate memory.

One thing that can help is removing commonly occurring stop words from the dataset -- this should reduce memory consumption with negligible impact on search experience.
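To illustrate that suggestion, here is a minimal preprocessing sketch that strips common stop words from each record before it is sent for indexing. The stop-word list and the "text" field name are hypothetical, and this is plain preprocessing on the caller's side, not any Typesense API:

```python
# Hedged sketch: remove common English stop words from a record's text
# field before indexing, to shrink the index's memory footprint.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "it"}

def strip_stop_words(record: dict) -> dict:
    """Return a copy of the record with stop words removed from its 'text' field."""
    cleaned = dict(record)  # avoid mutating the caller's dict
    tokens = cleaned["text"].split()
    cleaned["text"] = " ".join(t for t in tokens if t.lower() not in STOP_WORDS)
    return cleaned

# Usage: preprocess each paragraph record before indexing.
doc = {"id": "1", "text": "The quick brown fox jumps over the lazy dog"}
print(strip_stop_words(doc)["text"])  # "quick brown fox jumps over lazy dog"
```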
