#community-help

Estimating RAM Requirements for Indexing Documents

TL;DR: Epi asked how index size relates to document size, and about RAM requirements for their dataset. Kishore Nallan suggested indexing a sample and extrapolating the results, and confirmed that Typesense is suitable for indexing large documents such as Wikipedia articles.

Dec 04, 2022 (12 months ago)
Epi
03:06 AM
How large are indexes relative to document sizes, on average (for plain English documents with >5k tokens per document)? Are there benchmarks for such a scenario, or any anecdotal data?

I'm trying to figure out the RAM requirements for my dataset, which will contain, say, 10M-100M documents ranging from 5k to 50k tokens each.
Kishore Nallan
11:28 AM
It's difficult to answer this because everything depends on the shape of the data. Repetitive tokens compress well due to the inverted index used for search. I recommend indexing a sample of 1M documents and extrapolating that for the full dataset.
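The sample-and-extrapolate approach above can be sketched as a simple back-of-the-envelope calculation. This is a hypothetical helper, not part of Typesense: it assumes RAM scales roughly linearly with document count, which the thread notes is not guaranteed (repetitive tokens compress better in the inverted index), so a safety factor is applied. The sample's actual memory usage can be observed on a running Typesense server, e.g. via its `/metrics.json` endpoint.

```python
# Sketch of "index a sample, then extrapolate" (assumption: roughly
# linear RAM growth with document count, padded by a safety factor).

def estimate_index_ram(sample_docs: int, sample_ram_bytes: int,
                       total_docs: int, safety_factor: float = 1.3) -> int:
    """Linearly extrapolate the RAM used by a sample index to the full
    dataset, padded by a safety factor for non-uniform documents."""
    if sample_docs <= 0:
        raise ValueError("sample_docs must be positive")
    per_doc = sample_ram_bytes / sample_docs
    return int(per_doc * total_docs * safety_factor)

# Example: a 1M-document sample that used 4 GiB, extrapolated to 100M docs.
estimate = estimate_index_ram(1_000_000, 4 * 1024**3, 100_000_000)
print(f"Estimated RAM: {estimate / 1024**3:.0f} GiB")  # prints "Estimated RAM: 520 GiB"
```

Treat the result as a rough upper bound for capacity planning, then validate against the real memory footprint of the sample index before committing to hardware.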
Dec 05, 2022 (12 months ago)
Epi
04:15 PM
How well would Typesense work for a dataset like Wikipedia articles?
Dec 06, 2022 (12 months ago)
Kishore Nallan
02:55 AM
We do have several people who index large documents into Typesense.