Optimal Indexing and Querying of Large Documents
TLDR Robert asks about the best practice for indexing large documents and the ideal size of subdocuments. Jason suggests experimenting with 10K words in a single document and performance testing.
Sep 07, 2023
Robert
05:46 PM It would be super helpful to have a section in the documentation about indexing large documents. The best practice, of course, is to split a large document into smaller chunks (sub-documents) and then search the large "document" by doing a group_by on the root document ID.
My question is: do you have an ideal range for how big the sub-document strings should be for optimal performance (both query time and quality of matches)? Should I be breaking up a large document into chunks of, say, 2K characters?
Or can I put 10k words into a string field and index it?
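For illustration, the chunking approach described above could be sketched roughly as follows. This is a minimal sketch, not an official recommendation: the field names `root_document_id`, `chunk_index`, and `content`, the helper name `chunk_document`, and the 2,000-character default are all assumptions made for the example.

```python
from typing import Dict, List


def chunk_document(root_id: str, text: str, max_chars: int = 2000) -> List[Dict[str, object]]:
    """Split text into chunk documents of at most max_chars, breaking on whitespace.

    Field names here are hypothetical. A single word longer than max_chars
    is kept whole, so that one chunk may exceed the limit.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for word in text.split():
        # Account for the joining space when the chunk already has words.
        extra = len(word) + (1 if current else 0)
        if current and current_len + extra > max_chars:
            chunks.append(" ".join(current))
            current, current_len = [], 0
            extra = len(word)
        current.append(word)
        current_len += extra
    if current:
        chunks.append(" ".join(current))

    # Every chunk carries the root document id so results can be grouped later.
    return [
        {
            "id": f"{root_id}_{i}",
            "root_document_id": root_id,
            "chunk_index": i,
            "content": chunk,
        }
        for i, chunk in enumerate(chunks)
    ]
```

Each returned dictionary would be indexed as its own document. Splitting on whitespace rather than at a fixed character offset avoids cutting words in half, which would otherwise hurt match quality at chunk boundaries.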
Jason
06:01 PM But in general, a document with fewer words will be more performant than one with more words.
So I would recommend starting by putting 10K words in a single document and measuring performance, then splitting the large document into smaller chunks and using group_by as required.
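A rough sketch of the measure-then-group approach suggested above might look like this. `q`, `query_by`, `group_by`, and `group_limit` are Typesense search parameters; the field names `content` and `root_document_id` and both helper names are assumptions for the example, and the grouped field would need to be declared as a facet in the collection schema.

```python
import time
from statistics import median
from typing import Callable, Dict


def grouped_search_params(query: str) -> Dict[str, object]:
    """Build search parameters that collapse chunk hits into one group per root document."""
    return {
        "q": query,
        "query_by": "content",            # hypothetical chunk text field
        "group_by": "root_document_id",   # hypothetical field, must be a facet
        "group_limit": 1,                 # keep only the best chunk per root document
    }


def median_latency_ms(
    run_search: Callable[[Dict[str, object]], object],
    params: Dict[str, object],
    runs: int = 20,
) -> float:
    """Call run_search repeatedly and return the median wall-clock latency in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_search(params)
        timings.append((time.perf_counter() - start) * 1000)
    return median(timings)
```

With the official Python client, `run_search` could be something like `lambda p: client.collections["chunks"].documents.search(p)`, where the collection name `chunks` is hypothetical. Running the same helper against a collection of single 10K-word documents and against a chunked collection gives a like-for-like latency comparison.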
Similar Threads
Discussing Large Document Indexing in Word Files
Robert asked about indexing large Word files. Kishore Nallan advised splitting them into smaller documents for improved performance.
Discussing Document Indexing Speeds and Typesense Features
Thomas asks about the speed of indexing and associated factors. The conversation reveals that larger batch sizes and NVMe disk usage can improve speed, but the index size is limited by RAM. Jason shares plans on supporting nested fields, and they explore a solution for products in multiple categories and catalogs.
Estimating RAM Requirements for Indexing Documents
Epi asked about index sizes in relation to document sizes and RAM requirements for their dataset. Kishore Nallan suggested indexing a sample and extrapolating results, and confirmed suitability for indexing large documents like Wikipedia articles in Typesense.