#community-help

Optimal Indexing and Querying of Large Documents

TLDR: Robert asks about best practices for indexing large documents and the ideal size of sub-documents. Jason suggests starting with 10K words in a single document and performance testing, then splitting into smaller chunks with group_by if needed.

Solved
Sep 07, 2023
Robert
05:46 PM
I remember having a conversation about this a year ago, but I can't seem to find it in Slack search. The documentation still doesn't seem to have a section about large documents either (I recall the old docs at least had a reference to an Algolia page about large documents).

It would be super helpful to have a section in the docs about indexing large documents. The best practice, of course, is to split large documents into smaller chunks/documents and then search the large "document" by doing a group_by on the root document ID.

My question is: do you have an ideal range for how big the sub-document string sizes should be for optimal performance (both query time and quality of matches)? Should I be breaking up a large document into chunks of, say, 2K characters?

Or can I put 10k words into a string field and index it?
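
A minimal sketch of the chunking pattern described above, in Python; the root_doc_id field name and the 2,000-character chunk size are illustrative assumptions, not recommendations:

```python
# Sketch only: split one large document into sub-documents that share a
# root document ID, so results can later be grouped back together.

def chunk_document(doc_id: str, text: str, chunk_size: int = 2000) -> list[dict]:
    """Split a large document into sub-documents sharing a root doc ID."""
    chunks = []
    for i, start in enumerate(range(0, len(text), chunk_size)):
        chunks.append({
            "id": f"{doc_id}-{i}",     # unique ID per sub-document
            "root_doc_id": doc_id,     # group_by this field at query time
            "content": text[start:start + chunk_size],
        })
    return chunks
```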
Jason
06:01 PM
The approach would depend on your performance requirements and the total number of docs as well.

But in general, fewer words in a single document will be more performant than more words.

So I would recommend starting by putting 10K words in a single document, measuring performance, and then splitting the large document into smaller chunks and using group_by as required.
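
A rough sketch of the kind of measurement Jason suggests; the client.search call, group_by parameter, and sample_queries list are hypothetical placeholders for whatever client and queries you actually use:

```python
import time
from statistics import mean

def time_queries(search_fn, queries, runs: int = 5) -> float:
    """Mean wall-clock latency in ms over `runs` repetitions of each query.
    `search_fn` is whatever callable issues the search against your index."""
    samples = []
    for q in queries:
        for _ in range(runs):
            start = time.perf_counter()
            search_fn(q)
            samples.append((time.perf_counter() - start) * 1000)
    return mean(samples)

# Hypothetical usage, comparing the two indexing approaches:
# whole_ms   = time_queries(lambda q: client.search(q), sample_queries)
# chunked_ms = time_queries(lambda q: client.search(q, group_by="root_doc_id"), sample_queries)
```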
Robert
06:02 PM
Gotcha, so there is no internal limit other than the performance implications.
Jason
06:08 PM
Correct