j
Coming soon!
w
So this will be really helpful for: (1) importing documents faster, and (2) scaling search request volume. Is that about right?
j
Exactly, specifically when using built-in models
w
Right. And I guess it would also enable adding some of the other, larger models as well? One of our Typesense collections contains documents for entire research PDFs (in text format). e5-small's max token length is 512, right? So about a written page. Ideally we could choose a larger token length for this collection than for the others.
k
Most embedding models only support token lengths of 512 because beyond that the meaning of the embedding gets diluted: you have only a few hundred dimensions to encode the semantic meaning of the data. So to embed large texts you have to split them up.
E.g., 1 page per document, with a `parent_doc_id` field so that results can be grouped at query time.
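The split-then-regroup idea above could look something like this minimal sketch. Note the names (`chunkDocument`, `groupByParent`) and the word-budget heuristic are illustrative assumptions, not a real client API, and the real limit depends on the model's tokenizer, not a word count:

```typescript
// Hypothetical sketch: split a long document into roughly page-sized chunks
// that each fit within an embedding model's 512-token limit, tagging every
// chunk with a parent_doc_id so search hits can be regrouped per source doc.

interface Chunk {
  parent_doc_id: string;
  chunk_index: number;
  text: string;
}

// Naive word-based splitter. Real token counts depend on the model's
// tokenizer, so the 512-token limit is approximated here by a word budget.
function chunkDocument(docId: string, text: string, maxWords = 400): Chunk[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: Chunk[] = [];
  for (let i = 0; i < words.length; i += maxWords) {
    chunks.push({
      parent_doc_id: docId,
      chunk_index: chunks.length,
      text: words.slice(i, i + maxWords).join(" "),
    });
  }
  return chunks;
}

// Regroup chunk-level search hits into one bucket per parent document,
// mirroring what a group-by-parent_doc_id query would do server-side.
function groupByParent(hits: Chunk[]): Map<string, Chunk[]> {
  const grouped = new Map<string, Chunk[]>();
  for (const hit of hits) {
    const list = grouped.get(hit.parent_doc_id) ?? [];
    list.push(hit);
    grouped.set(hit.parent_doc_id, list);
  }
  return grouped;
}
```

Each chunk is then indexed as its own document, and at query time the chunk hits are collapsed back to their parent so one long PDF doesn't crowd out the result list.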
w
Ok, thank you, that makes sense.