# community-help
s
... I think I answered my own query. These models are good for very small chunks of text, nothing that comes close to the OpenAI 8k+ token input limit, which is very much necessary for larger documents (even when summarised). Unfortunately, using an external tokenizer results in awful performance when running the Typesense search on the cloud (not so when running the same external tokenizer on my own server). I think this should be looked at; it renders the tool not very scalable 😢
k
Regardless of how long a context the embedding model supports, remember that there is only a fixed number of dimensions. It's not possible to encode large pieces of text within a small number of dimensions. 8K text in 1500 dimensions is not going to perform well at all.
The way people solve this problem is via chunking. You will find plenty of material on this online. You have to split larger documents into smaller chunks and have a `parent_id` field which maps a chunk back to its parent document id. In Typesense you can then do a `group_by` to group matching chunks under a single parent id.
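A minimal sketch of that pattern in TypeScript with the typesense-js client, assuming Typesense's built-in auto-embedding. The collection name `doc_chunks`, the fields `parent_id` / `content` / `embedding`, the model `ts/all-MiniLM-L12-v2`, the word-count splitter, and the example query are all illustrative choices, not anything specified in the thread:

```ts
import Typesense from 'typesense';

// Hypothetical cluster details; substitute your own host and key.
const client = new Typesense.Client({
  nodes: [{ host: 'xyz.a1.typesense.net', port: 443, protocol: 'https' }],
  apiKey: 'YOUR_API_KEY',
});

// Naive splitter: ~200 words per chunk. Real chunking would respect sentence or
// paragraph boundaries and stay comfortably under the embedding model's token limit.
function chunkText(text: string, wordsPerChunk = 200): string[] {
  const words = text.split(/\s+/);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += wordsPerChunk) {
    chunks.push(words.slice(i, i + wordsPerChunk).join(' '));
  }
  return chunks;
}

async function main() {
  // Collection using auto-embedding. parent_id is declared as a facet so it can be
  // used with group_by; the model name is just one of the built-in options.
  await client.collections().create({
    name: 'doc_chunks',
    fields: [
      { name: 'parent_id', type: 'string', facet: true },
      { name: 'content', type: 'string' },
      {
        name: 'embedding',
        type: 'float[]',
        embed: { from: ['content'], model_config: { model_name: 'ts/all-MiniLM-L12-v2' } },
      },
    ],
  });

  // Index one parent document as many small chunk documents.
  const parentId = 'doc_42';
  const fullText = '...the long document text goes here...';
  const chunkDocs = chunkText(fullText).map((content, i) => ({
    id: `${parentId}_${i}`,
    parent_id: parentId,
    content,
  }));
  await client.collections('doc_chunks').documents().import(chunkDocs, { action: 'upsert' });

  // At query time, group matching chunks back under their parent document.
  const results = await client.collections('doc_chunks').documents().search({
    q: 'refund policy for enterprise plans',
    query_by: 'embedding', // semantic match against the auto-embedded chunks
    group_by: 'parent_id', // one group per parent document
    group_limit: 1,        // keep only the best-scoring chunk per parent
  });
  console.log(JSON.stringify(results, null, 2));
}

main().catch(console.error);
```

Since `group_by` only works on facet fields, `parent_id` is created with `facet: true`, and `group_limit: 1` keeps just the best-matching chunk per parent so each document appears once in the results.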
s
Do I understand it correctly that the model truncates gracefully? So even if my texts are longer, it'll truncate them without failure, right? The problem on my end right now is that I have a tight (hours) amount of time to make changes, and indexing alone is going to take longer than I have available. So I'd plan to re-index the whole thing with chunking at a later moment and accept the probably degraded search for now. As long as it doesn't fatal out, it would be ok.
k
Yes, we handle truncation.