#community-help

Query on Large Text Document Embedding in OpenAI with Typesense

TLDR Mauricio asked whether OpenAI and Typesense could handle embedding large text documents that exceed OpenAI's token limit. Kishore Nallan advised against embedding such large strings because it reduces embedding quality, and said chunking should be handled in application logic since Typesense does not split text automatically.


Sep 07, 2023
Mauricio
02:03 PM
Another quick question. We have some pretty large text documents that we want to avoid breaking up as much as possible. They exceed OpenAI’s 8k token limit, so we can’t generate an embedding in a single API call. If we use the automatic embedding generation with the OpenAI embeddings, will Typesense error out? We were thinking of chunking the document, generating an embedding for every chunk, and then averaging the embeddings to get a final embedding. Could Typesense help us out there?
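The chunk-and-average idea described here could be sketched roughly as follows. This is only an illustration, not something Typesense or OpenAI provides: `average_embeddings` and the `embed` callable are hypothetical names, word count stands in as a rough proxy for token count, and the length weighting and final normalization are assumptions about how one might combine the vectors.

```python
import math
from typing import Callable, List, Sequence


def average_embeddings(
    chunks: Sequence[str],
    embed: Callable[[str], List[float]],  # hypothetical: wraps whatever embedding API is used
) -> List[float]:
    """Return the length-weighted mean of per-chunk embeddings, L2-normalized."""
    if not chunks:
        raise ValueError("need at least one chunk")

    vectors = [embed(chunk) for chunk in chunks]
    # Assumption: weight each chunk by its word count (a rough proxy for tokens).
    weights = [len(chunk.split()) for chunk in chunks]
    total = sum(weights)
    if total == 0:
        raise ValueError("chunks contain no words")

    dim = len(vectors[0])
    mean = [
        sum(w * v[i] for w, v in zip(weights, vectors)) / total
        for i in range(dim)
    ]

    # Normalize so the result is comparable under cosine similarity.
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm else mean
```

The resulting vector would then be supplied manually on the document instead of relying on automatic embedding generation.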
Mauricio
02:04 PM
I’m pretty sure we could do that ourselves and manually add the embeddings for the large sections, but I wanted to know if we could avoid that work, to better estimate the migration effort.
Kishore Nallan
02:06 PM
A couple of things:

A) Even though OpenAI allows 8K tokens, it's not a good idea to embed such large strings because it will reduce the quality of the embeddings.

B) We don't automatically split the text. Any text over the limit is ignored.

Since chunking is domain-specific, our current thinking is to probably leave it to application logic.
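As a minimal sketch of what "leaving it to application logic" might look like, the snippet below splits a document into overlapping word-based chunks and builds one record per chunk. Everything here is an assumption for illustration: the helper names, the `parent_id`, `chunk_index`, and `text` field names, and the default sizes are hypothetical, and word count is only a rough stand-in for the model's token limit.

```python
from typing import Dict, List


def chunk_words(text: str, size: int = 200, overlap: int = 20) -> List[str]:
    """Split text into chunks of at most `size` words, sharing `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")

    words = text.split()
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def to_chunk_documents(
    doc_id: str, text: str, size: int = 200, overlap: int = 20
) -> List[Dict[str, object]]:
    """Build one record per chunk, each pointing back to its source document."""
    return [
        {
            "id": f"{doc_id}-{i}",   # hypothetical id scheme
            "parent_id": doc_id,     # hypothetical field linking chunks to the source
            "chunk_index": i,
            "text": chunk,           # the field an embedding would be generated from
        }
        for i, chunk in enumerate(chunk_words(text, size, overlap))
    ]
```

Each record stays well under the embedding limit, so no text is silently ignored, and results can be traced back to the original document through `parent_id`.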


Mauricio
02:10 PM
Gotcha! Thanks