Query on Large Text Document Embedding in OpenAI with Typesense
TLDR Mauricio asked whether OpenAI and Typesense could handle embeddings for large text documents that exceed OpenAI's token limit. Kishore Nallan recommended against embedding such large strings, since it reduces embedding quality, and advised handling chunking in application logic because Typesense does not split text automatically.
Sep 07, 2023
Mauricio
02:03 PM

Mauricio
02:04 PM

Kishore Nallan
02:06 PM
A) Even though OpenAI allows 8K tokens, it's not a good idea to embed such large strings, because it will reduce the quality of the embeddings.
B) We don't automatically split the text; any text over the limit is ignored.
Since chunking is domain-specific, our current thinking is to probably leave it to application logic.
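
Since chunking is left to the application, here is a minimal sketch of what that could look like in Python. It assumes the `typesense` and `tiktoken` client libraries; the collection name `docs`, the field names, and the 500-token chunk size are illustrative choices, not anything specified in this thread:

```python
import tiktoken
import typesense

# Illustrative chunk size: well under OpenAI's 8K-token limit, since
# smaller chunks tend to produce higher-quality embeddings.
MAX_CHUNK_TOKENS = 500

# Tokenizer used by OpenAI's text-embedding models.
enc = tiktoken.get_encoding("cl100k_base")

def chunk_text(text: str, max_tokens: int = MAX_CHUNK_TOKENS) -> list[str]:
    """Split text into consecutive chunks of at most max_tokens tokens."""
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

client = typesense.Client({
    "nodes": [{"host": "localhost", "port": 8108, "protocol": "http"}],
    "api_key": "xyz",
})

def index_document(doc_id: str, text: str) -> None:
    # One Typesense document per chunk; a collection configured with an
    # auto-embedding field on `content` would then embed each chunk
    # individually instead of ignoring one oversized string.
    for i, chunk in enumerate(chunk_text(text)):
        client.collections["docs"].documents.create({
            "id": f"{doc_id}-{i}",
            "parent_id": doc_id,  # lets chunks be grouped back to the source doc
            "content": chunk,
        })
```

At query time, a `group_by` on `parent_id` (declared as a facet field in the collection schema) can collapse chunk-level hits back to their source documents.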
Mauricio
02:10 PM
Similar Threads
Finding Similar Documents Using JSON and Embeddings
Manish wants to find similar JSON documents and asks for advice. Jason suggests using Sentence-BERT with vector query and provides guidance on working with OpenAI embeddings and Typesense. They discuss upcoming Typesense features and alternative models.
Optimum Cluster for 1M Documents with OpenAI Embedding
Denny inquired about the ideal cluster configuration for handling 1M documents with OpenAI embeddings. Jason recommended a specific configuration, explained how to calculate record size, and clarified the factors affecting embedding generation speed and the conditions that trigger OpenAI embedding generation.
Issues with Embeddings on Collection with 80K Documents
Samuel experienced issues when enabling embeddings on a large collection, leading to an unhealthy cluster. Kishore Nallan suggested rolling back to a previous snapshot, advised on memory calculations for OpenAI embeddings, and confirmed that creating a new cluster should solve the problem.