Query on Large Text Document Embedding in OpenAI with Typesense
TLDR Mauricio asked whether OpenAI and Typesense could handle embeddings for large text documents that exceed OpenAI's token limit. Kishore Nallan advised against embedding such large strings, since it reduces embedding quality, and said chunking should be handled in application logic because Typesense does not split text automatically.
Sep 07, 2023
Mauricio
02:03 PM
Mauricio
02:04 PM
Kishore Nallan
02:06 PM
A) Even though OpenAI allows 8K tokens, it's not a good idea to embed such large strings, because it will reduce the quality of the embeddings.
B) We don't automatically split the text; any text over the limit is ignored.
Since chunking is domain-specific, our current thinking is probably to leave it to application logic.
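[Editor's note: since Typesense leaves chunking to application logic, here is a minimal sketch of one way to do it in Python: split the text into overlapping token windows and index each chunk as its own Typesense document, so every embedded string stays under the model's limit. The collection name `docs`, the field names, the chunk sizes, and the `index_document` helper are all illustrative assumptions, not from the thread.]

```python
# Sketch: application-side chunking before indexing into Typesense.
# Assumes `pip install tiktoken typesense`, a Typesense server on
# localhost:8108, and a hypothetical "docs" collection whose schema
# declares an auto-embedding field over `chunk_text`.
import tiktoken
import typesense

# Tokenizer used by OpenAI's embedding models (e.g. text-embedding-ada-002).
enc = tiktoken.get_encoding("cl100k_base")

def chunk_text(text: str, max_tokens: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of at most max_tokens tokens."""
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunks.append(enc.decode(tokens[start:end]))
        if end == len(tokens):
            break
        start = end - overlap  # overlap preserves context across boundaries
    return chunks

client = typesense.Client({
    "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
    "api_key": "xyz",
    "connection_timeout_seconds": 5,
})

def index_document(doc_id: str, text: str) -> None:
    # One Typesense document per chunk: Typesense then generates an
    # embedding for each chunk individually, so none exceeds the limit.
    chunk_docs = [
        {"id": f"{doc_id}-{i}", "parent_id": doc_id, "chunk_text": chunk}
        for i, chunk in enumerate(chunk_text(text))
    ]
    client.collections["docs"].documents.import_(chunk_docs, {"action": "upsert"})
```

At query time, results can be grouped back to the parent document (e.g. with Typesense's `group_by` on `parent_id`), which is one common way to reassemble chunked hits.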
Mauricio
02:10 PM
Similar Threads
Finding Similar Documents Using JSON and Embeddings
Manish wants to find similar JSON documents and asks for advice. Jason suggests using Sentence-BERT with vector query and provides guidance on working with OpenAI embeddings and Typesense. They discuss upcoming Typesense features and alternative models.
Optimum Cluster for 1M Documents with OpenAI Embedding
Denny inquired about the ideal cluster configuration for handling 1M documents with OpenAI embeddings. Jason recommended a specific configuration, explained how to calculate record size, and clarified the factors affecting embedding generation speed and the conditions that trigger calls to OpenAI.
Errors in Batch Import with Typesense and OpenAI API
Gustavo encountered errors when importing documents into a collection. After discussion with Jason, it was concluded that the issue stemmed from OpenAI API's handling of batch requests with problematic documents, and improvements to Typesense's error messages and handling were suggested.