Combining Embedding Feature with Highlights and Large Documents
TLDR: Walter had questions about Typesense's upcoming embedding feature, including highlights, embedding models, handling large documents, and its impact on cloud pricing. Jason provided answers and invited Walter to test an alpha build.
May 30, 2023
1. Will it be possible to combine this with highlights? I know that's tricky but if anyone can figure it out it's you guys 😉
2. What are the options for embedding models? Is the embedding interface generic enough to use external tools like GCP APIs? Or does it depend on what you guys ship with it?
3. How would we tackle large document embeddings? I'm guessing out of the box you'll just create vectors for the entire document, but I'm trying to think through how we can segment large documents (some of our documents represent large PDF reports).
4. How does using embeddings affect Typesense Cloud pricing?
5. How do synonyms interact with the embeddings?
A few questions, I know... but not urgent, just interested in how you're thinking about it.
Thank you guys for all your hard work!
Haha! If you do a hybrid search (keyword + semantic search combined), then we will highlight the keywords.
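As a rough illustration of what a hybrid query might look like, here is a minimal sketch that only builds the search parameters. It assumes the syntax from the later released Typesense docs, where a hybrid search lists both a keyword field and an auto-embedding field in `query_by`; the alpha build discussed here may differ. The field names `title` and `embedding` and the helper `hybrid_search_params` are hypothetical.

```python
def hybrid_search_params(query, keyword_fields, embedding_field="embedding"):
    """Build search parameters that query keyword fields and an embedding field together.

    Assumption: listing an auto-embedding field next to regular text fields
    in `query_by` triggers a hybrid (keyword + semantic) search.
    """
    return {
        "q": query,
        "query_by": ",".join([*keyword_fields, embedding_field]),
    }


# Hypothetical field names, for illustration only
params = hybrid_search_params("quarterly revenue forecast", ["title"])
print(params)
```

These parameters would then be passed to the search endpoint of a collection; keyword matches in `title` are the part that can be highlighted.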
> What are the options for embedding models? Is the embedding interface generic enough to use external tools like GCP APIs? Or does it depend on what you guys ship with it?
We’ll be shipping with API-based models like OpenAI’s embedding model, Google PaLM and Vertex APIs. We’ll also have these built-in models: S-BERT and E5.
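For a sense of how a model might be selected per field, here is a hedged sketch of a collection schema built as a plain Python dict. It assumes the `embed` / `model_config` field syntax and the model names (`ts/e5-small`, `openai/text-embedding-ada-002`) from the later released Typesense docs, which may not match the alpha build. The collection name `reports`, the field names, and the helper `embedding_field` are hypothetical.

```python
import json


def embedding_field(source_fields, model_name, api_key=None):
    """Describe a float[] field that is auto-embedded from other fields.

    Assumption: API-based models take an `api_key` in `model_config`,
    while built-in models only need a `model_name`.
    """
    model_config = {"model_name": model_name}
    if api_key is not None:
        model_config["api_key"] = api_key
    return {
        "name": "embedding",
        "type": "float[]",
        "embed": {"from": list(source_fields), "model_config": model_config},
    }


# Built-in model variant (hypothetical collection and field names)
schema = {
    "name": "reports",
    "fields": [
        {"name": "title", "type": "string"},
        embedding_field(["title"], "ts/e5-small"),
    ],
}

# API-based variant; the key is a placeholder, not a real credential
openai_field = embedding_field(
    ["title"], "openai/text-embedding-ada-002", api_key="YOUR_OPENAI_API_KEY"
)

print(json.dumps(schema, indent=2))
```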
> How would we tackle large document embeddings? I’m guessing out of the box you’ll just create vectors for the entire document, but I’m trying to think through how we can segment large documents (some of our documents represent large PDF reports).
We don’t handle this chunking at the moment, so you would have to handle this outside of Typesense.
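Since chunking has to happen outside Typesense, one simple approach is to split the extracted text into overlapping word windows and index each window as its own document. The sketch below is only one possible strategy, not something Typesense prescribes; the chunk size, overlap, and the document fields (`report_id`, `chunk_index`, `text`) are all assumptions chosen for illustration.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word windows of `chunk_size` that overlap by `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def to_chunk_documents(report_id: str, text: str, **chunk_options) -> list[dict]:
    """Turn one large report into one document per chunk (hypothetical field names)."""
    return [
        {
            "id": f"{report_id}-{index}",
            "report_id": report_id,
            "chunk_index": index,
            "text": chunk,
        }
        for index, chunk in enumerate(chunk_words(text, **chunk_options))
    ]


# Tiny example with small windows so the overlap is visible
for doc in to_chunk_documents("r1", "a b c d e f g h i j", chunk_size=4, overlap=1):
    print(doc["id"], "->", doc["text"])
```

Keeping `report_id` on every chunk makes it possible to group search hits back to the original PDF report.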
> How does using embeddings affect Typesense Cloud pricing?
Every vector dimension takes up 6-7 bytes in the index. So if you use a 1536-dimension embedding model, each document will require 9.2KB - 10.8KB of additional RAM, on top of the keyword-based index. We’re about to start working on reducing this by a factor of almost 60x in the next month or so.
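The estimate above is a straight multiplication of dimensions by bytes per dimension. A small sketch makes it easy to redo for other models or collection sizes; the one-million-document collection is a hypothetical figure for illustration, and the 6-7 bytes per dimension is the number quoted in the answer.

```python
def vector_ram_bytes(dimensions: int, bytes_per_dimension: float) -> float:
    """Approximate extra RAM per document for one embedding field."""
    return dimensions * bytes_per_dimension


DIMENSIONS = 1536  # the embedding size used in the answer above

low = vector_ram_bytes(DIMENSIONS, 6)
high = vector_ram_bytes(DIMENSIONS, 7)
print(f"per document: {low / 1000:.1f}KB - {high / 1000:.1f}KB")
# per document: 9.2KB - 10.8KB

# Hypothetical collection size, only to show how the estimate scales
num_docs = 1_000_000
print(f"{num_docs:,} documents: {low * num_docs / 1e9:.1f}GB - {high * num_docs / 1e9:.1f}GB")
# 1,000,000 documents: 9.2GB - 10.8GB
```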
> How do synonyms interact with the embeddings?
At the moment, synonyms only affect keyword-based search. For semantic search, we let the embedding model handle it natively.
Finding Similar Documents Using JSON and Embeddings
Manish wants to find similar JSON documents and asks for advice. Jason suggests using Sentence-BERT with vector query and provides guidance on working with OpenAI embeddings and Typesense. They discuss upcoming Typesense features and alternative models.
Utilizing Vector Search and Word Embeddings for Comprehensive Search in Typesense
Bill sought clarification on using vector search with multiple word embeddings in Typesense and using them instead of OpenAI's embedding. Kishore Nallan and Jason informed him that their development version 0.25 supports open source embedding models. They also resolved Bill's concerns regarding search performance, language support, and limitations in the search parameters.
Issues with Cluster Upgrade and Embedding Field
Gustavo had issues upgrading their cluster and their embedding field wasn't being filled. Jason helped to solve the upgrade issue and advised re-indexing the documents to solve the embedding field issue. Both problems were successfully resolved.