#community-help

Combining Embedding Feature with Highlights and Large Documents

TLDR Walter had questions about Typesense's upcoming embedding feature, including highlights, embedding models, handling large documents, and its impact on cloud pricing. Jason provided answers and invited Walter to test an alpha build.

May 30, 2023 (4 months ago)
Walter
11:18 PM
hey guys I am really looking forward to the embedding feature in the upcoming release. We are using typesense prolifically to search our research archive and chart library at portal.variantperception.com

1. Will it be possible to combine this with highlights? I know that's tricky but if anyone can figure it out it's you guys 😉
2. What are the options for embedding models? Is the embedding interface generic enough to use external tools like GCP apis? Or does it depend on what you guys ship with it?
3. How would we tackle large document embeddings? I'm guessing out of the box you'll just create vectors for the entire document, but I'm trying to think through how we can segment large documents. (some of our documents represent large pdf reports).
4. how does using embeddings affect typesense cloud pricing?
5. how do synonyms interact with the embeddings?
A few questions... I know... but not urgent, just interested in how you're thinking about it.

Thank you guys for all your hard work!
Jason
11:26 PM
> Will it be possible to combine this with highlights? I know that’s tricky but if anyone can figure it out it’s you guys 😉
Haha! If you do a hybrid search (keyword + semantic search combined) then we will highlight the keywords
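A hybrid search like Jason describes can be sketched as a single search request that queries both a keyword field and an embedding field; the keyword portion is what gets highlighted. The field names (`title`, `embedding`) and the exact parameter syntax are assumptions about the alpha build, not confirmed details:

```python
# Sketch of a hybrid (keyword + semantic) search request. Listing both a
# keyword field and a vector field in `query_by` is assumed to trigger
# hybrid search; highlights come from the keyword match on `title`.
search_parameters = {
    "q": "monetary policy outlook",
    "query_by": "title,embedding",  # hypothetical keyword field + vector field
    "highlight_fields": "title",    # highlighting applies to the keyword side
}

# With the Python client this would be sent roughly as:
# client.collections["reports"].documents.search(search_parameters)
```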

> What are the options for embedding models? Is the embedding interface generic enough to use external tools like GCP apis? Or does it depend on what you guys ship with it?
We’ll be shipping with API-based models like OpenAI’s embedding models, Google PaLM, and Vertex AI APIs. We’ll also ship with built-in models: S-BERT and E5.
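To make the model options concrete, here is a hypothetical collection schema declaring an auto-embedding field backed by one of the built-in models Jason mentions. The `embed` / `model_config` keys and the `ts/e5-small` model name are assumptions about how the alpha exposes this, not confirmed syntax:

```python
# Hypothetical schema: an embedding field generated automatically from the
# "title" field using a built-in E5 model. Swapping model_name for an
# API-based model (e.g. an OpenAI embedding model) would presumably also
# require API credentials in model_config.
schema = {
    "name": "reports",
    "fields": [
        {"name": "title", "type": "string"},
        {
            "name": "embedding",
            "type": "float[]",
            "embed": {
                "from": ["title"],  # source field(s) to embed
                "model_config": {"model_name": "ts/e5-small"},  # built-in model
            },
        },
    ],
}
```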

> How would we tackle large document embeddings? I’m guessing out of the box you’ll just create vectors for the entire document, but I’m trying to think through how we can segment large documents. (some of our documents represent large pdf reports).
We don’t handle this chunking at the moment, so you would have to handle this outside of Typesense.
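Since chunking has to happen outside Typesense, one minimal approach for Walter's large PDF reports is to split each document into overlapping word windows and index every chunk as its own document, carrying a parent ID so results can be grouped back to the report. A sketch (window sizes and the `parent_id` convention are illustrative choices, not Typesense features):

```python
def chunk_text(text, chunk_words=200, overlap=40):
    """Split text into overlapping word windows for per-chunk embeddings."""
    words = text.split()
    step = chunk_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break  # last window already covers the tail of the document
    return chunks

# Each chunk would then become its own Typesense document, e.g.:
# {"id": f"{report_id}-{i}", "parent_id": report_id, "text": chunk}
```

Grouping search results by `parent_id` (e.g. with `group_by`) would then collapse multiple matching chunks back into one report per hit.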

> how does using embeddings affect typesense cloud pricing?
Every vector dimension takes up 6-7 bytes in the index. So if you use a 1536-dimension embedding model, each document will require 9.2 KB - 10.8 KB of additional RAM, besides the keyword-based index. We’re about to start working on reducing this by a factor of almost 60x in the next month or so.
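As a back-of-the-envelope check of those numbers: at 6-7 bytes per dimension, a 1536-dimension vector works out to roughly 9.2-10.8 KB of extra RAM per document:

```python
def embedding_ram_kb(dims, bytes_per_dim):
    """Rough per-document RAM cost of a stored embedding, in KB."""
    return dims * bytes_per_dim / 1000

low = embedding_ram_kb(1536, 6)   # 1536 * 6 = 9216 bytes, about 9.2 KB
high = embedding_ram_kb(1536, 7)  # 1536 * 7 = 10752 bytes, about 10.8 KB
```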

> how do synonyms interact with the embeddings?
At the moment, synonyms only affect keyword-based search. For semantic search, we let the embedding model handle it natively.
Jason
11:26 PM
We actually have an alpha build with the above features. Let me know if you’d be interested in testing it out!
Walter
11:44 PM
oh wow, yes I am very interested. I would love to test that out on our dev app. Can I give you our cluster ID for dev?
Jason
11:45 PM
Yeah, if you can DM me your cluster ID, I can queue up an upgrade