# community-help
w
Hey all, I'm looking into using vector search to find items similar to a given id in my collection. Just had a question regarding embeddings: is there a way to regenerate auto embeddings in Typesense? For instance, let's say "My Show" Season 1 is in my collection and has an embedding that I use to find similar shows, and later down the line I dynamically add "My Show" Season 2. The issue is, wouldn't "My Show" Season 1 still have the same embedding, and therefore not show Season 2 as a similar item? Additionally, just to ensure my understanding is correct: items added to my collection dynamically would have auto-embeddings generated correct? Thanks for your time!
j
Embeddings are deterministic - meaning if you generate embeddings for the string "My Show", the result will always be the same, no matter how many times you regenerate it
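A toy way to picture this determinism (this is not Typesense's actual model, just a hypothetical hash-based stand-in for illustration): the same input string always maps to the same vector.

```python
import hashlib

def toy_embedding(text: str, dims: int = 4) -> list[float]:
    """Hypothetical stand-in for a real embedding model: deterministically
    maps a string to a fixed-length vector via hashing. Real models
    (e.g. S-BERT) are likewise deterministic for a given model version."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Scale the first `dims` bytes into [0, 1) floats for a repeatable vector.
    return [b / 256 for b in digest[:dims]]

# Regenerating the embedding for the same string yields the same vector.
assert toy_embedding("My Show") == toy_embedding("My Show")
# A different string yields a different vector.
assert toy_embedding("My Show Season 2") != toy_embedding("My Show")
```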
items added to my collection dynamically would have auto-embeddings generated correct?
Correct. If the strings in the `embed.from` fields you've mentioned change, then embeddings will be recalculated
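For reference, a minimal sketch of what an auto-embedding collection schema might look like (the collection name, field names, and model name here are assumptions; check the Typesense docs for your version):

```python
# Hypothetical collection schema: `embedding` is auto-generated from the
# `title` and `description` fields listed under `embed.from`. If either
# of those fields changes on a document, its embedding is recalculated.
schema = {
    "name": "shows",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "description", "type": "string"},
        {
            "name": "embedding",
            "type": "float[]",
            "embed": {
                "from": ["title", "description"],
                "model_config": {"model_name": "ts/all-MiniLM-L12-v2"},
            },
        },
    ],
}
```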
w
Ah I see, so similar items will always have similar embedding values regardless of what's in the overall collection. It's my first time working with embeddings, so I wasn't sure exactly how it worked. So just to confirm if there's 1 item in my entire collection, or 100k items. The embedding would be the exact same. So similar items added later on will appear in the vector search for similar items because, theoretically, they'd have a pretty close embedding anyway.
j
so similar items will always have similar embedding values regardless of what's in the overall collection.
Not similar items: the same item will always have the same embedding, regardless of what's in the overall collection. Similar items might have slightly different embeddings. So, for example, a string of "My Show Season 1" will have a different embedding than "My Show Season 2", because the strings are different.
So just to confirm if there's 1 item in my entire collection, or 100k items. The embedding would be the exact same.
That's correct.
The similarity piece only comes into play at search time: when you query, you're doing a nearest-neighbor search internally, which says "give me all docs with embeddings that are close to this query embedding"
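The nearest-neighbor idea can be sketched in a few lines of plain Python (the vectors below are made-up toy "embeddings", not real model output; cosine similarity is one common distance measure, though Typesense supports others):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for three documents, plus a query vector.
docs = {
    "My Show Season 1": [0.9, 0.1, 0.0],
    "My Show Season 2": [0.85, 0.15, 0.05],
    "Cooking Tonight":  [0.1, 0.2, 0.9],
}
query = [0.88, 0.12, 0.02]

# Rank documents by similarity to the query, most similar first.
ranked = sorted(docs, key=lambda name: cosine_similarity(docs[name], query),
                reverse=True)
# Both seasons rank ahead of the unrelated show, even though their
# embeddings differ slightly.
```

This is why a similar document added later still shows up: its embedding lands near the query embedding, and the ranking at search time does the rest.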
w
Yes, that makes sense. That's cleared up a lot for me. Thanks for your time!
👍 1
Just 2 more questions: 1. What would be considered the best model for just getting similar items? I see all the models in the repo but am not sure what the real differences are. 2. Is it right to say that to speed up embedding generation we would only need the GPU during a full ingestion into the collection? As in, during a vector search to find the nearest neighbour, is there any heavier-than-usual processing that would benefit from having a GPU for that task specifically?
j
1) "Best" depends on how close the model's training dataset is close to your own data. So you want to read up on each model and see which is closest to your domain.
all-MiniLM-L12-v2
(S-BERT) was training on all of wikipedia data and is a good general model. 2) That's correct
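Tying this back to the original similar-items question, here is a sketch of the search parameters for finding documents near an existing document's stored embedding (the collection name and document id are hypothetical; verify the `vector_query` syntax against the Typesense docs for your version):

```python
# Hypothetical similar-items search: passing an empty vector `[]` with an
# `id` asks Typesense to use that document's stored embedding as the query,
# so newly ingested similar documents (e.g. "My Show" Season 2) can appear
# in the results without regenerating Season 1's embedding.
search_params = {
    "collection": "shows",
    "q": "*",
    "vector_query": "embedding:([], id: my-show-season-1)",
}
```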
thankyou 1