# community-help
w
Hey all, I'm looking into using vector search to find items similar to a given id in my collection. Just had a question regarding embeddings: is there a way to regenerate auto embeddings in Typesense? For instance, let's say "My Show" Season 1 is in my collection and has an embedding that I use to find similar shows, and later down the line I dynamically add "My Show" Season 2. The issue is, wouldn't "My Show" Season 1 still have the same embedding, and therefore not show Season 2 as a similar item? Additionally, just to ensure my understanding is correct: items added to my collection dynamically would have auto-embeddings generated correct? Thanks for your time!
j
Embeddings are deterministic - meaning if you generate embeddings for the string "My Show", the result will always be the same, no matter how many times you regenerate it
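A toy way to picture this determinism (this is not Typesense's actual model, just a hypothetical hash-based stand-in for illustration): the same input string always maps to the same vector.

```python
import hashlib

def toy_embedding(text: str, dims: int = 4) -> list[float]:
    """Hypothetical stand-in for a real embedding model: deterministically
    maps a string to a fixed-length vector via hashing. Real models
    (e.g. S-BERT) are likewise deterministic for a given model version."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Scale the first `dims` bytes into [0, 1) floats for a repeatable vector.
    return [b / 256 for b in digest[:dims]]

# Regenerating the embedding for the same string yields the same vector.
assert toy_embedding("My Show") == toy_embedding("My Show")
# A different string yields a different vector.
assert toy_embedding("My Show Season 2") != toy_embedding("My Show")
```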
items added to my collection dynamically would have auto-embeddings generated correct?
Correct. If the strings in the `embed.from` fields you've mentioned change, then embeddings will be recalculated
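For reference, a minimal sketch of what an auto-embedding collection schema might look like (the collection name, field names, and model name here are assumptions; check the Typesense docs for your version):

```python
# Hypothetical collection schema: `embedding` is auto-generated from the
# `title` and `description` fields listed under `embed.from`. If either
# of those fields changes on a document, its embedding is recalculated.
schema = {
    "name": "shows",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "description", "type": "string"},
        {
            "name": "embedding",
            "type": "float[]",
            "embed": {
                "from": ["title", "description"],
                "model_config": {"model_name": "ts/all-MiniLM-L12-v2"},
            },
        },
    ],
}
```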
w
Ah I see, so similar items will always have similar embedding values regardless of what's in the overall collection. It's my first time working with embeddings, so I wasn't sure exactly how it worked. So just to confirm if there's 1 item in my entire collection, or 100k items. The embedding would be the exact same. So similar items added later on will appear in the vector search for similar items because, theoretically, they'd have a pretty close embedding anyway.
j
so similar items will always have similar embedding values regardless of what's in the overall collection.
Not similar items: the same item will always have the same embedding, regardless of what's in the overall collection. Similar items might have slightly different embeddings. So, for example, a string of "My Show Season 1" will have a different embedding than "My Show Season 2", because the strings are different.
So just to confirm if there's 1 item in my entire collection, or 100k items. The embedding would be the exact same.
That's correct.
The similarity piece only comes into play at search time: when you query, you're doing a nearest-neighbor search internally, which says "give me all docs with embeddings that are close to this query embedding"
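The nearest-neighbor idea can be sketched in a few lines of plain Python (the vectors below are made-up toy "embeddings", not real model output; cosine similarity is one common distance measure, though Typesense supports others):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for three documents, plus a query vector.
docs = {
    "My Show Season 1": [0.9, 0.1, 0.0],
    "My Show Season 2": [0.85, 0.15, 0.05],
    "Cooking Tonight":  [0.1, 0.2, 0.9],
}
query = [0.88, 0.12, 0.02]

# Rank documents by similarity to the query, most similar first.
ranked = sorted(docs, key=lambda name: cosine_similarity(docs[name], query),
                reverse=True)
# Both seasons rank ahead of the unrelated show, even though their
# embeddings differ slightly.
```

This is why a similar document added later still shows up: its embedding lands near the query embedding, and the ranking at search time does the rest.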
w
Yes, that makes sense. That's cleared up a lot for me. Thanks for your time!
👍 1
Just 2 more questions: 1. What would be considered the best model for just getting similar items? I see all the models in the repo but am not sure what the real differences are. 2. Is it right to say that to speed up embedding generation we would only need the GPU during a full ingestion into the collection? As in, during a vector search to find the nearest neighbour, is there any heavier-than-usual processing that would benefit from having a GPU for that task specifically?
j
1) "Best" depends on how close the model's training dataset is close to your own data. So you want to read up on each model and see which is closest to your domain.
all-MiniLM-L12-v2
(S-BERT) was training on all of wikipedia data and is a good general model. 2) That's correct
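Tying this back to the original similar-items question, here is a sketch of the search parameters for finding documents near an existing document's stored embedding (the collection name and document id are hypothetical; verify the `vector_query` syntax against the Typesense docs for your version):

```python
# Hypothetical similar-items search: passing an empty vector `[]` with an
# `id` asks Typesense to use that document's stored embedding as the query,
# so newly ingested similar documents (e.g. "My Show" Season 2) can appear
# in the results without regenerating Season 1's embedding.
search_params = {
    "collection": "shows",
    "q": "*",
    "vector_query": "embedding:([], id: my-show-season-1)",
}
```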
thankyou 1