# community-help
d
Hi, I want to use Google's `text-embedding-005` model, but the docs aren't super clear on how Typesense handles embedding documents (or I missed it):
• I want to create 256-dimension embeddings... is setting `num_dim` enough?
• Google has this concept of Task Types - is that used? Using `RETRIEVAL_DOCUMENT` and `RETRIEVAL_QUERY` might be optimal... not 100% sure.
• I write to this collection daily, but the fields I want to embed don't change that often. Does Typesense only update the embedding if an `embed.from` field changes, or does a write event trigger a re-computation regardless?
• I have a collection with ~800k documents... is it worth trying a batch size above 200? Not sure if I'll hit rate limits or anything.
• One of the fields I want to embed can be very long - do you truncate it to a certain max length?
• How do you preprocess and format/order the embedding input if there are multiple `embed.from` fields (and arrays etc.)?
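For context, an auto-embedding field in a Typesense collection schema looks roughly like the sketch below. The exact `model_config` keys for a Google/GCP model (credentials, project ID) are assumptions here, not confirmed in this thread - check the Typesense docs for your version:

```python
# Sketch of a Typesense collection schema with an auto-embedding field.
# The GCP credential keys below are placeholders/assumptions; verify the
# exact model_config shape against the Typesense docs for your version.
schema = {
    "name": "articles",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "body", "type": "string"},
        {
            "name": "embedding",
            "type": "float[]",
            "embed": {
                # The text to embed is built from these fields,
                # space-joined in this order.
                "from": ["title", "body"],
                "model_config": {
                    "model_name": "gcp/text-embedding-005",
                    # Placeholder credentials; supply real GCP values.
                    "access_token": "...",
                    "refresh_token": "...",
                    "client_id": "...",
                    "client_secret": "...",
                    "project_id": "my-gcp-project",
                },
            },
        },
    ],
}

# With the official Python client this would then be created via
# client.collections.create(schema).
print(schema["fields"][2]["embed"]["from"])  # ['title', 'body']
```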
j
CC: @Ozan Armağan
o
Hi Daniel,
1. You won't need to set the dimensions manually; Typesense makes an API call to the model with a dummy text and sets `num_dim` according to the response.
2. We don't use this currently.
3. Yes, the embeddings are only updated if any of the fields in `embed.from` is updated.
4. You would probably hit rate limits; I think you should leave it at 200.
5. For local embedding models we truncate the inputs to 512 tokens, but for remote embedder services (Google, OpenAI, etc.) we don't do any truncation, as they handle this internally.
6. We join all fields with a space, in the same order as `embed.from`.
You can open a GitHub issue for point #2.
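Point 6 ("join all fields with a space, in the same order as `embed.from`") can be sketched like this. This is a simplified illustration, not Typesense's actual code, and the handling of array fields (flattened in place) is an assumption:

```python
def build_embedding_input(document, embed_from):
    """Join the embed.from fields with spaces, in order.

    Flattening array-valued fields in place is an assumption about
    Typesense's behavior, not something confirmed in this thread.
    """
    parts = []
    for field in embed_from:
        value = document.get(field, "")
        if isinstance(value, list):
            parts.extend(str(v) for v in value)
        else:
            parts.append(str(value))
    return " ".join(parts)

doc = {"title": "Hello", "tags": ["a", "b"], "body": "World"}
print(build_embedding_input(doc, ["title", "tags", "body"]))
# "Hello a b World"
```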
d
For #1 - the default dimension size is 768, but they allow you to set it between 1-768. I would like to set it to 256. So are you saying I can't customize it, and it will just default to 768?
o
We support that for OpenAI’s text-embedding-3-* models by setting `num_dim` manually, but not yet for Google models. Could you also open an issue for that?
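For comparison, the OpenAI case mentioned above would look roughly like this (a sketch following the field conventions of Typesense auto-embedding schemas; the exact placement of `num_dim` and the `model_config` keys are assumptions to verify against the docs):

```python
# Sketch: requesting 256 dimensions from an OpenAI text-embedding-3-*
# model by setting num_dim on the embedding field. Field/key names are
# assumptions based on Typesense's auto-embedding schema conventions.
embedding_field = {
    "name": "embedding",
    "type": "float[]",
    "num_dim": 256,  # honored for OpenAI text-embedding-3-* models
    "embed": {
        "from": ["title"],
        "model_config": {
            "model_name": "openai/text-embedding-3-small",
            "api_key": "sk-...",  # placeholder
        },
    },
}
print(embedding_field["num_dim"])  # 256
```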
m
> Yes, the embeddings are only updated if any of the fields in `embed.from` is updated.
This is very important information.
I was about to create a whole data pipeline to check if my "embed.from" fields changed to avoid unnecessary embedding recreation
please, add that to the docs. Couldn't find this information there
f
> please, add that to the docs. Couldn't find this information there
Will do
d
I can definitely open those issues. I'm also curious:
• Do you include facets in the embedding for a search query?
• Can I store the embedding for a query so it doesn't get recalculated every time?
o
@Daniel Martel
> Do you include facets in the embedding for a search query?
No, we only embed the query.
> Can I store the embedding for a query so it doesn't get recalculated every time?
We already do this automatically: we cache the results of embedding calls and reuse them.
👍 1
d
Sometimes users may have no query but only facets selected... in that case, for hybrid search, are embeddings just not used? Is it worth opening an issue to include facets in the embedding? Especially if we embed facet fields in the document embedding.
^ nvm looking back on this that wouldn't really make sense lol. I'm assuming it just becomes a keyword search at that point? I think embedding facets when there's a query makes sense though.
o
Thanks Daniel, we will add those to our roadmap.
🙌 1