# community-help
d
We have a collection with 10+ fields and about 300k records. We have a rare use case where semantic search would be quite useful for our users, but most of the queries are still keyword-only. So we decided to enable auto-embedding with a local model for our collection. The process went smoothly and all features are working, but we found that latencies for keyword search in the same collection increased by a factor of 5 (p95 went from 120ms to 600ms). Do you have any hypothesis as to why keyword-only searches might slow down in a collection with embeddings?
Configuration:
• 3 nodes, 3 GB memory, 3 vCPU
• Memory consumption before embedding ~500 MB, after ~1.6 GB
• Increasing memory to 5 GB didn't help
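For context, the setup looks roughly like this (a minimal sketch with the Python client; collection, field, and model names here are illustrative, not our real schema):

```python
import typesense

# Hypothetical client setup; host/port/key are placeholders.
client = typesense.Client({
    "nodes": [{"host": "localhost", "port": 8108, "protocol": "http"}],
    "api_key": "xyz",
    "connection_timeout_seconds": 5,
})

# Illustrative schema: regular keyword/facet fields plus an
# auto-embedding field backed by a locally-run built-in model.
client.collections.create({
    "name": "items",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "description", "type": "string"},
        {"name": "category", "type": "string", "facet": True},
        {
            "name": "embedding",
            "type": "float[]",
            "embed": {
                "from": ["title", "description"],
                "model_config": {"model_name": "ts/all-MiniLM-L12-v2"},
            },
        },
    ],
})
```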
I have a blind guess that it could be related to some of the `index: false` fields in the collection 🤔
k
Check if you are returning the embedding fields in the response. That will increase I/O latency, and those fields are also large, so they take time to process.
d
Nope, we're using a strict list of `include_fields`.
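Roughly like this, reusing the illustrative names from above (our real field list is different):

```python
# Keyword-only search that never touches the embedding field and
# whitelists exactly which fields come back in the response.
params = {
    "q": "red running shoes",
    "query_by": "title,description",
    "include_fields": "id,title,category",
}
results = client.collections["items"].documents.search(params)
```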
k
So you are saying that a query that doesn't even use the embedding field is now taking that much longer?
What's your per_page? Can you try running the same keyword search query but with per_page set to 1?
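i.e. keep everything identical and only shrink the page, something like:

```python
# Same query as before, but fetch a single document per page to
# isolate the per-document disk-fetch cost.
params["per_page"] = 1
results = client.collections["items"].documents.search(params)
```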
d
> So you are saying that a query that doesn't even use the embedding field is now taking that much longer?

Yes, exactly.

> What's your per_page? Can you try running the same keyword search query but with per_page set to 1?

Will try this week.
k
I wonder if somehow it's taking a long time to read that record from disk now. The per_page of 1 will help us figure out if that's the issue.
d
BTW, how do we ensure that no reading from disk is happening? If I use `index: true` for all fields in the collection, should that make it truly in-memory?
k
`index: true` just means we enable in-memory indices. There is `store: false` for not storing the data.
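Roughly like this (a sketch with hypothetical field names; note that an unindexed field also has to be marked optional):

```python
# index and store are independent per-field toggles.
fields = [
    # Default behavior: in-memory index built, raw value stored on disk.
    {"name": "title", "type": "string", "index": True},
    # No in-memory index, but the value is still stored on disk
    # (unindexed fields must be marked optional).
    {"name": "raw_payload", "type": "string", "index": False, "optional": True},
    # Indexed in memory, but the raw value is not persisted to disk.
    {"name": "embedding", "type": "float[]", "store": False},
]
```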
👀 1
d
Hm, I got this insight from the docs:

> You want to NOT mention these fields in the collection's schema or mark these fields as `index: false` (see `fields` schema parameter below) to mark it as an unindexed field. You can have any number of these additional unindexed fields in the documents when adding them to a collection - they will just be stored on disk, and will not take up any memory.
k
Yes. Having them in the schema with `index: false` just ensures that the data in the records is validated (e.g. for presence of mandatory fields), which won't happen if the field is not part of the schema at all.
Also, are you doing any facets?
d
Yes, we have one facet field in the collection
k
Do you change the `max_facet_values` default?
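i.e. whether the query passes something like this (sketch):

```python
# max_facet_values caps how many distinct values are returned per
# facet field; the default is 10.
params["facet_by"] = "category"
params["max_facet_values"] = 100
```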
d
No, it's a small set of values, around 10
k
I ran into an issue with another customer where larger docs have noticeably slower faceting performance, because we rely on fetching the doc from disk for a part of the facet computation, which gets slow for large docs. Will be releasing a patch for it in a day or so.
This is a problem that has surfaced as people introduced embedding vector fields, which are very large.
d
Sounds similar, yes 👍
k
Btw, the `store: false` approach I suggested above won't help here, because that would lose the embedding index on restart.
👌 1
d
Disabling facets helped decrease p95 from 600ms to 345ms, still not back to the old 120ms 🤔
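(By disabling I just mean dropping the facet params from the same query, e.g.:)

```python
# Re-run the identical query with faceting turned off.
params.pop("facet_by", None)
params.pop("max_facet_values", None)
results = client.collections["items"].documents.search(params)
```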
k
Yes, because there is probably some latency involved in fetching the page from disk as well, because of the size of the docs.
Also, additional cycles to parse the JSON string from disk into a JSON record in the program to do field inclusion, exclusion, etc.