#community-help

Hybrid Search Distance Threshold Issue

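TLDR Anish finds that hybrid search results do not respect the vector distance threshold. Jason explains that vector_distance is only calculated for results that come from vector search, so the threshold only applies to those, while keyword-matched results are returned with vector_distance set to 2. Changing the rank fusion formula is not currently possible, and Jason suggests opening a feature request on GitHub.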

Sep 12, 2023
Anish
03:47 PM
👋 When I set a distance threshold like 'vector_query': 'embedding:([], distance_threshold:0.30)', it doesn't seem to affect the results of a hybrid search. Some of the results returned have a vector distance of 2. How would I set a threshold?
Anish
03:48 PM
I'm using 'query_by': 'embedding,description'
Jason
03:59 PM
Hmm, that is indeed the way to do it...
Jason
03:59 PM
I'm not able to reproduce the issue on a test dataset. For example:

➜  ~ curl -s '' \
-X 'POST' \
--data-binary '
      {
        "searches": [
          {
            "query_by": "embedding",
            "vector_query": "embedding:([], distance_threshold:0.50, k:10)",
            "collection": "hn-comments",
            "q": "cinema"
          }
        ]
      }
' | jq '.results[0].hits[].vector_distance'
0.35301458835601807
0.3530757427215576
0.37170201539993286
0.37984395027160645
0.39115262031555176
0.4030781388282776
0.4054117798805237
0.410835325717926
0.41370344161987305
0.4149748682975769
Jason
04:00 PM
Could you give me a curl snippet like this that creates a sample collection and documents, and reproduces the issue?
Anish
04:06 PM
Jason, it actually works fine with "query_by": "embedding", but as soon as I add another field there, the vector distance threshold isn't respected.
Anish
04:07 PM
It's difficult to provide a curl snippet with the embeddings and everything, but I can try to set up an example.
Jason
04:07 PM
Ah, I see what you mean. For example:

➜  ~ curl -s '' \
-X 'POST' \
--data-binary '
      {
        "searches": [
          {
            "query_by": "text,embedding",
            "vector_query": "embedding:([], distance_threshold:0.50, k:10)",
            "collection": "hn-comments",
            "q": "cinema"
          }
        ]
      }
' | jq '.results[0].hits[].vector_distance'
2
2
0.35301458835601807
2
2
0.3530757427215576
2
0.37170201539993286
2
2

Jason
04:08 PM
In a hybrid search, the vector_distance field is set to 2 for all results that come from keyword search.
Jason
04:09 PM
Technically, vector_distance is only calculated for results that were pulled up by vector search, and the distance threshold only applies to vector search.
Jason
04:10 PM
So the bug here is that we're returning vector_distance: 2 for keyword-matched results, when technically we should not be returning that field at all.
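Given the behavior Jason describes, one possible client-side workaround is to post-filter the hits yourself. The sketch below is only an illustration: `filter_hits_by_distance` and `keep_keyword_only` are hypothetical names, and it assumes each hit is a dict shaped like the entries in `results[0].hits`, with keyword-matched hits carrying the placeholder `vector_distance` of 2 seen in the output above.

```python
# Placeholder value observed in the thread for keyword-matched hits.
# Assumption: a genuine vector match will not have a distance this large.
KEYWORD_ONLY_DISTANCE = 2


def filter_hits_by_distance(hits, threshold, keep_keyword_only=False):
    """Return the hits whose vector_distance is within the threshold.

    Hits with no vector_distance, or with the placeholder value, are treated
    as keyword-only matches and are kept only if keep_keyword_only is True.
    """
    kept = []
    for hit in hits:
        distance = hit.get("vector_distance")
        is_keyword_only = distance is None or distance >= KEYWORD_ONLY_DISTANCE
        if is_keyword_only:
            if keep_keyword_only:
                kept.append(hit)
            continue
        if distance <= threshold:
            kept.append(hit)
    return kept
```

Calling `filter_hits_by_distance(hits, 0.30)` would drop every keyword-only hit, which effectively turns the hybrid query back into a thresholded vector search; pass `keep_keyword_only=True` if the keyword matches should stay in the list.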
Anish
04:11 PM
Is there a way to change the rank fusion formula (for example, a keyword weight lower than 0.7)?
Jason
04:12 PM
Not at the moment... Could you open a feature request on GitHub for this?
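Until the weighting is configurable, one option might be to run the keyword and vector searches as two separate queries and fuse the ranked lists on the client. The sketch below is an assumption-laden illustration, not a description of Typesense's internals: `fuse_ranks` is a hypothetical helper, the weighted reciprocal-rank form is assumed, and the 0.7 default only mirrors the keyword weight mentioned above.

```python
def fuse_ranks(keyword_ids, vector_ids, keyword_weight=0.7):
    """Fuse two ranked lists of document ids with a weighted reciprocal rank.

    Each list is ordered best-first. A document at 1-based rank r contributes
    weight / r to its fused score; documents missing from a list contribute
    nothing for that list.
    """
    vector_weight = 1.0 - keyword_weight
    scores = {}
    for rank, doc_id in enumerate(keyword_ids, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + keyword_weight / rank
    for rank, doc_id in enumerate(vector_ids, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + vector_weight / rank
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

Lowering `keyword_weight` (for instance to 0.3) shifts the final ordering toward the vector results, and the vector list can be thresholded by distance before fusing.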
