# community-help
j
Hello! I am having trouble with hybrid search at the moment. It appears that I get incorrect ranking in certain circumstances. In particular, when there is a tie in the text matching, I expect to get the document with the lowest vector distance on top, but this is not happening. From the documentation we have
K = rank of document in keyword search
S = rank of document in semantic search

rank_fusion_score = 0.7 * K + 0.3 * S
It appears that if we have several hits with the same text match, they will get an arbitrary keyword search rank. This arbitrary rank will then be weighted together with the semantic search rank. The result seems to be arbitrary in the end, even if semantically closer documents clearly should be on top. This is something that we experience as random ordering for many searches, which is not great. Perhaps this algorithm can be adjusted to allow several documents with the same keyword search rank, in order to make the semantic search rank the tie breaker.
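To make the tie problem concrete, here is a minimal Python sketch. It takes the formula above literally and assumes ranks are 1-based positions where a lower fused value ranks higher; the engine's actual computation may differ (for example, it may use reciprocal ranks). The document IDs, scores, and distances are made up for illustration.

```python
# Hypothetical hits: all three have the same text match score, so their
# keyword rank is decided by an arbitrary tie-breaker (here: input order).
hits = [
    {"id": "A", "text_match": 100, "vector_distance": 0.90},
    {"id": "B", "text_match": 100, "vector_distance": 0.10},
    {"id": "C", "text_match": 100, "vector_distance": 0.50},
]


def strict_ranks(items, key, reverse=False):
    """Assign 1-based ranks with no ties; equal keys keep their input order."""
    ordered = sorted(items, key=key, reverse=reverse)  # sorted() is stable
    return {item["id"]: pos for pos, item in enumerate(ordered, start=1)}


keyword_rank = strict_ranks(hits, key=lambda h: h["text_match"], reverse=True)
semantic_rank = strict_ranks(hits, key=lambda h: h["vector_distance"])


def fused(doc_id):
    # Formula as quoted from the documentation: 0.7 * K + 0.3 * S
    return 0.7 * keyword_rank[doc_id] + 0.3 * semantic_rank[doc_id]


# Assumption: a lower fused value means a better position.
fused_order = sorted((h["id"] for h in hits), key=fused)
print(fused_order)
```

Under these assumptions A ends up first even though it has the worst vector distance of the three, because its arbitrary keyword rank of 1 carries the larger weight.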
k
In recent v28 RC builds, you can try setting the rerank_hybrid_matches: true search parameter. When enabled, it'll compute text_match_score for records found with vector search only and vice versa. This might improve overall quality of hybrid search.
You can also directly rank keyword search results with semantic search this way: https://typesense.org/docs/27.1/api/vector-search.html#rank-keyword-search-via-vector-search
j
I have considered ranking the results on vector distance, but as far as I understand, this will essentially disable any keyword search functionality. It will just boil down to a vector search, which is not great either.
Correct me if I am wrong here.
How can I understand this new rerank_hybrid_matches parameter?
k
> this will essentially disable any keyword search functionality.
Why? Vector search will only be used to break ties in keyword search.
> How can I understand this new rerank_hybrid_matches parameter?
There will be documents that appear in top-K keyword hits but not in top-K semantic search hits (and vice versa). This option will make the engine compute the missing complementary score so that there is always a complete picture.
j
OK, rerank_hybrid_matches seems great, and it addresses something I have been struggling with as well. So if I provide an explicit sorting like in the docs
{
  "q": "shoes",
  "query_by": "title",
  "sort_by": "_text_match:desc,_vector_query(embedding:([])):asc"
}
Does this disable the hybrid search score and the K parameter?
k
Yes, with this we sort first by text match score, and the vector query score (semantic search score) is used only to break ties. Normal hybrid search uses the fusion formula, which does not apply here.
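As a rough illustration of that two-level sort (not the engine's code), the ordering can be sketched in Python as a sort on a tuple key. The field names and values below are hypothetical.

```python
# Hypothetical hits; field names are illustrative, not the engine's schema.
hits = [
    {"id": "A", "text_match": 100, "vector_distance": 0.90},
    {"id": "B", "text_match": 100, "vector_distance": 0.10},
    {"id": "C", "text_match": 80, "vector_distance": 0.05},
]

# Primary key: text match score, descending (negated for an ascending sort).
# Secondary key: vector distance, ascending, consulted only on ties.
ordered = sorted(hits, key=lambda h: (-h["text_match"], h["vector_distance"]))
order_ids = [h["id"] for h in ordered]
print(order_ids)
```

Here B moves ahead of A because they tie on text match, while C stays last despite having the smallest vector distance: the vector score never outweighs a text match difference in this mode.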
j
I am looking for a way to continuously combine text match and vector contributions. It seems like the current rank fusion score is broken, in the above sense. Can you point me to the part of the code where the score is computed, so I can make my own version?
k
I don't follow, can you elaborate on what you mean by it's broken?
j
If we have several hits with the same text match score, I would expect them to be sorted by vector similarity. But with the rank fusion algorithm, the hits are ranked strictly, even if they have the same text match score. So you get some document A with a much higher rank fusion score than document B, even though document B has the same text match and a better vector distance.
(This is my interpretation of what is going on, I have not seen the code.)
k
> if we have several hits with the same text match score, I would expect them to be sorted by vector similarity
That's not how rank fusion works. Rank fusion uses the rank of the document in keyword search and combines that with the rank of the document in semantic search, using the alpha parameter to weight both components. So even if several hits have the same text match score, if there is no secondary sorting condition, they have to be ordered somehow, which in our case is done by the ID of the record (document indexing order).
If you strictly need that type of behavior you have to rank keyword search by semantic search (link I shared earlier).
j
Yes, that is my understanding. I think the issue for me is that I really need this weighted combination of vector and text. But due to the implementation of the rank fusion, the ID of the document becomes more significant than the vector score in many cases. It leads to results that are unexpected, and not usable for us. I would argue that it is in fact broken, since it will order A before B even when B is strictly better than A. One could consider a slight modification, where documents with equal text match get the same rank factor in the expression. You could then have, say, 3 documents with rank 1, use the same formula for rank fusion, and get a correct ordering.
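A minimal Python sketch of that modification, under the same assumptions as the documented formula read literally (1-based ranks, lower fused value is better). The helper name dense_ranks and the sample data are hypothetical, and this is not how the engine currently computes the score.

```python
# Hypothetical hits: A, B and C tie on text match; D matches less well
# but is semantically closest.
hits = [
    {"id": "A", "text_match": 100, "vector_distance": 0.90},
    {"id": "B", "text_match": 100, "vector_distance": 0.10},
    {"id": "C", "text_match": 100, "vector_distance": 0.50},
    {"id": "D", "text_match": 80, "vector_distance": 0.05},
]


def dense_ranks(items, key, reverse=False):
    """1-based ranks where equal keys share the same rank (1, 1, 1, 2, ...)."""
    distinct = sorted({key(item) for item in items}, reverse=reverse)
    rank_of = {value: pos for pos, value in enumerate(distinct, start=1)}
    return {item["id"]: rank_of[key(item)] for item in items}


keyword_rank = dense_ranks(hits, key=lambda h: h["text_match"], reverse=True)
semantic_rank = dense_ranks(hits, key=lambda h: h["vector_distance"])


def fused(doc_id):
    # Same weights as the documented formula: 0.7 * K + 0.3 * S
    return 0.7 * keyword_rank[doc_id] + 0.3 * semantic_rank[doc_id]


# Assumption: a lower fused value means a better position.
tied_order = sorted((h["id"] for h in hits), key=fused)
print(tied_order)
```

With shared keyword ranks, the three tied documents are separated only by their semantic rank (B, then C, then A), while D can still compete through the weighted combination.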
k
This makes sense. Can you please create a GitHub issue? We will pick it up.
j
I will do that. Thank you!