# community-help
ó
Hey there! Is there any way to show the "relevance" of a result within a query? Like using the "rank_fusion_score"? Use case: we have a search platform for projects, and we would like to show all the possible results, including those that are a relatively low match, but show next to each result a score like 0% to 100% accuracy or a green-yellow-red circle. Can I use the rank_fusion_score as a proxy for it in some way? Did anyone try?
k
Rank fusion score basically derives a unified score from the positions of a result in vector search vs keyword search. For keyword search we use various signals for computing a score, but it's not a percentage because of how we have to rank the signals. However, we do expose a text_match_info object in the response that describes additional information, like the number of fields matched, which offers some explanation.
ó
Yup, I read it in the docs! The question is more about whether I can use them to show the final user something like a percentage of accuracy based on any of them. For example, since rank_fusion_score goes from 0 to 1, can I convert it to an absolute percentage? Like, a 0.9 would be 90% accuracy.
k
I don't think there is a way to convert them to a percentage like that.
ó
Is there any other way? Even translate them to green - yellow - red or something?
k
Typesense ranking scores are always relative, there is no absolute scale to compare them with.
ó
Hmm, and within the query? Like the top one will always be green, and if there's a difference of, let's say, 10% show it as yellow, and a bigger difference show it as red?
k
No, not possible.
ó
So there's no workaround or possible approach to show the final user how "accurate" or "relevant" a result is against the rest of the results?
k
You will have to derive a score like that based on your own business logic from the fields in text_match_info.
ó
Any suggestions on how?
k
You have to come up with something that makes sense for your use case. For example, you can divide tokens matched by the total tokens in the query and then average it with the embedding similarity, etc.
ó
So something along the lines of:
• 2 tokens matched out of 3 in the query -> 0.66
• Embedding similarity -> 0.5
• Weighted average giving 0.6 to matched tokens and 0.4 to embeddings -> ≈ 0.60
• As a percentage -> ~60%
Instead of embeddings, can I use the rank fusion score? It seems a more precise reflection of the accuracy, but I guess it can be misleading if it doesn't go from 0 to 1.
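(Quick sanity check of that arithmetic, nothing more:)
// Two of three query tokens matched, embedding similarity 0.5,
// weights 0.6 for tokens and 0.4 for embeddings.
double tokenScore = 2.0 / 3.0;                                   // ≈ 0.667
double embeddingSimilarity = 0.5;
double combined = 0.6 * tokenScore + 0.4 * embeddingSimilarity;  // 0.4 + 0.2 ≈ 0.60
System.Console.WriteLine(combined.ToString("P0"));               // prints roughly "60%"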
k
As I said, you have to experiment and figure out what works for your use case.
ó
Got it
Trying to make this work, I noticed that I'm missing text_match_info.tokens_matched, as it always shows zero. Is it a bug, @Kishore Nallan?
{
  "document": {
    "title": "Mejoras en la pista de pádel del Casal"
  },
  "highlight": {
    "title": {
      "matched_tokens": [
        "pista",
        "pádel"
      ],
      "snippet": "Mejoras en la <mark>pista</mark> de <mark>pádel</mark> del Casal",
      "value": "Mejoras en la <mark>pista</mark> de <mark>pádel</mark> del Casal"
    }
  },
  "highlights": [
    {
      "field": "title",
      "matched_tokens": [
        "pista",
        "pádel"
      ],
      "snippet": "Mejoras en la <mark>pista</mark> de <mark>pádel</mark> del Casal",
      "value": "Mejoras en la <mark>pista</mark> de <mark>pádel</mark> del Casal"
    }
  ],
  "hybrid_search_info": {
    "rank_fusion_score": 0.6666666865348816
  },
  "text_match": 1042983595,
  "text_match_info": {
    "best_field_score": "509269",
    "best_field_weight": 85,
    "fields_matched": 3,
    "num_tokens_dropped": 2,
    "score": "1042983595",
    "tokens_matched": 0,
    "typo_prefix_score": 255
  },
  "vector_distance": 0.5202423930168152
}
k
For vector-only matches, tokens_matched will be zero.
ó
It's not a vector only query, I'm passing:
"q": "pistas de pádel",
"query_by": "title,externalReference,contractingPartyName,locationNutsName,locationNutsCode",
That's why I have highlights with marks and matched_tokens
k
Then I need to look at the dataset to see what's happening
27 rc30 instead
k
Will check and get back to you.
ó
Working on the algorithm, I found that sometimes the vector_distance is null and then you do have the tokens_matched field. But when the vector_distance is null, I don't get the rank_fusion_score. Anyway, maybe this helps you figure out what to look for.
k
Vector distance is null if that record was found only via keyword search.
ó
What I mean is that the tokens_matched field only has a value when vector_distance is null.
k
> But when the vector_distance is null, I don't get the rank_fusion_score.
The rank_fusion_score is 0.5, right?
ó
yup
k
I still think that the case where tokens_matched is zero is when we don't find that match in keyword search. The highlighting will happen nevertheless, because any word that is present in both the query and the field will be highlighted. I will confirm after testing.
ó
Hmmm, then there's something else going on. Like if it's found through vector search it won't apply keyword search, and the other way around. That would also render the rank_fusion_score unusable for sorting, as it won't be truly consistent. The other thing I found is that even when the VectorDistance is null, the distance, if you calculate it yourself, can still represent a solid match. Here's the cosine similarity calculated with this package: https://learn.microsoft.com/en-us/dotnet/api/system.numerics.tensors.tensorprimitives.[…]imilarity?view=net-8.0&viewFallbackFrom=dotnet-plat-ext-8.0
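For reference, the client-side calculation is roughly this (dummy vectors just to show the call; in practice the inputs are the query embedding and the document's stored embedding field):
using System;
using System.Numerics.Tensors;   // NuGet package: System.Numerics.Tensors (.NET 8)

// Dummy 4-dimensional vectors just to show the call; in practice these are the query
// embedding and the document's stored "embedding" field (1024 dimensions in my schema).
float[] queryEmbedding    = { 0.1f, 0.3f, 0.5f, 0.7f };
float[] documentEmbedding = { 0.2f, 0.1f, 0.4f, 0.9f };

// CosineSimilarity returns a value in [-1, 1]; with a cosine index, Typesense's
// vector_distance is the cosine distance, i.e. roughly 1 - similarity.
float similarity = TensorPrimitives.CosineSimilarity(queryEmbedding, documentEmbedding);
Console.WriteLine($"similarity = {similarity:F3}, distance ≈ {1f - similarity:F3}");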
k
> Like if it's found through vector search it won't apply keyword search, and the other way around.
That's the purpose of the rank fusion score method. If a record is found by both methods, it will get a higher score than otherwise.
Nevertheless, I will check again with the data and confirm.
ó
I mean if it's found with one method, it won't apply the second, and vice versa.
k
Yes. You want to boost records that are found by both methods. That's how rank fusion works.
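(For anyone following along: a minimal sketch of how reciprocal-rank fusion combines the two ranked lists. It's an illustration consistent with the 0.5 and 0.6667 scores seen in this thread at alpha 0.5, not the actual internal implementation.)
// Sketch of reciprocal-rank fusion: each method contributes 1 / (1-based rank),
// weighted by alpha (vector) and 1 - alpha (keyword); a record missing from a
// list contributes 0 from that side.
public static class RankFusionSketch
{
    public static double Score(int? keywordRank, int? vectorRank, double alpha = 0.5)
    {
        double keywordPart = keywordRank.HasValue ? 1.0 / keywordRank.Value : 0.0;
        double vectorPart  = vectorRank.HasValue  ? 1.0 / vectorRank.Value  : 0.0;
        return (1 - alpha) * keywordPart + alpha * vectorPart;
    }
}
// Score(1, 3)    ≈ 0.5 + 0.167 = 0.667  (the rank_fusion_score on the sample hit above)
// Score(1, null) = 0.5                  (found by keyword search only)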
ó
But the problem is that those records would actually be found by both, yet the score isn't computed that way. It's giving a value of up to 0.5 when the actual value, if both were computed, could be 0.8 or 0.9 and be way higher. Thus the ordering is wrong and not truly comparable.
k
Can you post the collection schema? The one you posted earlier in the other thread does not have an embedding field.
ó
Sure, let me finish some things first!
Here's a subset, but I get different results with it, I don't know why.
{
  "name": "test",
  "fields": [
    {
      "name": "embedding",
      "type": "float[]",
      "facet": false,
      "optional": false,
      "index": true,
      "sort": false,
      "infix": false,
      "locale": "",
      "hnsw_params": {
        "M": 16,
        "ef_construction": 200
      },
      "num_dim": 1024,
      "stem": false,
      "store": true,
      "vec_dist": "cosine"
    },
    {
      "name": "externalStatusCode",
      "type": "string",
      "facet": false,
      "optional": false,
      "index": true,
      "sort": false,
      "infix": false,
      "locale": "",
      "stem": false,
      "store": true
    },
    {
      "name": "cpv",
      "type": "string[]",
      "facet": false,
      "optional": false,
      "index": true,
      "sort": false,
      "infix": false,
      "locale": "",
      "stem": false,
      "store": true
    },
    {
      "name": "contractingPartyName",
      "type": "string",
      "facet": false,
      "optional": false,
      "index": true,
      "sort": false,
      "infix": false,
      "locale": "es",
      "stem": false,
      "store": true
    },
    {
      "name": "externalReference",
      "type": "string",
      "facet": false,
      "optional": true,
      "index": true,
      "sort": false,
      "infix": false,
      "locale": "es",
      "stem": false,
      "store": true
    },
    {
      "name": "title",
      "type": "string",
      "facet": false,
      "optional": false,
      "index": true,
      "sort": true,
      "infix": true,
      "locale": "es",
      "stem": true,
      "store": true
    }
  ],
  "default_sorting_field": "",
  "enable_nested_fields": false,
  "symbols_to_index": [],
  "token_separators": []
}
I no longer understand what's happening in the image, but here's the query, with the embedding removed (text-embeddings-3-large, same for both runs).
{
  "searches": [
    {
      "collection": "test",
      "vector_query": "embedding:([....],flat_search_cutoff:0,alpha:0.5,ef:128)",
      "q": "Pistas de pádel",
      "query_by": "title,externalReference,contractingPartyName",
      "prefix": false,
      "query_by_weights": "5,10,5",
      "sort_by": "_text_match:desc",
      "prioritize_exact_match": true,
      "prioritize_token_position": true,
      "page": 0,
      "per_page": 100,
      "highlight_full_fields": "title",
      "exhaustive_search": false,
      "num_typos": 0,
      "typo_tokens_threshold": 0,
      "drop_tokens_threshold": 0
    }
  ]
}
h
Was there any conclusion here? I have a similar question
k
@Óscar Vicente Can you please provide the actual embedding used in the search query shown in the screenshot so I can debug?
"vector_query":"embedding:([....]
ó
Embeddings
There you go!
k
This is indeed a case of some records matching only on vector search but not on keyword search. Let's take the Cimentacion document as an example. It does not contain the word pádel, so it does not match in keyword search. Nevertheless, we highlight any words in the query that are present in the document, even if that document was found via vector search. This is why tokens_matched is also 0. Look into drop_tokens_threshold if you want keyword search to match partial words from the query.
ó
My point was more the other way around. In the screenshots you can see two different cases:
• VectorDistance is null (4th row in the screenshot). This means it was found only using keyword search. However, if you look at the 4th column, you can see it gives a 0.55 cosine distance (I calculated it), which is a solid match.
• VectorDistance is not null (first 3 rows), and tokens_matched is always 0. Yet as you can see in the screenshot, there are matched_tokens. But given the algorithm for the rank_fusion_score using an alpha of 0.5, the first row has also been found with the keyword search.
There are two problems here:
• tokens_matched being 0 with matched_tokens having values. It seems like a problem when there are records found with vector search ONLY/AND with keyword search.
• The rank_fusion_score sorting being completely wrong, because records that are found with one type of search but SHOULD have been found by the other as well don't get a score as high as they should. All of these rows should have been found by both algorithms, thus their score should have been way higher. So a record that is found by both algorithms would rank higher than one that is only found by one but should have been found by both, even if the latter matches better.
I think this problem is more relevant with a huge dataset, where optimizations to avoid calculating the vector_distance happen more often and on more items. The real dataset is 3M+ documents and more than 20 GB. I gave you a smaller one so it could be shared; maybe you can test with it correctly. Maybe I can share the real one with you, but you will need 32 GB of RAM or more (I'm at 64 GB).
k
> tokens_matched being 0 with matched_tokens having values.
The matched_tokens are populated by the highlight logic that runs, so the definitive indicator of whether the record was found via keyword search is the tokens_matched integer value. When it's 0, that record was only found via vector search. During highlighting, we will add any tokens in the query that are found in the field values to matched_tokens -- this does not mean that the record was found via keyword search.
> All of these rows should have been found by both algorithms, thus their score should have been way higher.
If you tweak the drop_tokens_threshold value, you will let keyword search find these partial matches. It does not do so because it found enough matches that had all the query tokens. Likewise with vector search: if you increase k, you could see those records with a null vector_distance showing up -- again, there were vectors with smaller distances ahead.
> But given the algorithm for the rank_fusion_score using an alpha of 0.5, the first row has also been found with the keyword search.
I suspect that the 0.5056718 value is a float precision issue. When I run the sample query you gave, I see only 0.5 exactly.
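(For context, a rough sketch of how those two knobs could be set when building the search request client-side; parameter names are from the Typesense docs, the values are purely illustrative, and the embedding stays elided as in the query above:)
using System.Collections.Generic;

// Illustrative only: raise k inside vector_query so more records get a vector_distance,
// and raise drop_tokens_threshold so keyword search also keeps partial matches.
var searchParameters = new Dictionary<string, object>
{
    ["q"] = "Pistas de pádel",
    ["query_by"] = "title,externalReference,contractingPartyName",
    ["vector_query"] = "embedding:([....], k: 200, flat_search_cutoff: 0, alpha: 0.5, ef: 128)",
    ["drop_tokens_threshold"] = 1,
    ["per_page"] = 100
};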
ó
But none of them are partial matches, and even if the vector distance is smaller, it's still relevant to the sorting. If both of them had a value, as they truly match, the sorting weight would be different. The thing is that the final set joining both kinds of searches should have the scoring run on all of them to give a precise sorting score; otherwise the ranking is not accurate, just good enough until it isn't. In my case it needs to be accurate, but I guess an e-store doesn't need that level of accuracy. I'm using a 3M+ dataset, so these issues are way more obvious, as it's more likely that a relevant record ends up below a less relevant record (which is the issue I'm tracking, as my users report it). It's found, it just ends up in an incorrect position after sorting.
k
> But none of them are partial matches
Does this happen in the smaller set you shared? At least for the query you've shared, when I checked the hits with tokens_matched: 0, they were all partial matches.
ó
Hmmm, let me check again. If not, can I share the 20 GB dataset with you? I'll reduce the fields and test it beforehand, so maybe it will be 15 GB. Not ideal, but I think it's the only way. And you'll get a weird dataset to test on xD
k
It will not be possible to debug on such a large dataset. I hope we can reproduce it on the smaller set. Maybe you can copy that record from the large set into the smaller set to see if it produces the same behavior?
ó
My bet is that it will only be that obvious in the bigger one, but I'll try.
k
Btw, the query you've shared is Pistas de pádel, but in the screenshot the word Construccion is highlighted, so is that a different query?
ó
I think it's a "scale" problem
Yeah, could be, but it happens also with only "Pistas de pádel". Here's the updated screenshot
I can reproduce it consistently with many results
message has been deleted
k
Ok, so it's possible that there are records that match all keywords exactly but still fall outside the requested per_page number. And the same record could be found in vector search.
What is the found count when you do only keyword search?
ó
403 out of 3M
k
Right, so that's why tokens_matched is zero. That record must appear outside the per_page limit of 100 but get picked up in vector search. As you said, maybe we need to do a second pass at scoring records from both phases.
ó
That would be ideal, at least for our use case, where we truly need accurate sorting, not a ballpark. But I understand making it optional if it's too computationally expensive. Anyway, I can try to reproduce it and give you a better dataset to debug with in case you need it.
k
No, I think the larger set is not needed. I understand what's happening.
🤘 1
ó
Thank you for the help! I hope you could implement something!
l
We actually have almost the same requirement as @Óscar Vicente, and I'm wondering how we can calculate such a score. Is there anything we could do in the meantime, while the two-pass problem persists? Cosine similarity is difficult to use as a raw value, the fusion score isn't helpful, and text_match_info doesn't give much extra signal. We probably have different datasets, but happy to brainstorm together as well @Óscar Vicente
k
We will be working on the two pass in a week or two.
❤️ 1
🙌 2
ó
@Leon Wolf I'm going with:
matchedTokens / queryTokens * 0.6 + (vector_distance ?? 0) * 0.4
Since both the division and the vector distance are within the 0-1 range and are representative of how good the match is. For matched tokens, my workaround for now is to take all the unique matched tokens from the highlights array; in C#, I'm using a HashSet to do it. For the vector distance, if it's null I'll take a 0, or I'll fetch the embedding and calculate the similarity to give the score. The only problem is that I can't use this value to truly sort the source list until the two pass is ready, but it's a good enough solution for now.
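Roughly, in code (names are mine, and I treat 1 - vector_distance as the similarity term from the formula above; a sketch of the idea rather than a polished implementation):
using System;
using System.Collections.Generic;
using System.Linq;

public static class RelevanceSketch
{
    // Inputs come from a single Typesense hit: the matched_tokens arrays of each
    // highlighted field, and vector_distance (null when the hit was keyword-only).
    public static double Score(string query, IEnumerable<IEnumerable<string>> matchedTokensPerField, double? vectorDistance)
    {
        string[] queryTokens = query.Split(' ', StringSplitOptions.RemoveEmptyEntries);

        // Unique matched tokens across all highlighted fields, case-insensitive.
        var matched = new HashSet<string>(matchedTokensPerField.SelectMany(t => t),
                                          StringComparer.OrdinalIgnoreCase);

        double tokenScore = queryTokens.Length == 0
            ? 0
            : Math.Min(1.0, (double)matched.Count / queryTokens.Length);

        // vector_distance is a cosine distance, so 1 - distance approximates similarity;
        // when it is missing, fall back to 0 (or recompute it from the embeddings).
        double vectorScore = vectorDistance.HasValue ? 1.0 - vectorDistance.Value : 0.0;

        return 0.6 * tokenScore + 0.4 * vectorScore;   // shown to the user as a percentage
    }
}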
Hi @Kishore Nallan, any update on the two pass?
k
Yes, there is a new rerank_hybrid_matches boolean flag in 28.0.rc16 -- please try it out and let me know how it works.
🙌 1
ó
Just tested it using "rerank_hybrid_matches ": true, but it still shows the previous behavior of missing the VectorDistance field and having "tokens_matched": 0
k
I'm not sure if we are adding those fields now, but the scores are definitely recalculated. For example, when a vector search result does not occur in keyword search, we now compute the score for keyword search by looking at the components.
ó
It would be awesome to have those fields for calculating the score; otherwise I'll always have to fetch the embedding fields, causing a lot of bandwidth and latency.
k
You are still going to compute a client score?
ó
I still need a score to show a "Relevance Score" to the client
k
Ok, but ranking wise, does it look better now?
ó
The queries I have for testing still show the same results and ordering, and without having the fields I can't really double-check.
k
Ok will check tomorrow and confirm.
ó
Thanks!
Also, could this affect the use of _vector_distance as a sort_by option? It seems that it is null and doesn't work with buckets. Or is it a different issue?
k
We don't support buckets with vector distance. Where did you see that?
ó
Hmm, nowhere. I understood it from the documentation. I thought buckets worked with anything
k
Buckets are implemented only for _text_match at the moment.
Actually I just checked the code and we are populating the missing vector distance and text match score. I need a sample dataset and query to figure out what's going wrong.
ó
rc16?
k
Yes
Hi there! Did you have a chance to try it?
Maybe it's not in rc16, but another one
k
Are you still sending the explicit query via vector_query instead of relying on the built-in embedding?
ó
Yup
k
I just ran that query. With per_page: 100, all 100 hits have the vector_distance score. However, some indeed have "tokens_matched": 0 -- I will check why that's so, but all of them have the vector distance populated now.
ó
That's the query I'm using. The very first result doesn't have a vector distance.
FML, now I see the error. I copy-pasted from Slack, so I got:
"rerank_hybrid_matches ": true,
Instead of:
"rerank_hybrid_matches": true,
k
So it works now?
ó
Vector distance yes, tokens matched no
k
I'm checking on the tokens_matched now
🙌 1
ó
Nice!
k
I've fixed this issue, will share a build later today / tomorrow. The value was getting overridden.
ó
Awesome! That was fast
It seems to be fixed in rc18. However, if I reuse a request that I used with the previous version, I still get tokens_matched 0. Is there any cache I can clear?
k
What do you mean by reuse?
Yeah it's fixed in rc18
ó
Like, If I send the very same json
k
For other requests, it works?
ó
yup
k
Unless you use the use_cache parameter, nothing gets cached. And even use_cache only caches for 60s by default.
ó
Then there's something not working:
{
  "document": {
    "awardedProposalAmounts": [],
    "awardedProposalPartyNames": [],
    "awardedProposalPartyVatIds": [],
    "contractTypeId": "3",
    "contractingPartyId": [
      "31604070204167",
      "P5030300G"
    ],
    "contractingPartyIdHierarchy": [
      "P5030300G",
      "",
      "",
      "",
      "",
      ""
    ],
    "contractingPartyName": "Consejería de Urbanismo, Infraestructuras, Energía y Vivienda del Ayuntamiento de Zaragoza",
    "contractingPartyNameHierarchy": [
      "Consejería de Urbanismo, Infraestructuras, Energía y Vivienda del Ayuntamiento de Zaragoza",
      "Zaragoza",
      "Ayuntamientos",
      "Aragón",
      "ENTIDADES LOCALES",
      "Sector Público"
    ],
    "contractingSystemTypeId": "0",
    "cpv": [
      "45212210",
      "45212200"
    ],
    "documentsCount": 7,
    "externalReference": "0043108-24",
    "externalStatusCode": "EV",
    "id": "licitacionesPerfilContratante/15630264",
    "internalId": "7e97cd55-4e93-5b6a-bb72-b93e2fa9b991",
    "link": "<https://contrataciondelestado.es/wps/poc?uri=deeplink:detalle_licitacion&idEvl=WQALDEBXv%2BA2wEhQbcAqug%3D%3D>",
    "locationNutsCode": "ES243",
    "locationNutsName": "Zaragoza",
    "lotsCount": 0,
    "parentsIds": [],
    "parentsNames": [
      "Zaragoza",
      "Ayuntamientos",
      "Zaragoza",
      "Aragón",
      "ENTIDADES LOCALES",
      "Sector Público"
    ],
    "procedureTypeId": "9",
    "projectBudgetWithTaxes": 316896.36,
    "projectBudgetWithoutTaxes": 261897.82,
    "projectPlannedPeriodDuration": 5,
    "projectPlannedPeriodDurationUnitCode": "MON",
    "projectPlannedPeriodEndDate": -62135596800,
    "projectPlannedPeriodStartDate": -62135596800,
    "tenderSubmissionEndDateTime": 1727701140,
    "title": "Dos pistas de padel cubiertas en el Barrio de Casetas. Zaragoza. Convenio DPZ.",
    "updated": 1729253205
  },
  "highlight": {
    "contractingPartyName": {
      "matched_tokens": [
        "de",
        "de"
      ],
      "snippet": "Consejería <mark>de</mark> Urbanismo, Infraestructuras, Energía y Vivienda del Ayuntamiento <mark>de</mark> Zaragoza"
    },
    "title": {
      "matched_tokens": [
        "pistas",
        "de",
        "de"
      ],
      "snippet": "Dos <mark>pistas</mark> <mark>de</mark> padel cubiertas en el Barrio <mark>de</mark> Casetas. Zaragoza. Convenio DPZ.",
      "value": "Dos <mark>pistas</mark> <mark>de</mark> padel cubiertas en el Barrio <mark>de</mark> Casetas. Zaragoza. Convenio DPZ."
    }
  },
  "highlights": [
    {
      "field": "title",
      "matched_tokens": [
        "pistas",
        "de",
        "de"
      ],
      "snippet": "Dos <mark>pistas</mark> <mark>de</mark> padel cubiertas en el Barrio <mark>de</mark> Casetas. Zaragoza. Convenio DPZ.",
      "value": "Dos <mark>pistas</mark> <mark>de</mark> padel cubiertas en el Barrio <mark>de</mark> Casetas. Zaragoza. Convenio DPZ."
    },
    {
      "field": "contractingPartyName",
      "matched_tokens": [
        "de",
        "de"
      ],
      "snippet": "Consejería <mark>de</mark> Urbanismo, Infraestructuras, Energía y Vivienda del Ayuntamiento <mark>de</mark> Zaragoza"
    }
  ],
  "hybrid_search_info": {
    "rank_fusion_score": 0.5032680034637451
  },
  "text_match": 3315704463360,
  "text_match_info": {
    "best_field_score": "1618996320",
    "best_field_weight": 0,
    "fields_matched": 0,
    "num_tokens_dropped": 3,
    "score": "3315704463360",
    "tokens_matched": 0,
    "typo_prefix_score": 159
  },
  "vector_distance": 0.4084985852241516
}
You can see there are matched_tokens, but "tokens_matched" is 0. It only happens if the query contains an accent, or at least that's how it seems:
"q": "Pistas de p\u00E1del"
If I remove the accent, everything has tokens_matched, but the results are wildly different.
"q": "Pistas de padel"
k
Ok will check
🙌 1
👀 1
Not able to reproduce this. Can you paste the actual query you are sending as curl request with the full payload?
ó
Sin título
Sure Thing!
@Kishore Nallan Were you able to reproduce it?
k
Not yet, will check and get back to you in a day or two.
🤘 1
When I index the docs you had sent earlier locally and run that query, I actually don't see any "tokens_matched": 0
ó
This is a snippet of the first three results I get. The second one doesn't have tokens_matched, but it has matched_tokens. I'll review it again in more depth.
With this dataset (It's a bit bigger, but I had to do it for testing)
Notice that if you change from:
"q": "Pistas de pádel",
to
"q": "Pistas de padel",
it no longer happens. It happens with the second document, the one with the field
"externalReference": "0043108-24"
k
Ok I can reproduce with this dataset. Will check.
ó
Sorry about the weird datasets 🤣 But hey, at least we can catch the edge cases. I think it also has to do with accents.
k
I suspect that the second pass text matching logic has limited information on computing the text match score. I will confirm, but in that case, there is not much we can do.
👀 1
ó
In that case, fixing the matched tokens from the highlights, handling the accents myself, can do the trick for me.
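Something along these lines -- stripping diacritics with standard Unicode normalization before deduplicating, just a sketch with my own naming:
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;

public static class AccentInsensitiveTokens
{
    // Strip combining diacritical marks via Unicode form D normalization,
    // so "pádel" and "padel" end up comparing equal.
    public static string StripDiacritics(string token)
    {
        string decomposed = token.Normalize(NormalizationForm.FormD);
        var builder = new StringBuilder(decomposed.Length);
        foreach (char ch in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(ch) != UnicodeCategory.NonSpacingMark)
                builder.Append(ch);
        }
        return builder.ToString().Normalize(NormalizationForm.FormC);
    }

    // Count unique matched tokens from the highlights, ignoring accents and case.
    public static int CountUnique(IEnumerable<string> matchedTokens) =>
        new HashSet<string>(matchedTokens.Select(StripDiacritics),
                            StringComparer.OrdinalIgnoreCase).Count;
}
// AccentInsensitiveTokens.CountUnique(new[] { "pádel", "padel", "pistas" }) == 2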
k
Should be fixed in 28.0.rc23 -- please check and confirm.
ó
Sure! I'll reach back in 2h!
👍 1
Now it works, but it's reporting the matched tokens incorrectly because of the accent. It's not being taken into account.
k
Yes, that's because you're using the es locale, which preserves the diacritics. So the word in the query and the word in the document are treated differently. The fields_matched now matches the exact number of unique highlighted words. This is the best we can do in a second-pass re-ranking.
ó
Maybe that's the key to the problem with locales! Depending on the language, diacritics could mean it's another letter or just the same letter. In most cases, you want to match those words both with and without diacritics. Anyway, that's for another thread.
k
If you don't set an es locale, diacritics are removed.
ó
Do you lose other features? Or does it just affect the diacritics?
k
In some languages those have specific meaning, so we have resorted to not removing them when a specific locale is used.
For Spanish, I think not setting a locale should work out of the box.
ó
Got it, so if it only affects diacritics, I'll do it. I thought it also affected the way you normalize words and things like plurals and so on.
👍 1
Last question, does it affect how stemming works?
k
We use the Snowball stemmer library, so you have to try it out. Maybe the rules are different, in which case, yes, that would affect it.
ó
It does, damn. That's why I got so many inconsistent results. It will match the results but not show them as matched, hence the highlighting will be broken and we can't tell what is matching what.
k
Are you using stemming primarily for handling plurals?
ó
We tried, but not currently. We had to remove it because we got inconsistent results, and this is why. I'll have to test it again now that I know this.
k
If it's primarily plurals, a dictionary-based approach is better. We are working on a feature, available next week, that will help.
ó
I'm mixing things into the original discussion anyway. We are discussing accents and related topics here, in case you want to move the discussion: https://typesense-community.slack.com/archives/C01P749MET0/p1729200863886629
👍 1
That would help a lot!
But it won't be perfect for verbs.
But hey, it will be better anyway.