# community-help
ó
Hey there! Is there any way to show the "relevance" of a result within a query? Like using the "rank_fusion_score"? Use case: we have a search platform for projects, and we would like to show all the possible results, including those that are a relatively low match, but show next to each result a score like 0% to 100% accuracy or a green-yellow-red circle. Can I use the rank_fusion_score as a proxy for it in some way? Did anyone try?
k
Rank fusion score basically derives a unified score from the positions of a result in vector search vs keyword search. For keyword search we use various signals for computing a score, but it's not a percentage because of how we have to rank the signals. However, we do expose a text_match_info object in the response that describes additional information, like the number of fields matched, which offers some explanation.
ó
Yup, I read it in the docs! The question is more about whether I can use them to show the final user something like a percentage of accuracy based on any of them. For example, since rank_fusion_score goes from 0 to 1, can I convert it to an absolute percentage? Like, a 0.9 would be 90% accuracy.
k
I don't think there is a way to convert them to a percentage like that.
ó
Is there any other way? Even translate them to green - yellow - red or something?
k
Typesense ranking scores are always relative, there is no absolute scale to compare them with.
ó
Hmm, and within the query? Like the top one will always be green, and if there's a difference of, let's say, 10% show it as yellow, and a bigger difference show it as red?
k
No, not possible.
ó
So there's no workaround or possible approach to show the final user how "accurate" or "relevant" a result is against the rest of the results?
k
You will have to derive a score like that based on your own business logic from the fields in text_match_info.
ó
Any suggestions on how?
k
You have to come up with something that makes sense for your use case. For example, you can divide tokens matched by the total tokens in the query and then average it with the embedding similarity, etc.
ó
So something along the lines of:
• 2 tokens matched out of 3 in the query -> 0.66
• Embedding similarity -> 0.5
• Weighted average giving 0.6 to matched tokens and 0.4 to embeddings -> ≈ 0.60
• As a percentage -> ~60%
Instead of embeddings, can I use the rank fusion score? It seems a more precise reflection of the accuracy, but I guess it can be misleading if it doesn't go from 0 to 1.
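(Quick sanity check of that arithmetic, nothing more:)
// Two of three query tokens matched, embedding similarity 0.5,
// weights 0.6 for tokens and 0.4 for embeddings.
double tokenScore = 2.0 / 3.0;                                   // ≈ 0.667
double embeddingSimilarity = 0.5;
double combined = 0.6 * tokenScore + 0.4 * embeddingSimilarity;  // 0.4 + 0.2 ≈ 0.60
System.Console.WriteLine(combined.ToString("P0"));               // prints roughly "60%"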
k
As I said, you have to experiment and figure out what works for your use case.
ó
Got it
Trying to make this work, I noticed that I'm missing text_match_info.tokens_matched, as it always shows zero. Is it a bug, @Kishore Nallan?
{
  "document": {
    "title": "Mejoras en la pista de pádel del Casal"
  },
  "highlight": {
    "title": {
      "matched_tokens": [
        "pista",
        "pádel"
      ],
      "snippet": "Mejoras en la <mark>pista</mark> de <mark>pádel</mark> del Casal",
      "value": "Mejoras en la <mark>pista</mark> de <mark>pádel</mark> del Casal"
    }
  },
  "highlights": [
    {
      "field": "title",
      "matched_tokens": [
        "pista",
        "pádel"
      ],
      "snippet": "Mejoras en la <mark>pista</mark> de <mark>pádel</mark> del Casal",
      "value": "Mejoras en la <mark>pista</mark> de <mark>pádel</mark> del Casal"
    }
  ],
  "hybrid_search_info": {
    "rank_fusion_score": 0.6666666865348816
  },
  "text_match": 1042983595,
  "text_match_info": {
    "best_field_score": "509269",
    "best_field_weight": 85,
    "fields_matched": 3,
    "num_tokens_dropped": 2,
    "score": "1042983595",
    "tokens_matched": 0,
    "typo_prefix_score": 255
  },
  "vector_distance": 0.5202423930168152
}
k
For vector-only matches, tokens_matched will be zero.
ó
It's not a vector only query, I'm passing:
"q": "pistas de pádel",
"query_by": "title,externalReference,contractingPartyName,locationNutsName,locationNutsCode",
That's why I have highlights with marks and matched_tokens
k
Then I need to look at the dataset to see what's happening
27 rc30 instead
k
Will check and get back to you.
ó
Working on the algorithm, I found that sometimes the vector_distance is null and then you do have the tokens_matched field. But when the vector_distance is null, I don't get the rank_fusion_score. Anyway, maybe this helps you figure out what to look for.
k
Vector distance is null if that record was found only via keyword search.
ó
What I mean is that the tokens_matched field only has a value when vector_distance is null.
k
> But when the vector_distance is null, I don't get the rank_fusion_score.
The rank_fusion_score is 0.5, right?
ó
yup
k
I still think that the case where tokens_matched is zero is when we don't find that match in keyword search. The highlighting will happen nevertheless, because any word that is present in both the query and the field will be highlighted. I will confirm after testing.
ó
Hmmm, then there's something else going on. Like if it's found through vector search it won't apply keyword search, and the other way around. That would also render the rank_fusion_score unusable for sorting, as it won't be truly consistent. The other thing I found is that even when the VectorDistance is null, the distance, if you calculate it yourself, can still represent a solid match. Here's the cosine similarity calculated with this package: https://learn.microsoft.com/en-us/dotnet/api/system.numerics.tensors.tensorprimitives.[…]imilarity?view=net-8.0&viewFallbackFrom=dotnet-plat-ext-8.0
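For reference, the client-side calculation is roughly this (dummy vectors just to show the call; in practice the inputs are the query embedding and the document's stored embedding field):
using System;
using System.Numerics.Tensors;   // NuGet package: System.Numerics.Tensors (.NET 8)

// Dummy 4-dimensional vectors just to show the call; in practice these are the query
// embedding and the document's stored "embedding" field (1024 dimensions in my schema).
float[] queryEmbedding    = { 0.1f, 0.3f, 0.5f, 0.7f };
float[] documentEmbedding = { 0.2f, 0.1f, 0.4f, 0.9f };

// CosineSimilarity returns a value in [-1, 1]; with a cosine index, Typesense's
// vector_distance is the cosine distance, i.e. roughly 1 - similarity.
float similarity = TensorPrimitives.CosineSimilarity(queryEmbedding, documentEmbedding);
Console.WriteLine($"similarity = {similarity:F3}, distance ≈ {1f - similarity:F3}");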
k
> Like if it's found through vector search it won't apply keyword search, and the other way around.
That's the purpose of the rank fusion score method. If a record is found by both methods, it will get a higher score than otherwise.
Nevertheless, I will check again with the data and confirm.
ó
I mean if it's found with one method, it won't apply the second, and vice versa.
k
Yes. You want to boost records that are found by both methods. That's how rank fusion works.
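(For anyone following along: a minimal sketch of how reciprocal-rank fusion combines the two ranked lists. It's an illustration consistent with the 0.5 and 0.6667 scores seen in this thread at alpha 0.5, not the actual internal implementation.)
// Sketch of reciprocal-rank fusion: each method contributes 1 / (1-based rank),
// weighted by alpha (vector) and 1 - alpha (keyword); a record missing from a
// list contributes 0 from that side.
public static class RankFusionSketch
{
    public static double Score(int? keywordRank, int? vectorRank, double alpha = 0.5)
    {
        double keywordPart = keywordRank.HasValue ? 1.0 / keywordRank.Value : 0.0;
        double vectorPart  = vectorRank.HasValue  ? 1.0 / vectorRank.Value  : 0.0;
        return (1 - alpha) * keywordPart + alpha * vectorPart;
    }
}
// Score(1, 3)    ≈ 0.5 + 0.167 = 0.667  (the rank_fusion_score on the sample hit above)
// Score(1, null) = 0.5                  (found by keyword search only)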
ó
But the problem is that those records would actually be found by both, yet the score isn't computed that way. It's giving a value of up to 0.5 when the actual value, if both were computed, could be 0.8 or 0.9 and be way higher. Thus the ordering is wrong and not truly comparable.
k
Can you post the collection schema? The one you posted earlier in the other thread does not have an embedding field.
ó
Sure, let me finish some things first!
Here's a subset, but I get different results with it, I don't know why.
{
  "name": "test",
  "fields": [
    {
      "name": "embedding",
      "type": "float[]",
      "facet": false,
      "optional": false,
      "index": true,
      "sort": false,
      "infix": false,
      "locale": "",
      "hnsw_params": {
        "M": 16,
        "ef_construction": 200
      },
      "num_dim": 1024,
      "stem": false,
      "store": true,
      "vec_dist": "cosine"
    },
    {
      "name": "externalStatusCode",
      "type": "string",
      "facet": false,
      "optional": false,
      "index": true,
      "sort": false,
      "infix": false,
      "locale": "",
      "stem": false,
      "store": true
    },
    {
      "name": "cpv",
      "type": "string[]",
      "facet": false,
      "optional": false,
      "index": true,
      "sort": false,
      "infix": false,
      "locale": "",
      "stem": false,
      "store": true
    },
    {
      "name": "contractingPartyName",
      "type": "string",
      "facet": false,
      "optional": false,
      "index": true,
      "sort": false,
      "infix": false,
      "locale": "es",
      "stem": false,
      "store": true
    },
    {
      "name": "externalReference",
      "type": "string",
      "facet": false,
      "optional": true,
      "index": true,
      "sort": false,
      "infix": false,
      "locale": "es",
      "stem": false,
      "store": true
    },
    {
      "name": "title",
      "type": "string",
      "facet": false,
      "optional": false,
      "index": true,
      "sort": true,
      "infix": true,
      "locale": "es",
      "stem": true,
      "store": true
    }
  ],
  "default_sorting_field": "",
  "enable_nested_fields": false,
  "symbols_to_index": [],
  "token_separators": []
}
I no longer understand what's happening in the image, but here's the query, with the embedding removed (text-embeddings-3-large, same for both runs).
{
  "searches": [
    {
      "collection": "test",
      "vector_query": "embedding:([....],flat_search_cutoff:0,alpha:0.5,ef:128)",
      "q": "Pistas de pádel",
      "query_by": "title,externalReference,contractingPartyName",
      "prefix": false,
      "query_by_weights": "5,10,5",
      "sort_by": "_text_match:desc",
      "prioritize_exact_match": true,
      "prioritize_token_position": true,
      "page": 0,
      "per_page": 100,
      "highlight_full_fields": "title",
      "exhaustive_search": false,
      "num_typos": 0,
      "typo_tokens_threshold": 0,
      "drop_tokens_threshold": 0
    }
  ]
}
h
Was there any conclusion here? I have a similar question
k
@Óscar Vicente Can you please provide the actual embedding used in the search query shown in the screenshot so I can debug?
"vector_query":"embedding:([....]
ó
Embeddings
There you go!
k
This is indeed a case of some records matching only on vector search but not on keyword search. Let's take the Cimentacion document as an example. It does not contain the word pádel, so it does not match in keyword search. Nevertheless, we highlight any words in the query that are present in the document, even if that document was found via vector search. This is why tokens_matched is also 0. Look into drop_tokens_threshold if you want keyword search to match partial words from the query.
ó
My point was more the other way around. In the screenshots you can see two different cases:
• VectorDistance is null (4th row in the screenshot). This means it was found only using keyword search. However, if you look at the 4th column, you can see it gives a 0.55 cosine distance (I calculated it), which is a solid match.
• VectorDistance is not null (first 3 rows), and tokens_matched is always 0. Yet as you can see in the screenshot, there are matched_tokens. But given the algorithm for the rank_fusion_score using an alpha of 0.5, the first row has also been found with the keyword search.
There are two problems here:
• tokens_matched being 0 with matched_tokens having values. It seems like a problem when there are records found with vector search ONLY/AND with keyword search.
• The rank_fusion_score sorting being completely wrong, because records that are found with one type of search but SHOULD have been found by the other as well don't get a score as high as they should. All of these rows should have been found by both algorithms, thus their score should have been way higher. So a record that is found by both algorithms would rank higher than one that is only found by one but should have been found by both, even if the latter matches better.
I think this problem is more relevant with a huge dataset, where optimizations to avoid calculating the vector_distance happen more often and on more items. The real dataset is 3M+ documents and more than 20 GB. I gave you a smaller one so it could be shared; maybe you can test with it correctly. Maybe I can share the real one with you, but you will need 32 GB of RAM or more (I'm at 64 GB).
k
> tokens_matched being 0 with matched_tokens having values.
The matched_tokens are populated by the highlight logic that runs, so the definitive indicator of whether the record was found via keyword search is the tokens_matched integer value. When it's 0, that record was only found via vector search. During highlighting, we will add any tokens in the query that are found in the field values to matched_tokens -- this does not mean that the record was found via keyword search.
> All of these rows should have been found by both algorithms, thus their score should have been way higher.
If you tweak the drop_tokens_threshold value, you will let keyword search find these partial matches. It does not do so because it found enough matches that had all the query tokens. Likewise with vector search: if you increase k, you could see those records with a null vector_distance showing up -- again, there were vectors with smaller distances ahead.
> But given the algorithm for the rank_fusion_score using an alpha of 0.5, the first row has also been found with the keyword search.
I suspect that the 0.5056718 value is a float precision issue. When I run the sample query you gave, I see only 0.5 exactly.
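(For context, a rough sketch of how those two knobs could be set when building the search request client-side; parameter names are from the Typesense docs, the values are purely illustrative, and the embedding stays elided as in the query above:)
using System.Collections.Generic;

// Illustrative only: raise k inside vector_query so more records get a vector_distance,
// and raise drop_tokens_threshold so keyword search also keeps partial matches.
var searchParameters = new Dictionary<string, object>
{
    ["q"] = "Pistas de pádel",
    ["query_by"] = "title,externalReference,contractingPartyName",
    ["vector_query"] = "embedding:([....], k: 200, flat_search_cutoff: 0, alpha: 0.5, ef: 128)",
    ["drop_tokens_threshold"] = 1,
    ["per_page"] = 100
};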
ó
But none of them are partial matches, and even if the vector distance is smaller, it's still relevant to the sorting. If both of them had a value, as they truly match, the sorting weight would be different. The thing is that the final set joining both kinds of searches should have the scoring run on all of them to give a precise sorting score; otherwise the ranking is not accurate, just good enough until it isn't. In my case it needs to be accurate, but I guess an e-store doesn't need that level of accuracy. I'm using a 3M+ dataset, so these issues are way more obvious, as it's more likely that a relevant record ends up below a less relevant record (which is the issue I'm tracking, as my users report it). It's found, it just ends up in an incorrect position after sorting.
k
> But none of them are partial matches
Does this happen in the smaller set you shared? At least for the query you've shared, when I checked the hits with tokens_matched: 0, they were all partial matches.
ó
Hmmm, let me check again. If not, can I share the 20 GB dataset with you? I'll reduce the fields and test it beforehand, so maybe it will be 15 GB. Not ideal, but I think it's the only way. And you'll get a weird dataset to test on xD
k
It will not be possible to debug on such a large dataset. I hope we can reproduce it on the smaller set. Maybe you can copy that record from the large set into the smaller set to see if it produces the same behavior?
ó
My bet is that it will only be that obvious in the bigger one, but I'll try.
k
Btw, the query you've shared is Pistas de pádel, but in the screenshot the word Construccion is highlighted, so is that a different query?
ó
I think it's a "scale" problem
Yeah, could be, but it happens also with only "Pistas de pádel". Here's the updated screenshot
I can reproduce it consistently with many results
message has been deleted
k
Ok, so it's possible that there are records that match all keywords exactly but still fall outside the requested per_page number. And the same record could be found in vector search.
What is the found count when you do only keyword search?
ó
403 out of 3M
k
Right, so that's why tokens_matched is zero. That record must appear outside the per_page limit of 100 but get picked up in vector search. As you said, maybe we need to do a second pass at scoring records from both phases.
ó
That would be ideal, at least for our use case, where we truly need accurate sorting, not a ballpark. But I understand making it optional if it's too computationally expensive. Anyway, I can try to reproduce it and give you a better dataset to debug with in case you need it.
k
No, I think the larger set is not needed. I understand what's happening.
🤘 1
ó
Thank you for the help! I hope you could implement something!
l
We actually have almost the same requirement as @Óscar Vicente, and I'm wondering how we can calculate such a score. Is there anything we could do in the meantime, while the two-pass problem persists? Cosine similarity is difficult to use as a raw value, the fusion score isn't helpful, and text_match_info doesn't give much extra signal. We probably have different datasets, but happy to brainstorm together as well @Óscar Vicente
k
We will be working on the two pass in a week or two.
❤️ 1
🙌 2
ó
@Leon Wolf I'm going with:
matchedTokens / queryTokens * 0.6 + (vector_distance ?? 0) * 0.4
Since both the division and the vector distance are within the 0-1 range and are representative of how good the match is. For matched tokens, my workaround for now is to take all the unique matched tokens from the highlights array; in C#, I'm using a HashSet to do it. For the vector distance, if it's null I'll take a 0, or I'll fetch the embedding and calculate the similarity to give the score. The only problem is that I can't use this value to truly sort the source list until the two pass is ready, but it's a good enough solution for now.
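Roughly, in code (names are mine, and I treat 1 - vector_distance as the similarity term from the formula above; a sketch of the idea rather than a polished implementation):
using System;
using System.Collections.Generic;
using System.Linq;

public static class RelevanceSketch
{
    // Inputs come from a single Typesense hit: the matched_tokens arrays of each
    // highlighted field, and vector_distance (null when the hit was keyword-only).
    public static double Score(string query, IEnumerable<IEnumerable<string>> matchedTokensPerField, double? vectorDistance)
    {
        string[] queryTokens = query.Split(' ', StringSplitOptions.RemoveEmptyEntries);

        // Unique matched tokens across all highlighted fields, case-insensitive.
        var matched = new HashSet<string>(matchedTokensPerField.SelectMany(t => t),
                                          StringComparer.OrdinalIgnoreCase);

        double tokenScore = queryTokens.Length == 0
            ? 0
            : Math.Min(1.0, (double)matched.Count / queryTokens.Length);

        // vector_distance is a cosine distance, so 1 - distance approximates similarity;
        // when it is missing, fall back to 0 (or recompute it from the embeddings).
        double vectorScore = vectorDistance.HasValue ? 1.0 - vectorDistance.Value : 0.0;

        return 0.6 * tokenScore + 0.4 * vectorScore;   // shown to the user as a percentage
    }
}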
Hi @Kishore Nallan, any update on the two pass?
k
Yes, there is a new rerank_hybrid_matches boolean flag in 28.0.rc16 -- please try it out and let me know how it works.
🙌 1
ó
Just tested it using "rerank_hybrid_matches ": true, but it still shows the previous behavior of missing the VectorDistance field and having "tokens_matched": 0
k
I'm not sure if we are adding those fields now, but the scores are definitely recalculated. For example, when a vector search result does not occur in keyword search, we now compute the score for keyword search by looking at the components.
ó
It would be awesome to have those fields for calculating the score; otherwise I'll always have to fetch the embedding fields, causing a lot of bandwidth and latency.
k
You are still going to compute a client score?
ó
I still need a score to show a "Relevance Score" to the client
k
Ok, but ranking wise, does it look better now?
ó
The queries I have for testing still show the same results and ordering, and without having the fields I can't really double-check.
k
Ok will check tomorrow and confirm.
ó
Thanks!
Also, could this affect the use of _vector_distance as a sort_by option? It seems that it is null and doesn't work with buckets. Or is it a different issue?
k
We don't support buckets with vector distance. Where did you see that?
ó
Hmm, nowhere. I understood it from the documentation. I thought buckets worked with anything
k
Buckets are implemented only for _text_match at the moment.
Actually I just checked the code and we are populating the missing vector distance and text match score. I need a sample dataset and query to figure out what's going wrong.
ó
rc16?
k
Yes
Hi there! Did you have a chance to try it?
Maybe it's not in rc16, but another one
k
Are you still sending the explicit query via vector_query instead of relying on the built-in embedding?
ó
Yup
k
I just ran that query. With per_page: 100, all 100 hits have the vector_distance score. However, some indeed have "tokens_matched": 0 -- I will check why that's so, but all of them have the vector distance populated now.
ó
That's the query I'm using. The very first result doesn't have a vector distance.
FML, now I see the error. I copy-pasted from Slack, so I got:
"rerank_hybrid_matches ": true,
Instead of:
"rerank_hybrid_matches": true,
k
So it works now?
ó
Vector distance yes, tokens matched no
k
I'm checking on the tokens_matched now
🙌 1
ó
Nice!
k
I've fixed this issue, will share a build later today / tomorrow. The value was getting overridden.
ó
Awesome! That was fast
It seems to be fixed in rc18. However, if I reuse a request that I used with the previous version, I still get tokens_matched 0. Is there any cache I can clear?
k
What do you mean by reuse?
Yeah it's fixed in rc18
ó
Like, If I send the very same json
k
For other requests, it works?
ó
yup
k
Unless you use the use_cache parameter, nothing gets cached. And even use_cache only caches for 60s by default.
ó
Then there's something not working:
{
  "document": {
    "awardedProposalAmounts": [],
    "awardedProposalPartyNames": [],
    "awardedProposalPartyVatIds": [],
    "contractTypeId": "3",
    "contractingPartyId": [
      "31604070204167",
      "P5030300G"
    ],
    "contractingPartyIdHierarchy": [
      "P5030300G",
      "",
      "",
      "",
      "",
      ""
    ],
    "contractingPartyName": "Consejería de Urbanismo, Infraestructuras, Energía y Vivienda del Ayuntamiento de Zaragoza",
    "contractingPartyNameHierarchy": [
      "Consejería de Urbanismo, Infraestructuras, Energía y Vivienda del Ayuntamiento de Zaragoza",
      "Zaragoza",
      "Ayuntamientos",
      "Aragón",
      "ENTIDADES LOCALES",
      "Sector Público"
    ],
    "contractingSystemTypeId": "0",
    "cpv": [
      "45212210",
      "45212200"
    ],
    "documentsCount": 7,
    "externalReference": "0043108-24",
    "externalStatusCode": "EV",
    "id": "licitacionesPerfilContratante/15630264",
    "internalId": "7e97cd55-4e93-5b6a-bb72-b93e2fa9b991",
    "link": "<https://contrataciondelestado.es/wps/poc?uri=deeplink:detalle_licitacion&idEvl=WQALDEBXv%2BA2wEhQbcAqug%3D%3D>",
    "locationNutsCode": "ES243",
    "locationNutsName": "Zaragoza",
    "lotsCount": 0,
    "parentsIds": [],
    "parentsNames": [
      "Zaragoza",
      "Ayuntamientos",
      "Zaragoza",
      "Aragón",
      "ENTIDADES LOCALES",
      "Sector Público"
    ],
    "procedureTypeId": "9",
    "projectBudgetWithTaxes": 316896.36,
    "projectBudgetWithoutTaxes": 261897.82,
    "projectPlannedPeriodDuration": 5,
    "projectPlannedPeriodDurationUnitCode": "MON",
    "projectPlannedPeriodEndDate": -62135596800,
    "projectPlannedPeriodStartDate": -62135596800,
    "tenderSubmissionEndDateTime": 1727701140,
    "title": "Dos pistas de padel cubiertas en el Barrio de Casetas. Zaragoza. Convenio DPZ.",
    "updated": 1729253205
  },
  "highlight": {
    "contractingPartyName": {
      "matched_tokens": [
        "de",
        "de"
      ],
      "snippet": "Consejería <mark>de</mark> Urbanismo, Infraestructuras, Energía y Vivienda del Ayuntamiento <mark>de</mark> Zaragoza"
    },
    "title": {
      "matched_tokens": [
        "pistas",
        "de",
        "de"
      ],
      "snippet": "Dos <mark>pistas</mark> <mark>de</mark> padel cubiertas en el Barrio <mark>de</mark> Casetas. Zaragoza. Convenio DPZ.",
      "value": "Dos <mark>pistas</mark> <mark>de</mark> padel cubiertas en el Barrio <mark>de</mark> Casetas. Zaragoza. Convenio DPZ."
    }
  },
  "highlights": [
    {
      "field": "title",
      "matched_tokens": [
        "pistas",
        "de",
        "de"
      ],
      "snippet": "Dos <mark>pistas</mark> <mark>de</mark> padel cubiertas en el Barrio <mark>de</mark> Casetas. Zaragoza. Convenio DPZ.",
      "value": "Dos <mark>pistas</mark> <mark>de</mark> padel cubiertas en el Barrio <mark>de</mark> Casetas. Zaragoza. Convenio DPZ."
    },
    {
      "field": "contractingPartyName",
      "matched_tokens": [
        "de",
        "de"
      ],
      "snippet": "Consejería <mark>de</mark> Urbanismo, Infraestructuras, Energía y Vivienda del Ayuntamiento <mark>de</mark> Zaragoza"
    }
  ],
  "hybrid_search_info": {
    "rank_fusion_score": 0.5032680034637451
  },
  "text_match": 3315704463360,
  "text_match_info": {
    "best_field_score": "1618996320",
    "best_field_weight": 0,
    "fields_matched": 0,
    "num_tokens_dropped": 3,
    "score": "3315704463360",
    "tokens_matched": 0,
    "typo_prefix_score": 159
  },
  "vector_distance": 0.4084985852241516
}
You can see there are matched_tokens, but "tokens_matched" is 0. It only happens if the query contains an accent, or at least that's how it seems:
"q": "Pistas de p\u00E1del"
If I remove the accent, everything has tokens_matched, but the results are wildly different.
"q": "Pistas de padel"
k
Ok will check
🙌 1
👀 1
Not able to reproduce this. Can you paste the actual query you are sending as curl request with the full payload?
ó
Sin título
Sure Thing!
@Kishore Nallan Were you able to reproduce it?
k
Not yet, will check and get back to you in a day or two.
🤘 1
When I index the docs you had sent earlier locally and run that query, I actually don't see any "tokens_matched": 0
ó
This is a snippet of the first three results I get. The second one doesn't have tokens_matched, but it has matched_tokens. I'll review it again in more depth.
With this dataset (It's a bit bigger, but I had to do it for testing)
Notice that if you change from:
"q": "Pistas de pádel",
to
"q": "Pistas de padel",
it no longer happens. It happens with the second document, the one with the field
"externalReference": "0043108-24"
k
Ok I can reproduce with this dataset. Will check.
ó
Sorry about the weird datasets 🤣 But hey, at least we can catch the edge cases. I think it also has to do with accents.
k
I suspect that the second pass text matching logic has limited information on computing the text match score. I will confirm, but in that case, there is not much we can do.
👀 1
ó
In that case, fixing the matched tokens from the highlights, handling the accents myself, can do the trick for me.
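Something along these lines -- stripping diacritics with standard Unicode normalization before deduplicating, just a sketch with my own naming:
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;

public static class AccentInsensitiveTokens
{
    // Strip combining diacritical marks via Unicode form D normalization,
    // so "pádel" and "padel" end up comparing equal.
    public static string StripDiacritics(string token)
    {
        string decomposed = token.Normalize(NormalizationForm.FormD);
        var builder = new StringBuilder(decomposed.Length);
        foreach (char ch in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(ch) != UnicodeCategory.NonSpacingMark)
                builder.Append(ch);
        }
        return builder.ToString().Normalize(NormalizationForm.FormC);
    }

    // Count unique matched tokens from the highlights, ignoring accents and case.
    public static int CountUnique(IEnumerable<string> matchedTokens) =>
        new HashSet<string>(matchedTokens.Select(StripDiacritics),
                            StringComparer.OrdinalIgnoreCase).Count;
}
// AccentInsensitiveTokens.CountUnique(new[] { "pádel", "padel", "pistas" }) == 2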
k
Should be fixed in 28.0.rc23 -- please check and confirm.
ó
Sure! I'll reach back in 2h!
👍 1
Now it works, but it's reporting the matched tokens incorrectly because of the accent. It's not being taken into account.
k
Yes, that's because you're using the es locale, which preserves the diacritics. So the word in the query and the word in the document are treated differently. The fields_matched now matches the exact number of unique highlighted words. This is the best we can do in a second-pass re-ranking.
ó
Maybe that's the key to the problem with locales! Depending on the language, diacritics could mean it's another letter or just the same letter. In most cases, you want to match those words both with and without diacritics. Anyway, that's for another thread.
k
If you don't set an es locale, diacritics are removed.
ó
Do you lose other features? Or does it just affect the diacritics?
k
In some languages those have specific meaning, so we have resorted to not removing them when a specific locale is used.
For Spanish, I think not setting a locale should work out of the box.
ó
Got it, so if it only affects diacritics, I'll do it. I thought it also affected the way you normalize words and things like plurals and so on.
👍 1
Last question, does it affect how stemming works?
k
We use the Snowball stemmer library, so you have to try it out. Maybe the rules are different, in which case, yes, that would affect it.
ó
It does, damn. That's why I got so many inconsistent results. It will match the results but not show them as matched, hence the highlighting will be broken and we can't tell what is matching what.
k
Are you using stemming primarily for handling plurals?
ó
We tried, but not currently. We had to remove it because we got inconsistent results, and this is why. I'll have to test it again now that I know this.
k
If it's primarily plurals, a dictionary-based approach is better. We are working on a feature, available next week, that will help.
ó
I'm mixing things into the original discussion anyway. We are discussing accents and related topics here, in case you want to move the discussion: https://typesense-community.slack.com/archives/C01P749MET0/p1729200863886629
👍 1
That would help a lot!
But it won't be perfect for verbs.
But hey, it will be better anyway.