Óscar Vicente
08/08/2024, 9:11 AMKishore Nallan
08/08/2024, 9:56 AM
text_match_info object in the response that describes some additional information, like the number of fields matched, which will offer an explanation.
Óscar Vicente
08/08/2024, 9:58 AMKishore Nallan
08/08/2024, 10:04 AMÓscar Vicente
08/08/2024, 10:05 AMKishore Nallan
08/08/2024, 10:06 AMÓscar Vicente
08/08/2024, 10:07 AMKishore Nallan
08/08/2024, 10:13 AMÓscar Vicente
08/08/2024, 10:15 AMKishore Nallan
08/08/2024, 10:16 AM
text_match_info
Óscar Vicente
08/08/2024, 10:17 AMKishore Nallan
08/08/2024, 10:19 AMÓscar Vicente
08/08/2024, 10:26 AMKishore Nallan
08/08/2024, 10:29 AMÓscar Vicente
08/08/2024, 10:29 AMÓscar Vicente
08/12/2024, 3:20 PM
{
  "document": {
    "title": "Mejoras en la pista de pádel del Casal"
  },
  "highlight": {
    "title": {
      "matched_tokens": [
        "pista",
        "pádel"
      ],
      "snippet": "Mejoras en la <mark>pista</mark> de <mark>pádel</mark> del Casal",
      "value": "Mejoras en la <mark>pista</mark> de <mark>pádel</mark> del Casal"
    }
  },
  "highlights": [
    {
      "field": "title",
      "matched_tokens": [
        "pista",
        "pádel"
      ],
      "snippet": "Mejoras en la <mark>pista</mark> de <mark>pádel</mark> del Casal",
      "value": "Mejoras en la <mark>pista</mark> de <mark>pádel</mark> del Casal"
    }
  ],
  "hybrid_search_info": {
    "rank_fusion_score": 0.6666666865348816
  },
  "text_match": 1042983595,
  "text_match_info": {
    "best_field_score": "509269",
    "best_field_weight": 85,
    "fields_matched": 3,
    "num_tokens_dropped": 2,
    "score": "1042983595",
    "tokens_matched": 0,
    "typo_prefix_score": 255
  },
  "vector_distance": 0.5202423930168152
}
Kishore Nallan
08/12/2024, 3:24 PMÓscar Vicente
08/13/2024, 8:02 AM
"q": "pistas de pádel",
"query_by": "title,externalReference,contractingPartyName,locationNutsName,locationNutsCode",
That's why I have highlights with marks and matched_tokens
Kishore Nallan
08/13/2024, 8:03 AMÓscar Vicente
08/13/2024, 8:05 AMÓscar Vicente
08/13/2024, 8:05 AMKishore Nallan
08/13/2024, 11:20 AMÓscar Vicente
08/13/2024, 11:39 AMKishore Nallan
08/13/2024, 11:41 AMÓscar Vicente
08/13/2024, 11:41 AMKishore Nallan
08/13/2024, 11:41 AM
But when the vector_distance is null, I don't get the rank_fusion_score.
The rank_fusion_score is 0.5, right?
Óscar Vicente
08/13/2024, 11:42 AMKishore Nallan
08/13/2024, 11:43 AMÓscar Vicente
08/13/2024, 11:53 AMKishore Nallan
08/13/2024, 11:58 AM
Like if it's found through vector search it won't apply keyword search and the other way around.
That's the purpose of the rank fusion score method. If a record is found by both methods, it will get a higher score than otherwise.
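A minimal sketch of how a rank fusion score of this kind could behave. It assumes (the thread does not confirm the formula) that each method contributes the reciprocal of the record's 1-based rank in its own result list, weighted by alpha for vector search and 1 - alpha for keyword search. The function name and signature are hypothetical.

```python
from typing import Optional


def rank_fusion_score(keyword_rank: Optional[int],
                      vector_rank: Optional[int],
                      alpha: float = 0.5) -> float:
    """Hypothetical reciprocal-rank fusion; ranks are 1-based, None = not found."""
    # A method that did not return the record contributes nothing.
    keyword_part = (1.0 - alpha) / keyword_rank if keyword_rank else 0.0
    vector_part = alpha / vector_rank if vector_rank else 0.0
    return keyword_part + vector_part


# Top hit of a single method: cannot exceed 0.5 when alpha is 0.5.
only_vector = rank_fusion_score(None, 1)
# Found by both methods (keyword rank 1, vector rank 3): scores higher.
both = rank_fusion_score(1, 3)
print(only_vector, round(both, 4))
```

Under this assumption, a record returned by only one method tops out at 0.5, while a record returned by both always adds a second positive term.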
Kishore Nallan
08/13/2024, 11:59 AMÓscar Vicente
08/13/2024, 12:15 PMKishore Nallan
08/13/2024, 12:16 PMÓscar Vicente
08/13/2024, 12:27 PMKishore Nallan
08/14/2024, 11:05 AMÓscar Vicente
08/14/2024, 11:05 AMÓscar Vicente
08/14/2024, 2:43 PM
{
"name": "test",
"fields": [
{
"name": "embedding",
"type": "float[]",
"facet": false,
"optional": false,
"index": true,
"sort": false,
"infix": false,
"locale": "",
"hnsw_params": {
"M": 16,
"ef_construction": 200
},
"num_dim": 1024,
"stem": false,
"store": true,
"vec_dist": "cosine"
},
{
"name": "externalStatusCode",
"type": "string",
"facet": false,
"optional": false,
"index": true,
"sort": false,
"infix": false,
"locale": "",
"stem": false,
"store": true
},
{
"name": "cpv",
"type": "string[]",
"facet": false,
"optional": false,
"index": true,
"sort": false,
"infix": false,
"locale": "",
"stem": false,
"store": true
},
{
"name": "contractingPartyName",
"type": "string",
"facet": false,
"optional": false,
"index": true,
"sort": false,
"infix": false,
"locale": "es",
"stem": false,
"store": true
},
{
"name": "externalReference",
"type": "string",
"facet": false,
"optional": true,
"index": true,
"sort": false,
"infix": false,
"locale": "es",
"stem": false,
"store": true
},
{
"name": "title",
"type": "string",
"facet": false,
"optional": false,
"index": true,
"sort": true,
"infix": true,
"locale": "es",
"stem": true,
"store": true
}
],
"default_sorting_field": "",
"enable_nested_fields": false,
"symbols_to_index": [],
"token_separators": []
}
I no longer understand what's happening in the image, but here's the query, removing the embedding (text-embedding-3-large) (same for both runs).
{
  "searches": [
    {
      "collection": "test",
      "vector_query": "embedding:([....],flat_search_cutoff:0,alpha:0.5,ef:128)",
      "q": "Pistas de pádel",
      "query_by": "title,externalReference,contractingPartyName",
      "prefix": false,
      "query_by_weights": "5,10,5",
      "sort_by": "_text_match:desc",
      "prioritize_exact_match": true,
      "prioritize_token_position": true,
      "page": 0,
      "per_page": 100,
      "highlight_full_fields": "title",
      "exhaustive_search": false,
      "num_typos": 0,
      "typo_tokens_threshold": 0,
      "drop_tokens_threshold": 0
    }
  ]
}
Helena Merk
08/16/2024, 11:23 PMKishore Nallan
08/17/2024, 5:01 PM
"vector_query":"embedding:([....]
Óscar Vicente
08/19/2024, 7:39 AMÓscar Vicente
08/19/2024, 7:39 AMKishore Nallan
08/20/2024, 12:31 PM
Cimentacion document as example. It does not contain the word pádel, so it does not match in keyword search. Nevertheless, we highlight any words in the query present in the document, even if that document was found via vector search. This is why tokens_matched is also 0.
Look into drop_tokens_threshold if you want keyword search to match partial words from the query.
Óscar Vicente
08/21/2024, 7:27 AMKishore Nallan
08/21/2024, 8:14 AM
matched_tokens are populated by the highlight logic that runs, so the definitive indicator of whether the record was found via keyword search is the tokens_matched integer value. When it's 0, that record was only found via vector search.
During highlighting, we will add any tokens in the query that are found in the field values to matched_tokens -- this does not mean that the record was found via keyword search.
Kishore Nallan
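A small sketch of the rule described in the message above: decide how a hit was found from text_match_info.tokens_matched and vector_distance, and ignore the highlight's matched_tokens. The helper name found_by and the trimmed sample hit are illustrative only.

```python
from typing import Any, Dict


def found_by(hit: Dict[str, Any]) -> str:
    """Classify a hit by tokens_matched and vector_distance, not by highlights."""
    tokens_matched = hit.get("text_match_info", {}).get("tokens_matched", 0)
    keyword = tokens_matched > 0
    vector = hit.get("vector_distance") is not None
    if keyword and vector:
        return "both"
    if keyword:
        return "keyword"
    if vector:
        return "vector"
    return "unknown"


# Trimmed hit: highlights list matched_tokens, yet tokens_matched is 0.
sample_hit = {
    "highlights": [{"field": "title", "matched_tokens": ["pista", "pádel"]}],
    "text_match_info": {"tokens_matched": 0},
    "vector_distance": 0.5202423930168152,
}
print(found_by(sample_hit))
```

The highlight entries are deliberately not consulted, since they only say that query words appear somewhere in the field values.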
08/21/2024, 8:19 AM
All of these rows should have been found by both algorithms, thus their score should have been way higher.
If you tweak the drop_tokens_threshold value, you will let keyword search find these partial matches. It does not do so because it found enough matches that had all the query tokens.
Likewise with vector search: if you increase k, those records with vector_distance null could show up. Again, there were vectors with smaller distances ahead of them.
Kishore Nallan
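A sketch of the two knobs mentioned in the message above, written as a Python builder for the multi-search payload. The values k=200 and drop_tokens_threshold=1 are illustrative assumptions, not recommendations from the thread; the "[....]" placeholder stands for the elided embedding and is kept as-is. The helper name build_search is hypothetical.

```python
import json
from typing import Any, Dict


def build_search(q: str, k: int, drop_tokens_threshold: int) -> Dict[str, Any]:
    """Build one search entry with a wider vector k and token dropping enabled."""
    return {
        "collection": "test",
        "q": q,
        "query_by": "title,externalReference,contractingPartyName",
        # "[....]" is the elided embedding from the thread, not a real vector.
        "vector_query": f"embedding:([....],alpha:0.5,k:{k})",
        # A value above 0 lets keyword search drop tokens to find partial matches.
        "drop_tokens_threshold": drop_tokens_threshold,
    }


payload = {"searches": [build_search("Pistas de pádel", k=200, drop_tokens_threshold=1)]}
print(json.dumps(payload, ensure_ascii=False, indent=2))
```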
08/21/2024, 8:25 AM
But given the algorithm for the rank_fusion_score using an alpha of 0.5, the first row has also been found with the keyword search.
I suspect that the 0.5056718 value is a float precision issue. When I run the sample query you gave, I see only 0.5 exactly.
Óscar Vicente
08/21/2024, 8:48 AMKishore Nallan
08/21/2024, 8:54 AM
But none of them are partial matches
Does this happen in the smaller set you shared? At least for the query you've shared, when I checked the hits with tokens_matched: 0 they were all partial matches.
Óscar Vicente
08/21/2024, 8:56 AMKishore Nallan
08/21/2024, 8:57 AMÓscar Vicente
08/21/2024, 8:58 AMKishore Nallan
08/21/2024, 8:58 AM
Pistas de pádel but in the screenshot, the Construccion word is highlighted, so that is a different query?
Óscar Vicente
08/21/2024, 8:58 AMÓscar Vicente
08/21/2024, 9:00 AMÓscar Vicente
08/21/2024, 9:02 AMÓscar Vicente
08/21/2024, 9:03 AMKishore Nallan
08/21/2024, 9:07 AMKishore Nallan
08/21/2024, 9:07 AM
found when you do only keyword search?
Óscar Vicente
08/21/2024, 9:24 AMKishore Nallan
08/21/2024, 9:27 AMÓscar Vicente
08/21/2024, 9:30 AMKishore Nallan
08/21/2024, 9:47 AMÓscar Vicente
08/21/2024, 11:22 AMLeon Wolf
08/22/2024, 4:37 PMKishore Nallan
08/22/2024, 5:13 PMÓscar Vicente
08/23/2024, 7:10 AM
matchedTokens / queryTokens * 0.6 + (vector_distance ?? 0) * 0.4
Since both the division and the vector distance are within the 0-1 range and are representative of how good the match is. For matched tokens, my workaround for now is to take all the unique matched tokens from the highlight array. In C#, I'm using a HashSet to do it. For vector distance, if it's null I'll either take a 0 or get the embedding and calculate the similarity to give the score. The only problem is that I can't use this value for truly sorting the source list until the two-pass approach is ready, but it's a good enough solution for now.
Óscar Vicente
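A rough Python sketch of the client-side score described above, with the 0.6 and 0.4 weights taken from the message. Two points are assumptions rather than something the thread states: the token part is taken as matched tokens divided by query tokens so that it stays within 0-1, and the vector part uses 1 - distance (clamped to 0-1) so that a higher value means a better match. All function names are hypothetical.

```python
from typing import Any, Dict, List, Optional, Set


def unique_matched_tokens(highlights: List[Dict[str, Any]]) -> Set[str]:
    """Collect distinct matched tokens across all highlighted fields."""
    return {tok.lower() for h in highlights for tok in h.get("matched_tokens", [])}


def custom_score(query_tokens: List[str],
                 highlights: List[Dict[str, Any]],
                 vector_distance: Optional[float]) -> float:
    matched = unique_matched_tokens(highlights)
    # Assumption: matched / query keeps the ratio inside the 0-1 range.
    ratio = min(len(matched) / len(query_tokens), 1.0) if query_tokens else 0.0
    # Assumption: convert a distance (lower is better) into a 0-1 similarity.
    if vector_distance is None:
        similarity = 0.0
    else:
        similarity = min(max(1.0 - vector_distance, 0.0), 1.0)
    return ratio * 0.6 + similarity * 0.4


highlights = [{"field": "title", "matched_tokens": ["pista", "pádel"]}]
score = custom_score(["pistas", "de", "pádel"], highlights, 0.5202423930168152)
print(round(score, 4))
```

A set plays the role of the C# HashSet here: repeated tokens such as a "de" highlighted in several fields are counted once.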
11/05/2024, 12:44 PMKishore Nallan
11/05/2024, 12:50 PM
rerank_hybrid_matches boolean flag in 28.0.rc16 -- please try it out and let me know how it works.
Óscar Vicente
11/05/2024, 3:12 PM
"rerank_hybrid_matches ": true
but it still shows the previous behavior of missing the VectorDistance field and having "tokens_matched": 0
Kishore Nallan
11/05/2024, 3:16 PMÓscar Vicente
11/05/2024, 3:18 PMKishore Nallan
11/05/2024, 3:19 PMÓscar Vicente
11/05/2024, 3:19 PMKishore Nallan
11/05/2024, 3:20 PMÓscar Vicente
11/05/2024, 3:21 PMKishore Nallan
11/05/2024, 3:21 PMÓscar Vicente
11/05/2024, 3:21 PMÓscar Vicente
11/05/2024, 3:25 PMKishore Nallan
11/05/2024, 3:32 PMÓscar Vicente
11/05/2024, 3:33 PMKishore Nallan
11/05/2024, 3:34 PM
_text_match at the moment.
Kishore Nallan
11/05/2024, 3:49 PMÓscar Vicente
11/05/2024, 3:49 PMKishore Nallan
11/05/2024, 3:53 PMÓscar Vicente
11/05/2024, 4:39 PMÓscar Vicente
11/07/2024, 9:00 AMÓscar Vicente
11/07/2024, 9:12 AMKishore Nallan
11/07/2024, 10:43 AMÓscar Vicente
11/07/2024, 10:48 AMKishore Nallan
11/07/2024, 10:57 AM
vector_distance score. However, some do indeed have "tokens_matched": 0 -- I will check why that's so, but all of them have vector distance populated now.
Óscar Vicente
11/07/2024, 11:00 AMÓscar Vicente
11/07/2024, 11:02 AM
"rerank_hybrid_matches ": true,
Instead of:
"rerank_hybrid_matches": true,
Kishore Nallan
11/07/2024, 11:03 AMÓscar Vicente
11/07/2024, 11:03 AMKishore Nallan
11/07/2024, 11:03 AMÓscar Vicente
11/07/2024, 11:03 AMKishore Nallan
11/07/2024, 1:02 PMÓscar Vicente
11/07/2024, 3:11 PMÓscar Vicente
11/08/2024, 4:38 PMKishore Nallan
11/08/2024, 4:39 PMKishore Nallan
11/08/2024, 4:39 PMÓscar Vicente
11/08/2024, 4:39 PMKishore Nallan
11/08/2024, 4:39 PMÓscar Vicente
11/08/2024, 4:40 PMKishore Nallan
11/08/2024, 4:40 PM
use_cache parameter, nothing gets cached. Even use_cache only caches for 60s by default.
Óscar Vicente
11/08/2024, 4:54 PM
{
"document": {
"awardedProposalAmounts": [],
"awardedProposalPartyNames": [],
"awardedProposalPartyVatIds": [],
"contractTypeId": "3",
"contractingPartyId": [
"31604070204167",
"P5030300G"
],
"contractingPartyIdHierarchy": [
"P5030300G",
"",
"",
"",
"",
""
],
"contractingPartyName": "Consejería de Urbanismo, Infraestructuras, Energía y Vivienda del Ayuntamiento de Zaragoza",
"contractingPartyNameHierarchy": [
"Consejería de Urbanismo, Infraestructuras, Energía y Vivienda del Ayuntamiento de Zaragoza",
"Zaragoza",
"Ayuntamientos",
"Aragón",
"ENTIDADES LOCALES",
"Sector Público"
],
"contractingSystemTypeId": "0",
"cpv": [
"45212210",
"45212200"
],
"documentsCount": 7,
"externalReference": "0043108-24",
"externalStatusCode": "EV",
"id": "licitacionesPerfilContratante/15630264",
"internalId": "7e97cd55-4e93-5b6a-bb72-b93e2fa9b991",
"link": "https://contrataciondelestado.es/wps/poc?uri=deeplink:detalle_licitacion&idEvl=WQALDEBXv%2BA2wEhQbcAqug%3D%3D",
"locationNutsCode": "ES243",
"locationNutsName": "Zaragoza",
"lotsCount": 0,
"parentsIds": [],
"parentsNames": [
"Zaragoza",
"Ayuntamientos",
"Zaragoza",
"Aragón",
"ENTIDADES LOCALES",
"Sector Público"
],
"procedureTypeId": "9",
"projectBudgetWithTaxes": 316896.36,
"projectBudgetWithoutTaxes": 261897.82,
"projectPlannedPeriodDuration": 5,
"projectPlannedPeriodDurationUnitCode": "MON",
"projectPlannedPeriodEndDate": -62135596800,
"projectPlannedPeriodStartDate": -62135596800,
"tenderSubmissionEndDateTime": 1727701140,
"title": "Dos pistas de padel cubiertas en el Barrio de Casetas. Zaragoza. Convenio DPZ.",
"updated": 1729253205
},
"highlight": {
"contractingPartyName": {
"matched_tokens": [
"de",
"de"
],
"snippet": "Consejería <mark>de</mark> Urbanismo, Infraestructuras, Energía y Vivienda del Ayuntamiento <mark>de</mark> Zaragoza"
},
"title": {
"matched_tokens": [
"pistas",
"de",
"de"
],
"snippet": "Dos <mark>pistas</mark> <mark>de</mark> padel cubiertas en el Barrio <mark>de</mark> Casetas. Zaragoza. Convenio DPZ.",
"value": "Dos <mark>pistas</mark> <mark>de</mark> padel cubiertas en el Barrio <mark>de</mark> Casetas. Zaragoza. Convenio DPZ."
}
},
"highlights": [
{
"field": "title",
"matched_tokens": [
"pistas",
"de",
"de"
],
"snippet": "Dos <mark>pistas</mark> <mark>de</mark> padel cubiertas en el Barrio <mark>de</mark> Casetas. Zaragoza. Convenio DPZ.",
"value": "Dos <mark>pistas</mark> <mark>de</mark> padel cubiertas en el Barrio <mark>de</mark> Casetas. Zaragoza. Convenio DPZ."
},
{
"field": "contractingPartyName",
"matched_tokens": [
"de",
"de"
],
"snippet": "Consejería <mark>de</mark> Urbanismo, Infraestructuras, Energía y Vivienda del Ayuntamiento <mark>de</mark> Zaragoza"
}
],
"hybrid_search_info": {
"rank_fusion_score": 0.5032680034637451
},
"text_match": 3315704463360,
"text_match_info": {
"best_field_score": "1618996320",
"best_field_weight": 0,
"fields_matched": 0,
"num_tokens_dropped": 3,
"score": "3315704463360",
"tokens_matched": 0,
"typo_prefix_score": 159
},
"vector_distance": 0.4084985852241516
}
You can see there are matched_tokens, but "tokens_matched" is 0. It only happens if the query contains an accent, or at least that's how it seems:
"q": "Pistas de p\u00E1del"
If I remove the accent everything has tokens_matched, but the results are wildly different.
"q": "Pistas de padel"
Kishore Nallan
11/08/2024, 4:56 PMKishore Nallan
11/09/2024, 1:57 PMÓscar Vicente
11/18/2024, 9:18 AMÓscar Vicente
11/18/2024, 9:18 AMÓscar Vicente
11/22/2024, 3:21 PMKishore Nallan
11/25/2024, 8:23 AMKishore Nallan
11/27/2024, 7:03 AM
"tokens_matched": 0
Óscar Vicente
11/27/2024, 7:58 AMÓscar Vicente
11/27/2024, 8:26 AMÓscar Vicente
11/27/2024, 8:26 AM
"q": "Pistas de pádel", to "q": "Pistas de padel", it no longer happens. It happens with the second document, the one with the field "externalReference": "0043108-24"
Kishore Nallan
11/27/2024, 10:22 AMÓscar Vicente
11/27/2024, 10:23 AMKishore Nallan
11/27/2024, 10:26 AMÓscar Vicente
11/27/2024, 11:11 AMKishore Nallan
11/28/2024, 8:28 AM
28.0.rc23 -- please check and confirm.
Óscar Vicente
11/28/2024, 8:28 AMÓscar Vicente
11/28/2024, 4:14 PMKishore Nallan
11/29/2024, 2:35 AM
es locale which preserves the diacritics. So the words in the query and in the document are treated differently. The fields_matched now matches the exact number of unique highlighted words. This is the best we can do in a second-pass re-ranking.
Óscar Vicente
11/29/2024, 8:21 AMKishore Nallan
11/29/2024, 8:22 AM
es locale, diacritics are removed.
Óscar Vicente
11/29/2024, 8:23 AMKishore Nallan
11/29/2024, 8:23 AMKishore Nallan
11/29/2024, 8:23 AMÓscar Vicente
11/29/2024, 8:25 AMÓscar Vicente
11/29/2024, 8:35 AMKishore Nallan
11/29/2024, 8:54 AMÓscar Vicente
11/29/2024, 9:35 AMKishore Nallan
11/29/2024, 9:35 AMÓscar Vicente
11/29/2024, 9:37 AMKishore Nallan
11/29/2024, 9:37 AMÓscar Vicente
11/29/2024, 9:37 AMÓscar Vicente
11/29/2024, 9:38 AMÓscar Vicente
11/29/2024, 9:38 AMÓscar Vicente
11/29/2024, 9:39 AM