Phrase Search Relevancy and Weights Fix
TLDR: Jan reported an issue with phrase search relevancy when using the Typesense Instantsearch Adapter. The problem occurred when searching for phrases wrapped in double quotes. The team traced the issue to the field weights and implemented a fix that improved the search results.
Mar 08, 2023 (9 months ago)
Jason
05:02 PM
Jason
05:03 PM
Jan
05:03 PM
Jason
05:04 PM
Jason
05:05 PM
> When I do the same phrase search (using double quotes) in the cloud.typesense.org/clusters/ interface it’s also not returning records as expected. What’s your advice on this?
I misread this as it IS returning correct results in Typesense Cloud, but not in your app
Jason
05:05 PM
Jason
05:06 PM
Jan
05:07 PM
Jan
05:07 PM
Jan
05:08 PM
Jan
05:09 PM
Jason
05:09 PM
Jason
05:10 PM
Jan
05:10 PM
curl '' \
-H 'authority: ' \
-H 'accept: application/json, text/plain, */*' \
-H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8' \
-H 'cache-control: no-cache' \
-H 'content-type: text/plain' \
-H 'origin: https://testapp.digibeetle.eu' \
-H 'pragma: no-cache' \
-H 'referer: https://testapp.digibeetle.eu/' \
-H 'sec-ch-ua: "Chromium";v="110", "Not A(Brand";v="24", "Google Chrome";v="110"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: "macOS"' \
-H 'sec-fetch-dest: empty' \
-H 'sec-fetch-mode: cors' \
-H 'sec-fetch-site: cross-site' \
-H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36' \
--data-raw '{"searches":[{"query_by":"post_title,key_topics,shorthand,full_source_title,labels,relevant_gdpr_recitals,directive_95_46_ec_equivalent,post_content,answered_questions,prelim_qs_referred_or_pleas_in_law,data_categories,data_subject_categories,organisation_focus,sectors,party_a,party_b,party_c,case_law_doc_celex_id,case_law_documents,general_documents","query_by_weights":"200,500,100,50,250,50,50,25,75,75,50,50,50,50,50,50,50,1,1,1","num_typos":0,"highlight_affix_num_tokens":"20","sort_by":"_text_match:desc,post_date:desc","highlight_full_fields":"post_title,key_topics,shorthand,full_source_title,labels,relevant_gdpr_recitals,directive_95_46_ec_equivalent,post_content,answered_questions,prelim_qs_referred_or_pleas_in_law,data_categories,data_subject_categories,organisation_focus,sectors,party_a,party_b,party_c,case_law_doc_celex_id,case_law_documents,general_documents","collection":"sources","q":"\"state of the art\"","facet_by":"key_topics,relevant_gdpr_articles,document_types,sectors,document_categories,document_status,type_of_bcr,competent_supervisory_authority_bcr_lead,case_law_case_status,case_law_case_stage,outcomes_of_the_procedure,type_of_procedure,advocate_general_name,judge_rapporteur,chamber,post_date,source_types,source_abbreviation","max_facet_values":10,"page":1,"per_page":10}]}' \
--compressed
Jan
05:11 PM
Jan
05:11 PM
curl '' \
-H 'authority: ' \
-H 'accept: application/json, text/plain, */*' \
-H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8' \
-H 'cache-control: no-cache' \
-H 'content-type: text/plain' \
-H 'origin: https://testapp.digibeetle.eu' \
-H 'pragma: no-cache' \
-H 'referer: https://testapp.digibeetle.eu/' \
-H 'sec-ch-ua: "Chromium";v="110", "Not A(Brand";v="24", "Google Chrome";v="110"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: "macOS"' \
-H 'sec-fetch-dest: empty' \
-H 'sec-fetch-mode: cors' \
-H 'sec-fetch-site: cross-site' \
-H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36' \
--data-raw '{"searches":[{"query_by":"post_title,key_topics,shorthand,full_source_title,labels,relevant_gdpr_recitals,directive_95_46_ec_equivalent,post_content,answered_questions,prelim_qs_referred_or_pleas_in_law,data_categories,data_subject_categories,organisation_focus,sectors,party_a,party_b,party_c,case_law_doc_celex_id,case_law_documents,general_documents","query_by_weights":"200,500,100,50,250,50,50,25,75,75,50,50,50,50,50,50,50,1,1,1","num_typos":0,"highlight_affix_num_tokens":"20","sort_by":"_text_match:desc,post_date:desc","highlight_full_fields":"post_title,key_topics,shorthand,full_source_title,labels,relevant_gdpr_recitals,directive_95_46_ec_equivalent,post_content,answered_questions,prelim_qs_referred_or_pleas_in_law,data_categories,data_subject_categories,organisation_focus,sectors,party_a,party_b,party_c,case_law_doc_celex_id,case_law_documents,general_documents","collection":"sources","q":"\" state of the art\"","facet_by":"key_topics,relevant_gdpr_articles,document_types,sectors,document_categories,document_status,type_of_bcr,competent_supervisory_authority_bcr_lead,case_law_case_status,case_law_case_stage,outcomes_of_the_procedure,type_of_procedure,advocate_general_name,judge_rapporteur,chamber,post_date,source_types,source_abbreviation","max_facet_values":10,"page":1,"per_page":10}]}' \
--compressed
Jason
05:14 PM
Jan
05:14 PM
Jan
05:20 PM
Jason
05:20 PM
Jan
05:22 PM
Jason
05:25 PM
Jason
05:25 PM
Jason
05:28 PM
"query_by_weights": "127,127,100,50,127,50,50,25,75,75,50,50,50,50,50,50,50,1,1,1",
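A minimal sketch of the weight change shown above, assuming the adjustment was simply to cap each value at 127 (the highest value in the revised list). The helper name `clampWeights` is hypothetical and not part of Typesense or the adapter, and the cap of 127 is inferred from this thread rather than taken from documentation.

```javascript
// Hypothetical helper: caps every weight in a comma-separated
// query_by_weights string at a maximum value. The default of 127 is
// inferred from the revised weights in this thread (an assumption).
function clampWeights(weights, max = 127) {
  return weights
    .split(",")
    .map((w) => Math.min(Number(w.trim()), max))
    .join(",");
}

// The original weights from the first request in the thread.
const originalWeights =
  "200,500,100,50,250,50,50,25,75,75,50,50,50,50,50,50,50,1,1,1";

console.log(clampWeights(originalWeights));
```

Applied to the original weights, this yields the same list as the revised `query_by_weights` value above. The relative order of the three highest fields is lost, since they all end up at 127.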
Jason
05:29 PM
Jan
05:29 PM
Jan
05:29 PM
Jan
05:31 PM
Jan
05:32 PM
Jan
05:33 PM
curl '' \
-H 'authority: ' \
-H 'accept: application/json, text/plain, */*' \
-H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8' \
-H 'cache-control: no-cache' \
-H 'content-type: text/plain' \
-H 'origin: https://testapp.digibeetle.eu' \
-H 'pragma: no-cache' \
-H 'referer: https://testapp.digibeetle.eu/' \
-H 'sec-ch-ua: "Chromium";v="110", "Not A(Brand";v="24", "Google Chrome";v="110"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: "macOS"' \
-H 'sec-fetch-dest: empty' \
-H 'sec-fetch-mode: cors' \
-H 'sec-fetch-site: cross-site' \
-H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36' \
--data-raw '{"searches":[{"query_by":"post_title,key_topics,shorthand,full_source_title,labels,relevant_gdpr_recitals,directive_95_46_ec_equivalent,post_content,answered_questions,prelim_qs_referred_or_pleas_in_law,data_categories,data_subject_categories,organisation_focus,sectors,party_a,party_b,party_c,case_law_doc_celex_id,case_law_documents,general_documents","query_by_weights":"127,127,100,50,127,50,50,25,75,75,50,50,50,50,50,50,50,1,1,1","num_typos":0,"highlight_affix_num_tokens":"20","sort_by":"_text_match:desc,post_date:desc","highlight_full_fields":"post_title,key_topics,shorthand,full_source_title,labels,relevant_gdpr_recitals,directive_95_46_ec_equivalent,post_content,answered_questions,prelim_qs_referred_or_pleas_in_law,data_categories,data_subject_categories,organisation_focus,sectors,party_a,party_b,party_c,case_law_doc_celex_id,case_law_documents,general_documents","collection":"sources","q":"\"state of the art\"","facet_by":"key_topics,relevant_gdpr_articles,document_types,sectors,document_categories,document_status,type_of_bcr,competent_supervisory_authority_bcr_lead,case_law_case_status,case_law_case_stage,outcomes_of_the_procedure,type_of_procedure,advocate_general_name,judge_rapporteur,chamber,post_date,source_types,source_abbreviation","max_facet_values":10,"page":1,"per_page":10}]}' \
--compressed
Jan
05:34 PM
Jason
05:36 PM
Jan
05:39 PM
Jan
05:47 PM
Mar 09, 2023 (9 months ago)
Jan
09:28 AM
Kishore Nallan
12:30 PM
Kishore Nallan
03:53 PM
Mar 10, 2023 (9 months ago)
Kishore Nallan
11:47 AM
Mar 13, 2023 (9 months ago)
Jan
10:20 AM
Kishore Nallan
10:35 AM
Jan
10:51 AM
Kishore Nallan
11:00 AM
Jan
11:02 AM
Kishore Nallan
11:04 AM
Kishore Nallan
11:04 AM
curl '' --data-raw '{"searches":[{"query_by":"case_law_documents","sort_by":"_text_match:desc,post_date:desc","collection":"sources","q":"\"state of the art\"","per_page":10, "highlight_fields": "case_law_documents", "include_fields": "id"}]}' | jq
Kishore Nallan
11:04 AM
Jan
11:07 AM
Jan
11:12 AM
text_match_info.score tells us anything about the relevance of the hits while using a phrase (double quoted) query? Because in the cloud and our app we get scores of 100, while the query with a prepended space in the query gives us text_match_scores that are way higher (e.g. text_match_info.score: 2314894167593451644)
Kishore Nallan
11:16 AM
Jan
11:42 AM
curl 'https://cloud.typesense.org/clusters/ocpdr54qif7a3tb0p/api/multi_search' \
-H 'authority: ' \
-H 'accept: application/json, text/plain, */*' \
-H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8' \
-H 'cache-control: no-cache' \
-H 'content-type: text/plain' \
-H 'cookie: _gcl_aw=GCL.1671042832.CjwKCAiAheacBhB8EiwAItVO23gnt1Leqk8-8BYwLubtO9k8e2FfBfyTot8gfc9tXYWXeegNH8Pf_RoCgaUQAvD_BwE; _gcl_au=1.1.2056632186.1671042832; _gac_UA-116415641-1=1.1671042832.CjwKCAiAheacBhB8EiwAItVO23gnt1Leqk8-8BYwLubtO9k8e2FfBfyTot8gfc9tXYWXeegNH8Pf_RoCgaUQAvD_BwE; __stripe_mid=67095200-c2fc-4b42-889f-8ac824a73822c95db9; _gac_UA-116415641-2=1.1671042832.CjwKCAiAheacBhB8EiwAItVO23gnt1Leqk8-8BYwLubtO9k8e2FfBfyTot8gfc9tXYWXeegNH8Pf_RoCgaUQAvD_BwE; _ga_XTFPJRM8H9=GS1.1.1673438604.3.0.1673438604.0.0.0; _ga=GA1.2.853002192.1671042832; _gid=GA1.2.634655676.1678705359; _dc_gtm_UA-116415641-1=1; __stripe_sid=268480af-be78-45ce-b9d4-c17b80f5cc1eea59c6; _typesense_cloud_app_session=ERv8bWisgxuGfRaMHP%2FIjcXxj753nfkzx%2F0gY3eRs5zbw3eZL8e8DEm%2BpR5pwRxAKxwNPg9611knkD5JAYOKfOcaWH6N5EwjELCFtmf2e1tWYz8goYHna%2FMA1to%2FgTaPnEbal%2BC40i81sGiiKiQorGiKGdUGB0C0sLfYEiI2w1HiCzuPJL4AWmZMYTTIv32pJlADdyu9OY0txz28jDUk41Ac2Z6GWrTj%2FHDy9jFpEPXeGz4x1uA6pfFwdrnJ4c0qNT3wH41%2B8%2FnQ%2BonBVDd%2FfF%2BjbySRHprCttBHh%2BMVMOk87EW7UouFRfkF9yK8nG8akO7wPlhiUSlhc8uyz8cPjEx24jVvQgvq%2B5YbEVyMR6VjfDrc6v8V2G9fpVhmUuOYvnvIww8FafGpBbkJbiQfjzE1CocLXx73wiP%2BPNfL%2Bw5s%2BqkeFrdOyxXd2RkPKDzQs%2F1nVO4y6AD1Ps47ZWTxqQ6IXQHB17tqtbKPFhBL254h%2BM%2Fpv8zKXkBu0LGShzYTxoFwnMOQLKIBptSFzVbYHyiTB8EhGkGjX0a2bpLkKq9m7qBLNRK4YOENw0EbaoVZMDBA4QlMm9I%3D--ndZv%2FxfNntvyU3R%2B--hm%2FIrIn7YLqDjuw58R0%2B3Q%3D%3D' \
-H 'origin: https://cloud.typesense.org' \
-H 'pragma: no-cache' \
-H 'referer: https://cloud.typesense.org/clusters/ocpdr54qif7a3tb0p/collections/sources/documents/search' \
-H 'sec-ch-ua: "Google Chrome";v="111", "Not(A:Brand";v="8", "Chromium";v="111"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: "macOS"' \
-H 'sec-fetch-dest: empty' \
-H 'sec-fetch-mode: cors' \
-H 'sec-fetch-site: same-origin' \
-H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36' \
-H 'x-csrf-token: kAR1Yt8jPXq1PyOwGH6cPu7EL80Qpz7gUOKDHveJrWgtSy-gD1WMDuacdFI06KPdE7xgVq-PSAEzfw37vpHWQA' \
-H 'x-typesense-api-key: iyplt0v67zms5uhdk3w8gxcae4oqbrfn' \
--data-raw '{"searches":[{"preset":"Main preset like testapp","collection":"sources","q":"\"state of the art\"","page":1,"per_page":5}]}' \
--compressed
Kishore Nallan
11:44 AM
Kishore Nallan
01:04 PM
Kishore Nallan
01:05 PM
"general_documents": [
  {
    "matched_tokens": [
      "state",
      "of",
      "the",
      "art",
      "of",
      "the"
    ],
    "snippet": "based on the present <mark>state</mark> <mark>of</mark> <mark>the</mark> <mark>art</mark> <mark>of</mark> <mark>the</mark> Internet, which"
  }
],
We do highlight all tokens appearing in the query in the snippet, but the phrase state of the art does exist.
Jan
02:59 PM
Jan
03:12 PM
Kishore Nallan
03:17 PM
Jan
03:18 PM
Kishore Nallan
03:18 PM
Kishore Nallan
03:19 PM
Kishore Nallan
03:23 PM
Kishore Nallan
03:25 PM
Jan
03:30 PM
query_by_weight in combination with text_match:desc sorting
Jan
03:30 PM
Mar 14, 2023 (9 months ago)
Jan
08:49 AM
First example: There is a record (ID 4580) with an attribute post_title and the value EDPB Guidelines 4/2019. If you query this title without double quotes, we get good results, meaning this is the first hit we see. If you use phrase search, this record shows up as the third result, even though post_title has query_by_weights: 115, which is the third-highest weight that we’ve used. But if you phrase search with a space prepended, like " EDPB Guidelines 4/2019", the results are what we’d expect.
Second example: The same record (ID 4580) has an attribute key_topics with an (array) value containing “state of the art”. This attribute has query_by_weights: 127 (the highest weight we’ve used, and we’ve used this weight only for this attribute). The same record also has an attribute called labels containing an array with the value “state of the art”; this labels attribute has query_by_weights: 120. If you phrase search for "state of the art", this record ends up somewhere around ranking position 50. Again, if you phrase search with a space prepended, like " state of the art", the results are what we’d expect.
Kishore Nallan
08:52 AM
Jan
08:55 AM
Mar 15, 2023 (9 months ago)
Jan
09:08 AM
Kishore Nallan
10:38 AM
Jan
11:22 AM
Mar 21, 2023 (9 months ago)
Kishore Nallan
06:46 AM
Mar 22, 2023 (9 months ago)
Jan
09:11 AM
Jan
09:13 AM
Kishore Nallan
09:53 AM
Jan
11:13 AM
Kishore Nallan
11:37 AM
Jan
12:24 PM
Kishore Nallan
12:33 PM
Jan
02:11 PM
Jan
02:17 PM
Jan
02:19 PM
Jan
02:20 PM
Kishore Nallan
02:28 PM
Kishore Nallan
02:47 PM
Jan
03:27 PM
Kishore Nallan
04:19 PM
symbols_to_index configuration. This results in an empty query string which is treated as a * wildcard search. Those "undefined" values are showing up because we don't return highlights for wildcard searches since the query is essentially empty / catch-all.
Jason
04:41 PM
highlight key; if it doesn’t, then fall back to the field inside the document key…
Mar 23, 2023 (9 months ago)
Jan
10:23 AM
Jason
04:12 PM
Mar 29, 2023 (8 months ago)
Jan
10:43 AM
helper.state.query === '') is related to wildcard searches, it just hides something if the searchbox is empty. A user can still type in special characters and we get the ‘Undefined’ results. I have played around with this helper.state.query before, but it seems only useful in specific cases when you only want to listen to the state of the searchbox. This helper does not account for the use of facets/checkbox-filter states. Anyway, that is something not related to Typesense.
Kishore Nallan
10:51 AM
Jan
10:53 AM
Jan
11:00 AM
q in the payload. So q is not empty when it’s sent to Typesense, right? Typesense will interpret this as an empty q string, but it will return all results from the index… I want to prevent this last thing from happening.
Kishore Nallan
11:02 AM
Jan
11:04 AM
Jan
11:05 AM
& for example, nothing happens -> no payload… what part of the code is preventing this from happening?
Kishore Nallan
11:07 AM
/^[^a-zA-Z0-9]+$/.test(helper.state.query) to check if query string contains only alphanumeric? That can be added along with the empty string check here: https://github.com/typesense/showcase-songs-search/blob/e7ad97ce4e09191743abd727c2dfc949811bbcd6/src/app.js#L176
Jan
11:09 AM
Kishore Nallan
11:13 AM
foo? bar
Kishore Nallan
11:17 AM
/[a-zA-Z0-9]/.test(helper.state.query)
Kishore Nallan
11:18 AM
true if at least one alphanumeric character appears in the query string, which is what we want here.
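The check above could be wrapped in a small guard, sketched here under the assumption that it runs wherever the app decides whether to send a search. `hasSearchableText` is a hypothetical name; only the regular expression comes from the thread.

```javascript
// Hypothetical guard: true only when the query contains at least one
// ASCII letter or digit, so symbol-only input such as "&" or "?!" can
// be skipped instead of being sent on as an effectively empty query.
function hasSearchableText(query) {
  return /[a-zA-Z0-9]/.test(query);
}

// Try a few inputs, including the phrase query from this thread.
for (const q of ["", "&", "?!", "foo? bar", '"state of the art"']) {
  console.log(JSON.stringify(q), hasSearchableText(q));
}
```

Queries made up only of non-ASCII letters would also be rejected by this pattern; a Unicode-aware pattern such as `/[\p{L}\p{N}]/u` is one possible alternative if that matters for the index.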