Hi there I m trying to make the phrase match work with doubl typesense #community-help


Jan Willem Hoogstraten

03/02/2023, 2:57 PM

Hi there! I’m trying to make the phrase match work with double quotes. We are using Typesense Instantsearch Adapter but it doesn’t work as expected. It’s not showing records that actually DO contain the exact phrase, it only works as expected when the phrase match is matching the complete value of an attribute. When I do the same phrase search (using double quotes) in the cloud.typesense.org/clusters/ interface it’s also not returning records as expected. What’s your advice on this? Update: if I prefix the token with a space like this

" this is my phrase"

then the results are showing like expected.

Jason Bosco

03/02/2023, 4:39 PM

Hmm, you can see the difference in queries sent to Typesense by looking at the network requests in your dev console in both Typesense Cloud and in your app…

Jason Bosco

03/02/2023, 4:39 PM

Could you post screenshots of the search parameters being sent in the network requests, and I can help explain the difference

Jan Willem Hoogstraten

03/08/2023, 11:37 AM

Extremely sorry for my late reply. I’m back on this project again and will reply right away on your follow ups! Do you mean the Network tab > multi_search request > response tab?

Jan Willem Hoogstraten

03/08/2023, 11:45 AM

Query with “state of the art” between quotes


"request_params":{"collection_name":"sources","per_page":10,"q":"\"state of the art\""},

Query with a space prepended, like “ state of the art”


"request_params":{"collection_name":"sources","per_page":10,"q":"\" state of the art\""}

Jason Bosco

03/08/2023, 4:15 PM

Do you mean the Network tab > multi_search request > response tab?

No, Network tab > multi_search request > Payload tab

Jan Willem Hoogstraten

03/08/2023, 5:01 PM

Query with “state of the art” between quotes

Jason Bosco

03/08/2023, 5:02 PM

Actually, could you copy-paste that here? That’ll be easier

Jan Willem Hoogstraten

03/08/2023, 5:02 PM


{
  "searches": [
    {
      "query_by": "post_title,key_topics,shorthand,full_source_title,labels,relevant_gdpr_recitals,directive_95_46_ec_equivalent,post_content,answered_questions,prelim_qs_referred_or_pleas_in_law,data_categories,data_subject_categories,organisation_focus,sectors,party_a,party_b,party_c,case_law_doc_celex_id,case_law_documents,general_documents",
      "query_by_weights": "200,500,100,50,250,50,50,25,75,75,50,50,50,50,50,50,50,1,1,1",
      "num_typos": 0,
      "highlight_affix_num_tokens": "20",
      "sort_by": "_text_match:desc,post_date:desc",
      "highlight_full_fields": "post_title,key_topics,shorthand,full_source_title,labels,relevant_gdpr_recitals,directive_95_46_ec_equivalent,post_content,answered_questions,prelim_qs_referred_or_pleas_in_law,data_categories,data_subject_categories,organisation_focus,sectors,party_a,party_b,party_c,case_law_doc_celex_id,case_law_documents,general_documents",
      "collection": "sources",
      "q": "\"state of the art\"",
      "facet_by": "key_topics,relevant_gdpr_articles,document_types,sectors,document_categories,document_status,type_of_bcr,competent_supervisory_authority_bcr_lead,case_law_case_status,case_law_case_stage,outcomes_of_the_procedure,type_of_procedure,advocate_general_name,judge_rapporteur,chamber,post_date,source_types,source_abbreviation",
      "max_facet_values": 10,
      "page": 1,
      "per_page": 10
    }
  ]
}

Jason Bosco

03/08/2023, 5:02 PM

And this is in Typesense Cloud?

Jan Willem Hoogstraten

03/08/2023, 5:02 PM

Jason Bosco

03/08/2023, 5:02 PM

from your app, ok cool

Jason Bosco

03/08/2023, 5:03 PM

Could you also run the same search from Typesense Cloud search UI and paste the payload?

Jan Willem Hoogstraten

03/08/2023, 5:03 PM

With the same query_by and facet settings ?

Jason Bosco

03/08/2023, 5:04 PM

Yup

Jason Bosco

03/08/2023, 5:05 PM

Oh wait, I just noticed I misread your original question:

When I do the same phrase search (using double quotes) in the cloud.typesense.org/clusters/ interface it’s also not returning records as expected. What’s your advice on this?

I misread this as it IS returning correct results in Typesense Cloud, but not in your app

Jason Bosco

03/08/2023, 5:05 PM

So that’s why I had asked you to send me the payload sent by your app vs Typesense Cloud

Jason Bosco

03/08/2023, 5:06 PM

Sorry about the confusion. Could you right click on the network request in the browser dev console, click on copy-as-curl and paste that curl command here?

Jan Willem Hoogstraten

03/08/2023, 5:07 PM

Yes the same results in the app as on cloud

Jan Willem Hoogstraten

03/08/2023, 5:07 PM

Yes I will grab that CURL

Jan Willem Hoogstraten

03/08/2023, 5:08 PM

Sorry just to be clear: the cloud and the app are showing the same results.

Jan Willem Hoogstraten

03/08/2023, 5:09 PM

It’s just that the relevance of the results is off when using “double quotes”. But if you run the same query with a prepended space, like “ double quotes”, then we see results that are as expected and relevant.

Jason Bosco

03/08/2023, 5:09 PM

Yup, understood

Jason Bosco

03/08/2023, 5:10 PM

If you can generate the curl command from the network request sent by your app, I can then take a closer look

Jan Willem Hoogstraten

03/08/2023, 5:10 PM


curl 'https://ocpdr54qif7a3tb0p-1.a1.typesense.net/multi_search?x-typesense-api-key=meSE8pinMxwFJlECsmTRVMLCbrrzoL2R' \
  -H 'authority: ocpdr54qif7a3tb0p-1.a1.typesense.net' \
  -H 'accept: application/json, text/plain, */*' \
  -H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8' \
  -H 'cache-control: no-cache' \
  -H 'content-type: text/plain' \
  -H 'origin: https://testapp.digibeetle.eu' \
  -H 'pragma: no-cache' \
  -H 'referer: https://testapp.digibeetle.eu/' \
  -H 'sec-ch-ua: "Chromium";v="110", "Not A(Brand";v="24", "Google Chrome";v="110"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "macOS"' \
  -H 'sec-fetch-dest: empty' \
  -H 'sec-fetch-mode: cors' \
  -H 'sec-fetch-site: cross-site' \
  -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36' \
  --data-raw '{"searches":[{"query_by":"post_title,key_topics,shorthand,full_source_title,labels,relevant_gdpr_recitals,directive_95_46_ec_equivalent,post_content,answered_questions,prelim_qs_referred_or_pleas_in_law,data_categories,data_subject_categories,organisation_focus,sectors,party_a,party_b,party_c,case_law_doc_celex_id,case_law_documents,general_documents","query_by_weights":"200,500,100,50,250,50,50,25,75,75,50,50,50,50,50,50,50,1,1,1","num_typos":0,"highlight_affix_num_tokens":"20","sort_by":"_text_match:desc,post_date:desc","highlight_full_fields":"post_title,key_topics,shorthand,full_source_title,labels,relevant_gdpr_recitals,directive_95_46_ec_equivalent,post_content,answered_questions,prelim_qs_referred_or_pleas_in_law,data_categories,data_subject_categories,organisation_focus,sectors,party_a,party_b,party_c,case_law_doc_celex_id,case_law_documents,general_documents","collection":"sources","q":"\"state of the art\"","facet_by":"key_topics,relevant_gdpr_articles,document_types,sectors,document_categories,document_status,type_of_bcr,competent_supervisory_authority_bcr_lead,case_law_case_status,case_law_case_stage,outcomes_of_the_procedure,type_of_procedure,advocate_general_name,judge_rapporteur,chamber,post_date,source_types,source_abbreviation","max_facet_values":10,"page":1,"per_page":10}]}' \
  --compressed

Jan Willem Hoogstraten

03/08/2023, 5:11 PM

And this is the query with a prepended space, like “ state of the art”:

Jan Willem Hoogstraten

03/08/2023, 5:11 PM


curl 'https://ocpdr54qif7a3tb0p-1.a1.typesense.net/multi_search?x-typesense-api-key=meSE8pinMxwFJlECsmTRVMLCbrrzoL2R' \
  -H 'authority: ocpdr54qif7a3tb0p-1.a1.typesense.net' \
  -H 'accept: application/json, text/plain, */*' \
  -H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8' \
  -H 'cache-control: no-cache' \
  -H 'content-type: text/plain' \
  -H 'origin: https://testapp.digibeetle.eu' \
  -H 'pragma: no-cache' \
  -H 'referer: https://testapp.digibeetle.eu/' \
  -H 'sec-ch-ua: "Chromium";v="110", "Not A(Brand";v="24", "Google Chrome";v="110"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "macOS"' \
  -H 'sec-fetch-dest: empty' \
  -H 'sec-fetch-mode: cors' \
  -H 'sec-fetch-site: cross-site' \
  -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36' \
  --data-raw '{"searches":[{"query_by":"post_title,key_topics,shorthand,full_source_title,labels,relevant_gdpr_recitals,directive_95_46_ec_equivalent,post_content,answered_questions,prelim_qs_referred_or_pleas_in_law,data_categories,data_subject_categories,organisation_focus,sectors,party_a,party_b,party_c,case_law_doc_celex_id,case_law_documents,general_documents","query_by_weights":"200,500,100,50,250,50,50,25,75,75,50,50,50,50,50,50,50,1,1,1","num_typos":0,"highlight_affix_num_tokens":"20","sort_by":"_text_match:desc,post_date:desc","highlight_full_fields":"post_title,key_topics,shorthand,full_source_title,labels,relevant_gdpr_recitals,directive_95_46_ec_equivalent,post_content,answered_questions,prelim_qs_referred_or_pleas_in_law,data_categories,data_subject_categories,organisation_focus,sectors,party_a,party_b,party_c,case_law_doc_celex_id,case_law_documents,general_documents","collection":"sources","q":"\" state of the art\"","facet_by":"key_topics,relevant_gdpr_articles,document_types,sectors,document_categories,document_status,type_of_bcr,competent_supervisory_authority_bcr_lead,case_law_case_status,case_law_case_stage,outcomes_of_the_procedure,type_of_procedure,advocate_general_name,judge_rapporteur,chamber,post_date,source_types,source_abbreviation","max_facet_values":10,"page":1,"per_page":10}]}' \
  --compressed

👍 1

Jason Bosco

03/08/2023, 5:14 PM

Looks like you’re running 0.23.1. Can we try upgrading you to the latest version to see if some of the fixes we have there help with your dataset?

Jan Willem Hoogstraten

03/08/2023, 5:14 PM

yes please

👍 1

Jan Willem Hoogstraten

03/08/2023, 5:20 PM

Do we need to re-sync the collection index?

Jason Bosco

03/08/2023, 5:20 PM

No, not necessary

👍 1

Jan Willem Hoogstraten

03/08/2023, 5:22 PM

Should we test again?

Jason Bosco

03/08/2023, 5:25 PM

I was just testing after the upgrade… Looks like the issue still persists

Jason Bosco

03/08/2023, 5:25 PM

Taking a closer look

Jason Bosco

03/08/2023, 5:28 PM

Could you try setting the weights to this:


"query_by_weights": "127,127,100,50,127,50,50,25,75,75,50,50,50,50,50,50,50,1,1,1",

Jason Bosco

03/08/2023, 5:29 PM

Weights can only go up to a max of 127, beyond that it causes overflow and I wonder if that’s causing issues
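As a rough illustration of the limit mentioned above, the weights string could be capped before it is sent. This is only a sketch: `clampQueryByWeights` and `MAX_WEIGHT` are hypothetical names, not part of Typesense or the Instantsearch adapter, and the value 127 is taken from the message above.

```javascript
// Hypothetical helper, not part of any Typesense library.
// 127 is the maximum weight mentioned in the thread.
const MAX_WEIGHT = 127;

// Takes a comma-separated weights string and caps every entry at `max`.
function clampQueryByWeights(weights, max = MAX_WEIGHT) {
  return weights
    .split(",")
    .map((raw) => {
      const value = Number.parseInt(raw.trim(), 10);
      if (Number.isNaN(value)) {
        throw new Error(`Invalid weight: "${raw}"`);
      }
      return Math.min(value, max);
    })
    .join(",");
}

console.log(clampQueryByWeights("200,500,100,50,250")); // "127,127,100,50,127"
```

Note that capping makes every weight above the limit identical, so the relative ordering between those fields is lost. Choosing distinct weights below the limit by hand may be preferable.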

Jan Willem Hoogstraten

03/08/2023, 5:29 PM

ah alright! sorry about that

Jan Willem Hoogstraten

03/08/2023, 5:29 PM

checking it right now

Jan Willem Hoogstraten

03/08/2023, 5:31 PM

No effect unfortunately

Jan Willem Hoogstraten

03/08/2023, 5:32 PM

One sec I will double test this again

Jan Willem Hoogstraten

03/08/2023, 5:33 PM


curl 'https://ocpdr54qif7a3tb0p-1.a1.typesense.net/multi_search?x-typesense-api-key=meSE8pinMxwFJlECsmTRVMLCbrrzoL2R' \
  -H 'authority: ocpdr54qif7a3tb0p-1.a1.typesense.net' \
  -H 'accept: application/json, text/plain, */*' \
  -H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8' \
  -H 'cache-control: no-cache' \
  -H 'content-type: text/plain' \
  -H 'origin: https://testapp.digibeetle.eu' \
  -H 'pragma: no-cache' \
  -H 'referer: https://testapp.digibeetle.eu/' \
  -H 'sec-ch-ua: "Chromium";v="110", "Not A(Brand";v="24", "Google Chrome";v="110"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "macOS"' \
  -H 'sec-fetch-dest: empty' \
  -H 'sec-fetch-mode: cors' \
  -H 'sec-fetch-site: cross-site' \
  -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36' \
  --data-raw '{"searches":[{"query_by":"post_title,key_topics,shorthand,full_source_title,labels,relevant_gdpr_recitals,directive_95_46_ec_equivalent,post_content,answered_questions,prelim_qs_referred_or_pleas_in_law,data_categories,data_subject_categories,organisation_focus,sectors,party_a,party_b,party_c,case_law_doc_celex_id,case_law_documents,general_documents","query_by_weights":"127,127,100,50,127,50,50,25,75,75,50,50,50,50,50,50,50,1,1,1","num_typos":0,"highlight_affix_num_tokens":"20","sort_by":"_text_match:desc,post_date:desc","highlight_full_fields":"post_title,key_topics,shorthand,full_source_title,labels,relevant_gdpr_recitals,directive_95_46_ec_equivalent,post_content,answered_questions,prelim_qs_referred_or_pleas_in_law,data_categories,data_subject_categories,organisation_focus,sectors,party_a,party_b,party_c,case_law_doc_celex_id,case_law_documents,general_documents","collection":"sources","q":"\"state of the art\"","facet_by":"key_topics,relevant_gdpr_articles,document_types,sectors,document_categories,document_status,type_of_bcr,competent_supervisory_authority_bcr_lead,case_law_case_status,case_law_case_stage,outcomes_of_the_procedure,type_of_procedure,advocate_general_name,judge_rapporteur,chamber,post_date,source_types,source_abbreviation","max_facet_values":10,"page":1,"per_page":10}]}' \
  --compressed

Jan Willem Hoogstraten

03/08/2023, 5:34 PM

I get the same results

Jason Bosco

03/08/2023, 5:36 PM

Ok, thank you for checking. We’ll take a closer look later today and keep you posted.

🙌 1

Jan Willem Hoogstraten

03/08/2023, 5:39 PM

meanwhile I will double check the results in Typesense Cloud again

👍 1

Jan Willem Hoogstraten

03/08/2023, 5:47 PM

I compared our app and the Cloud again using the same query_by and query_by_weights settings. The results in Cloud are (almost) the same when searching the phrase match with a space prepended in the double-quoted query. They are as we expect them to be in terms of relevance. But the results are different (and not what we expect) without the space in the phrase query.

👍 1

Jan Willem Hoogstraten

03/09/2023, 9:28 AM

@Jason Bosco I presume there is no update on this yet ?

Kishore Nallan

03/09/2023, 12:30 PM

I will be looking into this issue today. Will update.

Kishore Nallan

03/09/2023, 3:53 PM

I've identified the issue and will work on a patch for this. I will keep you posted.

Kishore Nallan

03/10/2023, 11:47 AM

@Jan Willem Hoogstraten I've a fix for this problem. Can we update your cluster to the version with the fix? Let me know if we can go ahead and do that (or you prefer a particular time to do that).

Jan Willem Hoogstraten

03/13/2023, 10:20 AM

Hi @Kishore C You can update the cluster, thanks!

Kishore Nallan

03/13/2023, 10:35 AM

Done, please check again

Jan Willem Hoogstraten

03/13/2023, 10:51 AM

No difference

Kishore Nallan

03/13/2023, 11:00 AM

Hmm, let me look. I did test locally on a similar document that reproduced the issue.

Jan Willem Hoogstraten

03/13/2023, 11:02 AM

Let me know if I can provide you with anything that might help, or maybe I need to refresh/make changes on our side. I’ll re-test this on the cloud as well now, but in our app we don’t see any changes in the results.

Kishore Nallan

03/13/2023, 11:04 AM

Ok see this query:

Kishore Nallan

03/13/2023, 11:04 AM


curl 'https://ocpdr54qif7a3tb0p-1.a1.typesense.net/multi_search?x-typesense-api-key=meSE8pinMxwFJlECsmTRVMLCbrrzoL2R' --data-raw '{"searches":[{"query_by":"case_law_documents","sort_by":"_text_match:desc,post_date:desc","collection":"sources","q":"\"state of the art\"","per_page":10, "highlight_fields": "case_law_documents", "include_fields": "id"}]}' | jq

Kishore Nallan

03/13/2023, 11:04 AM

Earlier, this was producing hits that did not have the full phrase.

Jan Willem Hoogstraten

03/13/2023, 11:07 AM

But that query is only querying 1 attribute

Jan Willem Hoogstraten

03/13/2023, 11:12 AM

Btw, does text_match_info.score tell us anything about the relevance of the hits when using a phrase (double-quoted) query? Because in the cloud and our app we get scores of 100, while the query with a prepended space gives us text match scores that are way higher (e.g. text_match_info.score: 2314894167593451644).

Kishore Nallan

03/13/2023, 11:16 AM

For phrase search since all documents have the exact match, it's just a constant. The match info is misleading there. We should fix it.

Jan Willem Hoogstraten

03/13/2023, 11:42 AM

Thanks for clearing that up, do you need more information from me to be able to reproduce the relevance problem we experience when querying with phrase search using this curl:


curl 'https://cloud.typesense.org/clusters/ocpdr54qif7a3tb0p/api/multi_search' \
  -H 'authority: cloud.typesense.org' \
  -H 'accept: application/json, text/plain, */*' \
  -H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8' \
  -H 'cache-control: no-cache' \
  -H 'content-type: text/plain' \
  -H 'cookie: _gcl_aw=GCL.1671042832.CjwKCAiAheacBhB8EiwAItVO23gnt1Leqk8-8BYwLubtO9k8e2FfBfyTot8gfc9tXYWXeegNH8Pf_RoCgaUQAvD_BwE; _gcl_au=1.1.2056632186.1671042832; _gac_UA-116415641-1=1.1671042832.CjwKCAiAheacBhB8EiwAItVO23gnt1Leqk8-8BYwLubtO9k8e2FfBfyTot8gfc9tXYWXeegNH8Pf_RoCgaUQAvD_BwE; __stripe_mid=67095200-c2fc-4b42-889f-8ac824a73822c95db9; _gac_UA-116415641-2=1.1671042832.CjwKCAiAheacBhB8EiwAItVO23gnt1Leqk8-8BYwLubtO9k8e2FfBfyTot8gfc9tXYWXeegNH8Pf_RoCgaUQAvD_BwE; _ga_XTFPJRM8H9=GS1.1.1673438604.3.0.1673438604.0.0.0; _ga=GA1.2.853002192.1671042832; _gid=GA1.2.634655676.1678705359; _dc_gtm_UA-116415641-1=1; __stripe_sid=268480af-be78-45ce-b9d4-c17b80f5cc1eea59c6; _typesense_cloud_app_session=ERv8bWisgxuGfRaMHP%2FIjcXxj753nfkzx%2F0gY3eRs5zbw3eZL8e8DEm%2BpR5pwRxAKxwNPg9611knkD5JAYOKfOcaWH6N5EwjELCFtmf2e1tWYz8goYHna%2FMA1to%2FgTaPnEbal%2BC40i81sGiiKiQorGiKGdUGB0C0sLfYEiI2w1HiCzuPJL4AWmZMYTTIv32pJlADdyu9OY0txz28jDUk41Ac2Z6GWrTj%2FHDy9jFpEPXeGz4x1uA6pfFwdrnJ4c0qNT3wH41%2B8%2FnQ%2BonBVDd%2FfF%2BjbySRHprCttBHh%2BMVMOk87EW7UouFRfkF9yK8nG8akO7wPlhiUSlhc8uyz8cPjEx24jVvQgvq%2B5YbEVyMR6VjfDrc6v8V2G9fpVhmUuOYvnvIww8FafGpBbkJbiQfjzE1CocLXx73wiP%2BPNfL%2Bw5s%2BqkeFrdOyxXd2RkPKDzQs%2F1nVO4y6AD1Ps47ZWTxqQ6IXQHB17tqtbKPFhBL254h%2BM%2Fpv8zKXkBu0LGShzYTxoFwnMOQLKIBptSFzVbYHyiTB8EhGkGjX0a2bpLkKq9m7qBLNRK4YOENw0EbaoVZMDBA4QlMm9I%3D--ndZv%2FxfNntvyU3R%2B--hm%2FIrIn7YLqDjuw58R0%2B3Q%3D%3D' \
  -H 'origin: https://cloud.typesense.org' \
  -H 'pragma: no-cache' \
  -H 'referer: https://cloud.typesense.org/clusters/ocpdr54qif7a3tb0p/collections/sources/documents/search' \
  -H 'sec-ch-ua: "Google Chrome";v="111", "Not(A:Brand";v="8", "Chromium";v="111"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "macOS"' \
  -H 'sec-fetch-dest: empty' \
  -H 'sec-fetch-mode: cors' \
  -H 'sec-fetch-site: same-origin' \
  -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36' \
  -H 'x-csrf-token: kAR1Yt8jPXq1PyOwGH6cPu7EL80Qpz7gUOKDHveJrWgtSy-gD1WMDuacdFI06KPdE7xgVq-PSAEzfw37vpHWQA' \
  -H 'x-typesense-api-key: iyplt0v67zms5uhdk3w8gxcae4oqbrfn' \
  --data-raw '{"searches":[{"preset":"Main preset like testapp","collection":"sources","q":"\"state of the art\"","page":1,"per_page":5}]}' \
  --compressed

Kishore Nallan

03/13/2023, 11:44 AM

Thanks, let me analyze and get back to you.

Kishore Nallan

03/13/2023, 1:04 PM

The output of that query contains the phrase matches correctly. Can you tell me what issue you are seeing, maybe I am missing something.

Kishore Nallan

03/13/2023, 1:05 PM

First result:


"general_documents": [
              {
                "matched_tokens": [
                  "state",
                  "of",
                  "the",
                  "art",
                  "of",
                  "the"
                ],
                "snippet": "based on the present <mark>state</mark> <mark>of</mark> <mark>the</mark> <mark>art</mark> <mark>of</mark> <mark>the</mark> Internet, which"
              }
            ],

We do highlight all tokens appearing in the query in the snippet, but the phrase "state of the art" does exist.

Jan Willem Hoogstraten

03/13/2023, 2:59 PM

When we prepend a space to the phrase search (“ state of the art”), the results are more relevant (and completely different) compared to “state of the art”.

Jan Willem Hoogstraten

03/13/2023, 3:12 PM

It’s not that we don’t see results that have matches, the problem is that the relevance order is not like we’d expect.

Kishore Nallan

03/13/2023, 3:17 PM

I think that's just a coincidence. I have to check how that additional space is being treated by the engine, maybe that influences a different ordering.

Jan Willem Hoogstraten

03/13/2023, 3:18 PM

The results we see in the cloud using the preset (like in my last curl code), are the least relevant results if we don’t use a prepended space in the phrase search. If we add the space, the results are perfect.

Kishore Nallan

03/13/2023, 3:18 PM

With the space, 750 results are fetched, but without it, only 75.

Kishore Nallan

03/13/2023, 3:19 PM

The earlier problem with non-phrase matches being returned is fixed: this seems different and relevancy related. Will look again and get back to you.

Kishore Nallan

03/13/2023, 3:23 PM

The first result which you deem more relevant (id: 4580) when space is used, also occurs in the query without prefix space, but it occurs much later.

Kishore Nallan

03/13/2023, 3:25 PM

Would you be able to describe to me why you find that more relevant? Because all results are documents that contain the exact phrase. So what makes some more relevant than others? I can use this as a guide to see how we can fix the ordering.

Jan Willem Hoogstraten

03/13/2023, 3:30 PM

I thought the order of the results should be dictated by query_by_weights in combination with _text_match:desc sorting.

Jan Willem Hoogstraten

03/13/2023, 3:30 PM

But I will try to explain why we expect these results to rank high and why the other should be ranked low.

Jan Willem Hoogstraten

03/14/2023, 8:49 AM

If the search query contains more than one word between quotation marks, we get unexpected results.

First example: there is a record (ID 4580) with an attribute post_title and the value EDPB Guidelines 4/2019. If you query this title without double quotes, we get good results, meaning this is the first hit we see. If you use phrase search, this record shows up as the third result, even though post_title has query_by_weights: 115, which is the third-highest weight that we’ve used. But if you phrase search with a space prepended, like " EDPB Guidelines 4/2019", the results are as we’d expect.

Second example: the same record (ID 4580) has an attribute key_topics with an (array) value containing “state of the art”. This attribute has query_by_weights: 127 (the highest weight we’ve used, and we’ve used this weight only for this attribute). The same record also has an attribute called labels containing an array with the value “state of the art”; this labels attribute has query_by_weights: 120. If you phrase search for "state of the art", this record ends up somewhere around ranking position 50. Again, if you phrase search with a space prepended, like " state of the art", the results are as we’d expect.

Kishore Nallan

03/14/2023, 8:52 AM

Thanks for the detailed examples. I've been looking into this behaviour myself and I see that there are cases where the weight is being ignored in phrase search that is having an impact here. With a padded space, the search query is no longer being searched via the phrase search code path so that's why the weights work. I'm working on a fix. I'll keep you posted.

Jan Willem Hoogstraten

03/14/2023, 8:55 AM

Thanks!

Jan Willem Hoogstraten

03/15/2023, 9:08 AM

Hi Kishore, not to rush you but to manage expectations here on my side. Do you have an ETA for the phrase search fix?

Kishore Nallan

03/15/2023, 10:38 AM

I'm working on incorporating the weights properly into phrase search. It will take a few days to implement and then thoroughly test. So I should be able to get you a patched build by early next week.

Jan Willem Hoogstraten

03/15/2023, 11:22 AM

👌

Kishore Nallan

03/21/2023, 6:46 AM

👋 I have a build with a patch. Do you want to test it out on a dev/staging environment first?

Jan Willem Hoogstraten

03/22/2023, 9:11 AM

Hi!

Jan Willem Hoogstraten

03/22/2023, 9:13 AM

You can roll it out to the current environment! Or was it already rolled out because we see that things changed for the better 👌🏽

Kishore Nallan

03/22/2023, 9:53 AM

Not yet, I can roll it out now. Since your instance is not a HA instance there will be a downtime. So let me know if you want to do it at a specific time.

Jan Willem Hoogstraten

03/22/2023, 11:13 AM

Hi, please roll out asap 👌🏽

Kishore Nallan

03/22/2023, 11:37 AM

It's done.

Jan Willem Hoogstraten

03/22/2023, 12:24 PM

Yes this looks perfect, I’m going to do some more testing but it seems like it’s working!

Kishore Nallan

03/22/2023, 12:33 PM

Happy to hear!

Jan Willem Hoogstraten

03/22/2023, 2:11 PM

Jan Willem Hoogstraten

03/22/2023, 2:17 PM

Not sure if this is related, but if we add any symbols in the search bar, our document titles are returned as ‘Undefined’.

Jan Willem Hoogstraten

03/22/2023, 2:19 PM

Like this, all results do show the correct date and they correctly link to the documents.

Jan Willem Hoogstraten

03/22/2023, 2:20 PM

If you start searching using double quote

Kishore Nallan

03/22/2023, 2:28 PM

Can you share the request being made for this query like before?

Kishore Nallan

03/22/2023, 2:47 PM

Ignore, I think we have to handle this. I will get back to you.

Jan Willem Hoogstraten

03/22/2023, 3:27 PM

Thanks

Kishore Nallan

03/22/2023, 4:19 PM

What's happening here is that the special characters are removed from the query string unless they are explicitly allowed via the symbols_to_index configuration. This results in an empty query string, which is treated as a wildcard search. Those "undefined" values are showing up because we don't return highlights for wildcard searches, since the query is essentially empty / catch-all.

Jason Bosco

03/22/2023, 4:41 PM

In general, when you process the response from Typesense to render the UI, you want to first check if a highlight exists for that field inside the highlight key; if it doesn’t, then fall back to the field inside the document key…
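A minimal sketch of that fallback, assuming each hit has a `document` object and a `highlight` object keyed by field name with a `snippet` string (that shape, and the helper name `displayValue`, are assumptions for illustration and may differ by Typesense version):

```javascript
// Hypothetical helper: prefer the highlighted snippet for a field,
// otherwise fall back to the raw value stored in the document.
function displayValue(hit, field) {
  // Assumed shape: hit.highlight[field].snippet (absent on wildcard searches).
  const snippet = hit.highlight?.[field]?.snippet;
  if (typeof snippet === "string" && snippet.length > 0) {
    return snippet;
  }
  const raw = hit.document?.[field];
  return raw === undefined || raw === null ? "" : String(raw);
}

// Wildcard search: no highlight returned, so the document value is used.
const wildcardHit = {
  document: { post_title: "EDPB Guidelines 4/2019" },
  highlight: {},
};
console.log(displayValue(wildcardHit, "post_title")); // "EDPB Guidelines 4/2019"
```

With a fallback like this, a title renders as plain text instead of ‘Undefined’ when no highlight is present.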

Jan Willem Hoogstraten

03/23/2023, 10:23 AM

Thanks for your reply. Instead of showing all results on a wildcard search, how can I prevent a wildcard search altogether? I’m using instantsearch.js

Jason Bosco

03/23/2023, 4:12 PM

Here’s how to do that with instantsearch.js: https://github.com/typesense/showcase-songs-search/blob/e7ad97ce4e09191743abd727c2dfc949811bbcd6/src/app.js#L175-L182

Jan Willem Hoogstraten

03/29/2023, 10:43 AM

Thanks Jason, but I don’t understand how this part (helper.state.query === '') is related to wildcard searches; it just hides something if the search box is empty. A user can still type in special characters and we get the ‘Undefined’ results. I have played around with helper.state.query before, but it seems only useful in specific cases when you only want to listen to the state of the search box. This helper does not account for the use of facets/checkbox-filter states. Anyway, that is something not related to Typesense.

Kishore Nallan

03/29/2023, 10:51 AM

Typesense treats an empty q as a wildcard query as well. If there are only special characters in q then those are removed by tokenizer and we again end up with empty q string.

Jan Willem Hoogstraten

03/29/2023, 10:53 AM

Yes

Jan Willem Hoogstraten

03/29/2023, 11:00 AM

But how can I prevent ‘wildcard’ results? If I type any special character, I see that character as the value for q in the payload. So q is not empty when it’s sent to Typesense, right? Typesense will interpret this as an empty q string, but it will return all results from the index. I want to prevent this last thing from happening.

Kishore Nallan

03/29/2023, 11:02 AM

I regret allowing empty q to be treated as a wildcard 😞 The work around is to mimic Typesense behavior client side. Check if q contains all symbols in addition to checking if it's empty.

Jan Willem Hoogstraten

03/29/2023, 11:04 AM

Yes, that’s what I’m looking for (a client-side solution). It seems that Jason’s example (the songs-search demo) is doing this, but I can’t figure out what code is filtering these special characters.

Jan Willem Hoogstraten

03/29/2023, 11:05 AM

if I search for

for example, nothing happens -> no payload… what part of the code is preventing this from happening?

Kishore Nallan

03/29/2023, 11:07 AM

How about just checking /^[^a-zA-Z0-9]+$/.test(helper.state.query) to check if query string contains only alpha numeric? That can be added along with the empty string check here: https://github.com/typesense/showcase-songs-search/blob/e7ad97ce4e09191743abd727c2dfc949811bbcd6/src/app.js#L176

Jan Willem Hoogstraten

03/29/2023, 11:09 AM

Thanks Kishore, going to give it a try!

Kishore Nallan

03/29/2023, 11:13 AM

Actually the above will not allow a string with both special characters and alphanumeric characters, like "foo? bar".

Kishore Nallan

03/29/2023, 11:17 AM

This will work:


/[a-zA-Z0-9]/.test(helper.state.query)

Kishore Nallan

03/29/2023, 11:18 AM

Will return true if at least one alphanumeric character appears in the query string, which is what we want here.
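Put together, the check could be wired into a function shaped like the instantsearch.js `searchFunction` option, which receives a helper exposing `state.query` and `search()`. This is a sketch rather than the demo's actual code, and `hasSearchableCharacters` is a hypothetical name. The character class is ASCII-only, so it would also treat queries made up solely of non-ASCII letters as empty.

```javascript
// Hypothetical helper: true if the query contains at least one ASCII
// letter or digit, i.e. something that survives symbol stripping.
function hasSearchableCharacters(query) {
  return /[a-zA-Z0-9]/.test(query);
}

// Shaped like the `searchFunction` option of instantsearch.js:
// only trigger a search when the query has searchable characters.
function searchFunction(helper) {
  if (!hasSearchableCharacters(helper.state.query)) {
    return; // empty or symbols-only query: skip the request entirely
  }
  helper.search();
}

console.log(hasSearchableCharacters("?!"));       // false
console.log(hasSearchableCharacters("foo? bar")); // true
```

Because this skips the request whenever the query box is empty, it also blocks searches driven only by facet selections. An app that relies on those would need an extra condition.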
