Hello 👋 I am getting some unexpected results when querying Typesense #community-help

Rajaie

12/17/2024, 3:25 AM

Hello 👋 I am getting some unexpected results when querying my collection and am wondering if I might be doing something wrong here. According to https://typesense.org/docs/guide/ranking-and-relevance.html, proximity of words in a field determines the score for that field. I'm on 27.1 and using the default `text_match_type`.

```
Proximity: Whether the query tokens appear verbatim or interspersed with other tokens in the field. Documents in which the query tokens appear right next to each other will be ranked above documents where the query tokens exist but are far apart in a text field.
```

But when I am querying a string field (`title`) for "*Android Mobile Phone*", the top 2 results have a title with an exact match of `android mobile phone`. Both have the same `best_field_score`. The next 8 results all have the same `score`, `text_match`, and `best_field_score` values, which isn't expected given their values:

1. best android mobile phone
2. top android mobile phone
3. top android mobile & smart phone
4. android mobile phone - best mobile

I would expect #4 to have the highest score since the word `mobile` is mentioned twice (`frequency` bullet point from the above link), and #3 to be last because the words aren't right next to each other. #1 and #2 should be in positions #2 and #3.

Fanis Tharropoulos

12/17/2024, 7:53 AM

Using the `prioritize_token_position` parameter and setting it to true will boost results that have the queried tokens appear earlier in the document.
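A minimal sketch of what adding that parameter to a search request could look like. The host, the `products` collection name, and the placeholder API key are borrowed from the curl command shared later in this thread; `build_search_url` is a hypothetical helper written for illustration, not part of Typesense.

```shell
# Hypothetical helper: builds a search URL with prioritize_token_position=true.
# $1 is the URL-encoded query, $2 is the field to search (query_by).
build_search_url() {
  local query="$1" query_by="$2"
  printf 'http://localhost:8108/collections/products/documents/search?q=%s&query_by=%s&prioritize_token_position=true' "$query" "$query_by"
}

url="$(build_search_url "android%20mobile%20phone" "title")"
echo "$url"

# To run it against a live server (the API key is a placeholder):
# curl -s -H 'X-TYPESENSE-API-KEY: xyz' "$url" | jq .
```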

Rajaie

12/17/2024, 8:11 AM

Thanks for the reply. I came across that parameter, but I don't think it's relevant to the issue/question I have. List item #4 has a higher frequency of token matches than the other list items, so my question there is why it doesn't have a higher score. As for token position, my example isn't really about the positioning of the tokens, but about their proximity to each other.

Kishore Nallan

12/18/2024, 1:37 AM

Are you querying on only the title field? The `best_field_score` with the default text match type will pick the best score for the query among all fields queried. Is there another field where a direct verbatim match of `android mobile phone` occurs? Otherwise, please share the full JSON response returned by Typesense.

Rajaie

12/18/2024, 5:42 AM

[Attachment: Untitled]

Rajaie

12/18/2024, 5:42 AM

Yep, only querying on the title field. I've been trying to reproduce what I'm seeing in production on my local machine for the proximity issue, but no luck. I was able to reproduce the issue relating to frequency though. When I run the script above, I get the following results

```
android mobile phone
android mobile phone
top android mobile phone - best android mobile phone
top android mobile phone - best android mobile phone
top android mobile phone
best android mobile phone
top android mobile phone - best android mobile phone
```

I would expect

```
top android mobile phone
best android mobile phone
```

to be in last place since the current last result has 2 repetitions of `android mobile phone`.

Rajaie

12/18/2024, 5:44 AM

JSON output from running script above

[Attachment: Untitled]

Rajaie

12/18/2024, 5:46 AM

As for the issue I'm seeing relating to token proximity, here's the output from the query below

```
# Filter window: the last 30 days, as Unix timestamps
end_time="$(date +%s)"
day=$((60*60*24))
start_time=$(($end_time - 30*$day))
country="CA"
show_highlights=""
include_fields="title"
query_by_weights="1"
query_by="title"
# URL-encoded search query
query="site%20reliability%20engineer"
curl -H 'X-TYPESENSE-API-KEY: xyz' "http://localhost:8108/collections/products/documents/search?q=$query&per_page=100&page=1&query_by=$query_by&highlight_fields=$show_highlights&include_fields=$include_fields&query_by_weights=$query_by_weights&filter_by=posting_date%3A%5B$start_time..$end_time%5D&enable_highlight_v1=false" | jq .
```

Rajaie

12/18/2024, 5:47 AM

I would expect `Principal Site Reliability & Cloud Engineer - Americas` to be in last place.

[Attachment: Untitled]

Kishore Nallan

12/18/2024, 3:46 PM

> since the current last result has 2 repetitions of `android mobile phone`

We don't count repetitions of a token because in many practical cases we found that doing so surfaces bad records with too many repeated words.
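As a toy illustration of that point (and not Typesense's actual scoring code), counting only the distinct query tokens that appear in a title gives the same result whether or not the phrase is repeated. `distinct_matches` is a hypothetical helper, and it assumes lowercase, space-separated text.

```shell
# Toy illustration only; this is NOT Typesense's scoring code.
# distinct_matches QUERY TITLE prints how many distinct query tokens
# appear in TITLE. Repeated occurrences of a token are not counted again.
distinct_matches() {
  local query="$1" title="$2" count=0 token
  for token in $query; do
    case " $title " in
      *" $token "*) count=$((count + 1)) ;;
    esac
  done
  echo "$count"
}

distinct_matches "android mobile phone" "top android mobile phone"
distinct_matches "android mobile phone" "top android mobile phone - best android mobile phone"
```

Both calls print `3`, so under this kind of counting the title that repeats the phrase gets no advantage over the one that contains it once.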

👍 1

Kishore Nallan

12/18/2024, 4:07 PM

> I would expect `Principal Site Reliability & Cloud Engineer - Americas` to be in last place

This one still looks weird. I would need access to your dataset, or at least a small portion of it that can reproduce this issue. Maybe just the title field, which should not contain anything too sensitive to share.

Rajaie

12/18/2024, 4:31 PM

The odd part is that I wasn't able to reproduce this locally. I tried using the script above which uses a subset of my data but when I run the same query, I get the results in the expected order. I'll follow up in this thread if I'm able to reproduce this at some point. Thank you for the help!

Untitled

👍 1

Rajaie

12/18/2024, 5:09 PM

FYI, in case you missed it, I shared the JSON output from production that has the title fields here: https://typesense-community.slack.com/files/U084NBVQQPR/F085Z7HR56V/untitled. Not sure if that has what you're looking for.

Untitled

Kishore Nallan

12/19/2024, 5:52 AM

Yes, I saw that, but I'm unable to see how we end up computing the text match score that way, where the record without consecutively occurring query words is not getting a lower score.
