Hello :wave: I am getting some unexpected results...
# community-help
r
Hello 👋 I am getting some unexpected results when querying my collection and am wondering if I might be doing something wrong here. According to https://typesense.org/docs/guide/ranking-and-relevance.html, proximity of words in a field determines the score for that field. I'm on 27.1 and using the default `text_match_type`.
> Proximity: Whether the query tokens appear verbatim or interspersed with other tokens in the field. Documents in which the query tokens appear right next to each other will be ranked above documents where the query tokens exist but are far apart in a text field.
But when I am querying a string field (`title`) for "Android Mobile Phone", the top 2 results have a title with an exact match of `android mobile phone`. Both have the same `best_field_score`. The next 8 results all have the same `score`, `text_match`, and `best_field_score` values, which isn't expected given their titles:
1. best android mobile phone
2. top android mobile phone
3. top android mobile & smart phone
4. android mobile phone - best mobile
I would expect #4 to have the highest score since the word `mobile` is mentioned twice (the `frequency` bullet point from the above link), and #3 to be last because the words aren't right next to each other. #1 and #2 should be in positions #2 and #3.
f
Using the `prioritize_token_position` parameter and setting it to true will boost results where the queried tokens appear earlier in the document.
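For reference, a minimal sketch of what passing that parameter on a search request could look like. The host, port, and `products` collection are taken from the query shared later in this thread; `build_search_url` is a hypothetical helper and the API key is a placeholder.

```shell
# Hypothetical helper: builds a search URL with prioritize_token_position
# enabled. Host, port and collection name are assumptions for this sketch.
build_search_url() {
  local query="$1" query_by="$2"
  printf 'http://localhost:8108/collections/products/documents/search?q=%s&query_by=%s&prioritize_token_position=true' \
    "$query" "$query_by"
}

url="$(build_search_url 'android%20mobile%20phone' 'title')"
echo "$url"

# Against a running server (API key is a placeholder):
# curl -s -H 'X-TYPESENSE-API-KEY: xyz' "$url" | jq .
```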
r
thanks for the reply. i came across that parameter, but i don't think it's relevant to the issue/question I have. for the list item #3, that has a higher frequency of token matches than the other list items, so my question there is why it doesn't have a higher score. as for token position, my example isn't really about positioning of the tokens, but about their proximity to each other
k
Are you querying on only the title field? The `best_field_score` with the default text match type will pick the best score for the query among all fields queried. Is there another field where a direct verbatim match of `android mobile phone` occurs? Otherwise, please share the full JSON response returned by Typesense.
r
Untitled
Yep, only querying on the title field. I've been trying to reproduce what I'm seeing in production on my local machine for the proximity issue, but no luck. I was able to reproduce the issue relating to frequency though. When I run the script above, I get the following results
android mobile phone
android mobile phone
top android mobile phone - best android mobile phone
top android mobile phone - best android mobile phone
top android mobile phone
best android mobile phone
top android mobile phone - best android mobile phone
I would expect
top android mobile phone
best android mobile phone
to be in last place since the current last result has 2 repetitions of `android mobile phone`.
JSON output from running script above
As for the issue I'm seeing relating to token proximity, here's the output from the query below
end_time="$(date +%s)"
day=$((60*60*24))
start_time=$(($end_time - 30*$day))
country="CA"
show_highlights=""
include_fields="title"
query_by_weights="1"
query_by="title"
query="site%20reliability%20engineer"
curl -H 'X-TYPESENSE-API-KEY: xyz' "http://localhost:8108/collections/products/documents/search?q=$query&per_page=100&page=1&query_by=$query_by&highlight_fields=$show_highlights&include_fields=$include_fields&query_by_weights=$query_by_weights&filter_by=posting_date%3A%5B$start_time..$end_time%5D&enable_highlight_v1=false" | jq .
I would expect `Principal Site Reliability & Cloud Engineer - Americas` to be in last place.
k
> since the current last result has 2 repetitions of `android mobile phone`
We don't count repetitions of a token because in many practical cases we found that doing so surfaces bad records with too many repeated words.
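As a rough illustration of what not counting repetitions means for the titles in this thread, here is a toy sketch. It is not Typesense's scoring code; `distinct_tokens_matched` is a hypothetical helper that counts each query token at most once, so a title repeating the phrase scores the same as one containing it once.

```shell
# Toy illustration only (NOT Typesense's implementation): count how many
# distinct query tokens occur in a title, ignoring repeated occurrences.
distinct_tokens_matched() {
  local query="$1" title="$2" count=0 token
  for token in $query; do
    # Pad with spaces so only whole-word matches count.
    case " $title " in
      *" $token "*) count=$((count + 1)) ;;
    esac
  done
  echo "$count"
}

distinct_tokens_matched "android mobile phone" "top android mobile phone"
# prints 3
distinct_tokens_matched "android mobile phone" "top android mobile phone - best android mobile phone"
# prints 3 as well: the second occurrence of each token adds nothing
```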
> I would expect `Principal Site Reliability & Cloud Engineer - Americas` to be in last place
This one still looks weird. We would need access to your dataset, or at least a small portion of it that can reproduce this issue. Maybe just the title field, which should not contain anything sensitive.
r
The odd part is that I wasn't able to reproduce this locally. I tried using the script above which uses a subset of my data but when I run the same query, I get the results in the expected order. I'll follow up in this thread if I'm able to reproduce this at some point. Thank you for the help!
fyi in case you missed it, i shared the json output from production that has the title fields here https://typesense-community.slack.com/files/U084NBVQQPR/F085Z7HR56V/untitled not sure if that has what you're looking for
k
Yes, I saw that, but I'm unable to see how we end up computing the text match score that way, where the record without consecutively occurring query words is not getting a lower score.