Issue with Text Match Score and Token Matching
TLDR Stefan reported an issue with "text_match_score" and token matching for products. Kishore Nallan explained the logic behind the ranking and suggested using "?text_match_type=max_weight", then fixed the "drop tokens" behavior for better token matching.


Apr 04, 2023 (5 months ago)
Stefan
07:37 AM Reproduction: https://github.com/reinoldus/typesense-reproduction-repo/tree/token-count-weird
(data in thread)
Stefan
07:38 AM
Stefan
07:58 AM RUNNING DOCKER CONTAINER
BUILDING TEST IMAGE
====== TEST OUTPUT START
CREATING SCHEMA
{'created_at': 1680593538, 'default_sorting_field': '', 'enable_nested_fields': False, 'fields': [{'facet': False, 'index': True, 'infix': False, 'locale': '', 'name': 'brand', 'optional': False, 'sort': False, 'type': 'string'}, {'facet': False, 'index': True, 'infix': False, 'locale': '', 'name': 'name', 'optional': False, 'sort': False, 'type': 'string'}], 'name': 'test_collection', 'num_documents': 0, 'symbols_to_index': [], 'token_separators': []}
Ingesting data
Item import response:
{'brand': 'Wrong brand name', 'id': '0', 'name': 'Neutrogena Ultra Sheer Dry-Touch Sunscreen - / Exp 2022 SPF 100'}
Item import response:
{'brand': 'Neutrogena', 'id': '1', 'name': "'Ultra Sheer Oil-Free Face Serum With Vitamin E + SPF 60'"}
NUMBER OF DOCUMENTS IN COLLECTION: 2
---------- START TEST QUERY
---------- Search parameters
{'q': 'Neutrogena Ultra Sheer Moisturizing Face Serum',
 'query_by': 'brand,name',
 'query_by_weights': '3,2'}
---------- Query output
{'facet_counts': [],
 'found': 2,
 'hits': [{'document': {'brand': 'Wrong brand name',
                        'id': '0',
                        'name': 'Neutrogena Ultra Sheer Dry-Touch Sunscreen - '
                                '/ Exp 2022 SPF 100'},
           'highlight': {'name': {'matched_tokens': ['Neutrogena',
                                                     'Ultra',
                                                     'Sheer'],
                                  'snippet': '<mark>Neutrogena</mark> '
                                             '<mark>Ultra</mark> '
                                             '<mark>Sheer</mark> Dry-Touch '
                                             'Sunscreen - / Exp 2022 SPF 100'}},
           'highlights': [{'field': 'name',
                           'matched_tokens': ['Neutrogena', 'Ultra', 'Sheer'],
                           'snippet': '<mark>Neutrogena</mark> '
                                      '<mark>Ultra</mark> <mark>Sheer</mark> '
                                      'Dry-Touch Sunscreen - / Exp 2022 SPF '
                                      '100'}],
           'text_match': 1736172819517014033,
           'text_match_info': {'best_field_score': '3315704397824',
                               'best_field_weight': 2,
                               'fields_matched': 1,
                               'score': '1736172819517014033',
                               'tokens_matched': 3}},
          {'document': {'brand': 'Neutrogena',
                        'id': '1',
                        'name': "'Ultra Sheer Oil-Free Face Serum With Vitamin "
                                "E + SPF 60'"},
           'highlight': {'brand': {'matched_tokens': ['Neutrogena'],
                                   'snippet': '<mark>Neutrogena</mark>'},
                         'name': {'matched_tokens': ['Ultra',
                                                     'Sheer',
                                                     'Face',
                                                     'Serum'],
                                  'snippet': "'<mark>Ultra</mark> "
                                             '<mark>Sheer</mark> Oil-Free '
                                             '<mark>Face</mark> '
                                             '<mark>Serum</mark> With Vitamin '
                                             "E + SPF 60'"}},
           'highlights': [{'field': 'name',
                           'matched_tokens': ['Ultra',
                                              'Sheer',
                                              'Face',
                                              'Serum'],
                           'snippet': "'<mark>Ultra</mark> <mark>Sheer</mark> "
                                      'Oil-Free <mark>Face</mark> '
                                      '<mark>Serum</mark> With Vitamin E + SPF '
                                      "60'"},
                          {'field': 'brand',
                           'matched_tokens': ['Neutrogena'],
                           'snippet': '<mark>Neutrogena</mark>'}],
           'text_match': 1733912223744524306,
           'text_match_info': {'best_field_score': '2211897868288',
                               'best_field_weight': 2,
                               'fields_matched': 2,
                               'score': '1733912223744524306',
                               'tokens_matched': 3}}],
 'out_of': 2,
 'page': 1,
 'request_params': {'collection_name': 'test_collection',
                    'per_page': 10,
                    'q': 'Neutrogena Ultra Sheer Moisturizing Face Serum'},
 'search_cutoff': False,
 'search_time_ms': 0}
====== OUTPUT END
test-typesense
Untagged: test-typesense-python:latest
Deleted: sha256:c8a4e607164c48400771800f3b25f9e820f0d5011809a0cf0a002c6de7cb965c
typesense-test-network
Stefan
07:58 AM
Kishore Nallan
07:59 AM
Kishore Nallan
08:18 AM Let me first explain how Typesense handles multi-field text match ranking in the default mode.
- For a given record, we compute a text match score for every field based on how much that field value overlaps with the query tokens. We consider the number of overlapping tokens, the number of typos, etc. to arrive at a per-field score.
- Let's say we are querying 2 fields (brand, name in this case). This will result in text match scores A and B respectively. The highest text match score among the fields becomes the representative score for this record.
- When we rank all the records, this representative score is checked first. The field weight only acts as a tie-breaker when 2 records have the same representative text match score.
Some people wanted this behavior because in many other cases the absolute degree of text match mattered more than the weight. To accommodate the behavior you desire here, we've introduced a flag. Send the "?text_match_type=max_weight" parameter in the search requests.
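As a rough illustration of the two modes described above, the sketch below ranks two records under a simplified model. The per-field scores are made-up small integers (real Typesense scores are large packed integers), and the function names `max_score_key` and `max_weight_key` are hypothetical. The max_weight mode is modeled here as "use the score of the highest-weighted field that matched", which is an assumption about the flag rather than a description of the actual implementation.

```python
from typing import Dict, Tuple

# Field weights as in the reproduction: query_by=brand,name and query_by_weights=3,2.
weights: Dict[str, int] = {"brand": 3, "name": 2}

# Hypothetical per-field text match scores (illustrative values only).
records: Dict[str, Dict[str, int]] = {
    "0": {"brand": 0, "name": 3},  # wrong brand, strong match on name
    "1": {"brand": 1, "name": 2},  # correct brand, weaker match on name
}


def max_score_key(scores: Dict[str, int], weights: Dict[str, int]) -> Tuple[int, int]:
    """Default mode: best per-field score first, that field's weight as tie-breaker."""
    best = max(scores, key=lambda f: (scores[f], weights[f]))
    return scores[best], weights[best]


def max_weight_key(scores: Dict[str, int], weights: Dict[str, int]) -> Tuple[int, int]:
    """Assumed max_weight mode: the highest-weighted matching field decides."""
    matched = [f for f in scores if scores[f] > 0]
    if not matched:
        return 0, 0
    top = max(matched, key=lambda f: weights[f])
    return weights[top], scores[top]


def rank(key_fn) -> list:
    """Sort record ids from best to worst under the given ranking key."""
    return sorted(records, key=lambda rid: key_fn(records[rid], weights), reverse=True)


default_order = rank(max_score_key)
weighted_order = rank(max_weight_key)
print(default_order)   # ['0', '1']
print(weighted_order)  # ['1', '0']
```

In a real request the flag would be sent alongside the other search parameters, for example `'text_match_type': 'max_weight'` next to `query_by` and `query_by_weights`.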
Kishore Nallan
08:19 AM
Stefan
08:51 AM I have a smaller issue now though. Maybe I am too focused on this case, but lmk what you think. Now these 3 products rank like this:
• Neutrogena Ultra Sheer Dry-Touch Sunscreen SPF 100
• Neutrogena Ultra Sheer Oil-Free Face Serum With Vitamin E + SPF 60'
• Neutrogena Ultra Sheer Liquid Sunscreen SPF 70
They all have the same text_match_score: 1736146521082036226
But the second product matches one token more (I assume there is a penalty for "unmatched" tokens). Is there a way to prefer names that match more tokens?
Kishore Nallan
09:22 AM
Stefan
09:27 AM the first one matches: neutrogena, ultra, sheer
the second one: neutrogena, ultra, sheer and serum
the third one: neutrogena, ultra, sheer
for query: Neutrogena Ultra Sheer Moisturizing Face Serum
Kishore Nallan
09:28 AM
Stefan
09:28 AM
Kishore Nallan
09:54 AM Neutrogena Ultra Sheer
--> so both records only match on 3 tokens (even though we later on highlight other tokens in the results). I think we should try to re-match the other tokens in the query to see if they exist.
Kishore Nallan
09:56 AM
Stefan
10:02 AM
Kishore Nallan
10:12 AM
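A rough sketch of the behavior discussed above, assuming query tokens are dropped from the right one at a time until some record contains all remaining tokens, followed by the proposed re-matching step. The helper names (`tokenize`, `drop_tokens_search`, `rematch_count`) and the whitespace tokenizer are hypothetical. This is a simplified model, not Typesense's implementation.

```python
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    # Naive tokenizer (lowercase, split on whitespace); not Typesense's tokenizer.
    return text.lower().split()


def drop_tokens_search(query: str, names: List[str]) -> Tuple[List[str], List[int]]:
    """Drop query tokens from the right until at least one name contains all kept tokens.

    Returns the kept tokens and the indices of the names that contain all of them.
    """
    kept = tokenize(query)
    while kept:
        hits = [i for i, n in enumerate(names) if set(kept) <= set(tokenize(n))]
        if hits:
            return kept, hits
        kept = kept[:-1]  # assumption: the right-most token is dropped first
    return [], []


def rematch_count(query: str, name: str) -> int:
    """Count every query token present in the name, including dropped ones."""
    return len(set(tokenize(query)) & set(tokenize(name)))


query = "Neutrogena Ultra Sheer Moisturizing Face Serum"
names = [
    "Neutrogena Ultra Sheer Dry-Touch Sunscreen SPF 100",
    "Neutrogena Ultra Sheer Oil-Free Face Serum With Vitamin E + SPF 60",
    "Neutrogena Ultra Sheer Liquid Sunscreen SPF 70",
]

kept, hits = drop_tokens_search(query, names)
rematched = [rematch_count(query, names[i]) for i in hits]
print(kept)       # ['neutrogena', 'ultra', 'sheer']
print(hits)       # [0, 1, 2]
print(rematched)  # [3, 5, 3]
```

Under this model all three names tie on the 3 kept tokens, and only the re-matching pass separates the second name, which also contains "face" and "serum".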
Apr 10, 2023 (5 months ago)
Kishore Nallan
08:43 AM typesense/typesense:0.25.0.rc21
-- can you please give it a spin?
Stefan
08:46 AM