#contributions

Issue with Text Match Score and Token Matching

TLDR Stefan reported an issue with "text_match_score" and token matching for products. Kishore Nallan explained the logic behind the ranking and suggested using "?text_match_type=max_weight", then fixed the "drop tokens" behavior for better token matching.

Solved
Apr 04, 2023 (5 months ago)
Stefan
07:37 AM
Hello, is there a known bug with the "text_match_score"? I would expect the second product here to be listed first, given that it matches on "brand" and matches more tokens on "name":

Reproduction: https://github.com/reinoldus/typesense-reproduction-repo/tree/token-count-weird

(data in thread)
Stefan
07:38 AM
Also, shouldn't the "tokens_matched" value be 5 in the case of the second product, since it matches "Ultra", "Sheer", "Face", "Serum" and "Neutrogena"? And shouldn't the "best_field_weight" be 3, since it matches "brand"?
Stefan
07:58 AM
RUNNING DOCKER CONTAINER
BUILDING TEST IMAGE
====== TEST OUTPUT START
CREATING SCHEMA
{'created_at': 1680593538, 'default_sorting_field': '', 'enable_nested_fields': False, 'fields': [{'facet': False, 'index': True, 'infix': False, 'locale': '', 'name': 'brand', 'optional': False, 'sort': False, 'type': 'string'}, {'facet': False, 'index': True, 'infix': False, 'locale': '', 'name': 'name', 'optional': False, 'sort': False, 'type': 'string'}], 'name': 'test_collection', 'num_documents': 0, 'symbols_to_index': [], 'token_separators': []}
Ingesting data
Item import response:
         {'brand': 'Wrong brand name', 'id': '0', 'name': 'Neutrogena Ultra Sheer Dry-Touch Sunscreen - / Exp 2022 SPF 100'}
Item import response:
         {'brand': 'Neutrogena', 'id': '1', 'name': "'Ultra Sheer Oil-Free Face Serum With Vitamin E + SPF 60'"}
NUMBER OF DOCUMENTS IN COLLECTION: 2
---------- START TEST QUERY
---------- Search parameters
{'q': 'Neutrogena Ultra Sheer Moisturizing Face Serum',
 'query_by': 'brand,name',
 'query_by_weights': '3,2'}
---------- Query output
{'facet_counts': [],
 'found': 2,
 'hits': [{'document': {'brand': 'Wrong brand name',
                        'id': '0',
                        'name': 'Neutrogena Ultra Sheer Dry-Touch Sunscreen - '
                                '/ Exp 2022 SPF 100'},
           'highlight': {'name': {'matched_tokens': ['Neutrogena',
                                                     'Ultra',
                                                     'Sheer'],
                                  'snippet': '<mark>Neutrogena</mark> '
                                             '<mark>Ultra</mark> '
                                             '<mark>Sheer</mark> Dry-Touch '
                                             'Sunscreen - / Exp 2022 SPF 100'}},
           'highlights': [{'field': 'name',
                           'matched_tokens': ['Neutrogena', 'Ultra', 'Sheer'],
                           'snippet': '<mark>Neutrogena</mark> '
                                      '<mark>Ultra</mark> <mark>Sheer</mark> '
                                      'Dry-Touch Sunscreen - / Exp 2022 SPF '
                                      '100'}],
           'text_match': 1736172819517014033,
           'text_match_info': {'best_field_score': '3315704397824',
                               'best_field_weight': 2,
                               'fields_matched': 1,
                               'score': '1736172819517014033',
                               'tokens_matched': 3}},
          {'document': {'brand': 'Neutrogena',
                        'id': '1',
                        'name': "'Ultra Sheer Oil-Free Face Serum With Vitamin "
                                "E + SPF 60'"},
           'highlight': {'brand': {'matched_tokens': ['Neutrogena'],
                                   'snippet': '<mark>Neutrogena</mark>'},
                         'name': {'matched_tokens': ['Ultra',
                                                     'Sheer',
                                                     'Face',
                                                     'Serum'],
                                  'snippet': "'<mark>Ultra</mark> "
                                             '<mark>Sheer</mark> Oil-Free '
                                             '<mark>Face</mark> '
                                             '<mark>Serum</mark> With Vitamin '
                                             "E + SPF 60'"}},
           'highlights': [{'field': 'name',
                           'matched_tokens': ['Ultra',
                                              'Sheer',
                                              'Face',
                                              'Serum'],
                           'snippet': "'<mark>Ultra</mark> <mark>Sheer</mark> "
                                      'Oil-Free <mark>Face</mark> '
                                      '<mark>Serum</mark> With Vitamin E + SPF '
                                      "60'"},
                          {'field': 'brand',
                           'matched_tokens': ['Neutrogena'],
                           'snippet': '<mark>Neutrogena</mark>'}],
           'text_match': 1733912223744524306,
           'text_match_info': {'best_field_score': '2211897868288',
                               'best_field_weight': 2,
                               'fields_matched': 2,
                               'score': '1733912223744524306',
                               'tokens_matched': 3}}],
 'out_of': 2,
 'page': 1,
 'request_params': {'collection_name': 'test_collection',
                    'per_page': 10,
                    'q': 'Neutrogena Ultra Sheer Moisturizing Face Serum'},
 'search_cutoff': False,
 'search_time_ms': 0}
====== OUTPUT END
test-typesense
Untagged: test-typesense-python:latest
Deleted: sha256:c8a4e607164c48400771800f3b25f9e820f0d5011809a0cf0a002c6de7cb965c
typesense-test-network
Stefan
07:58 AM
Reposted the snippet here in the thread so as not to flood the chat.
Kishore Nallan
07:59 AM
Looking
Kishore Nallan
08:18 AM
We have gone back and forth a bit to accommodate a couple of ways in which people wanted the weights to behave in cross-field matching.

Let me first explain how Typesense handles multi-field text match ranking in the default mode.

- For a given record, we compute a text match score for every field based on how much that field value overlaps with the query tokens. We consider number of overlapping tokens, number of typos etc. to arrive at a per-field score.

- Let's say we are querying 2 fields (brand and name in this case). This will result in text match scores A and B respectively. The highest text match score among the fields becomes the representative score for this record.

- When we rank all the records, this representative score is checked first. The field weight only acts as a tie-breaker when 2 records have the same representative text match score.

Some people wanted this behavior because in many other cases the absolute degree of text match mattered more than the weight. To accommodate the behavior you desire here, we've introduced a flag. Send the "?text_match_type=max_weight" parameter in your search requests.
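The two ranking modes described above can be sketched with a toy model. This is not Typesense's implementation: the per-field score here is just a count of overlapping tokens, the helper names (`field_score`, `rank_key`, `rank`) are hypothetical, and the `max_weight` branch assumes that the highest-weighted matching field is compared first. The query is shortened to the three tokens that the output above reports as matched.

```python
# Toy model of multi-field ranking; token counts stand in for the real text match score.
DOCS = [
    {"id": "0", "brand": "Wrong brand name",
     "name": "Neutrogena Ultra Sheer Dry-Touch Sunscreen - / Exp 2022 SPF 100"},
    {"id": "1", "brand": "Neutrogena",
     "name": "'Ultra Sheer Oil-Free Face Serum With Vitamin E + SPF 60'"},
]
WEIGHTS = {"brand": 3, "name": 2}          # mirrors query_by_weights=3,2
QUERY = ["neutrogena", "ultra", "sheer"]   # tokens reported as matched in the output


def field_score(query_tokens, field_value):
    """Count the distinct query tokens that occur in the field value."""
    field_tokens = {t.strip("'").lower() for t in field_value.split()}
    return sum(1 for t in set(query_tokens) if t in field_tokens)


def rank_key(doc, query_tokens, weights, mode):
    """Build a sort key for one record under the given (assumed) mode."""
    scored = [(field_score(query_tokens, doc[f]), w) for f, w in weights.items()]
    matched = [(s, w) for s, w in scored if s > 0] or [(0, 0)]
    if mode == "max_score":
        # Best per-field score is the anchor; the field weight only breaks ties.
        return max(matched)
    if mode == "max_weight":
        # Assumption: the highest-weighted matching field is the anchor.
        score, weight = max(matched, key=lambda sw: (sw[1], sw[0]))
        return (weight, score)
    raise ValueError(f"unknown mode: {mode}")


def rank(docs, query_tokens, weights, mode):
    """Return document ids, best match first."""
    ordered = sorted(docs, key=lambda d: rank_key(d, query_tokens, weights, mode),
                     reverse=True)
    return [d["id"] for d in ordered]


print(rank(DOCS, QUERY, WEIGHTS, "max_score"))
print(rank(DOCS, QUERY, WEIGHTS, "max_weight"))
```

In this model, document 0 wins under `max_score` because its best field overlaps on three tokens, while document 1 wins under `max_weight` because it is the only record that matches the higher-weighted `brand` field.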
Kishore Nallan
08:19 AM
The default value for this parameter is "max_score" which treats the field with best text matching score as an anchor for ranking.
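As a rough illustration, the reproduction query with the flag added might be assembled like this. The host, port, and the `build_search_url` helper are assumptions for the sketch; the collection name and the other parameters are taken from the output earlier in the thread. A real request would also need the API key header.

```python
from urllib.parse import urlencode


def build_search_url(base_url, collection, params):
    """Assemble a search URL for a collection (hypothetical helper)."""
    query_string = urlencode(params)
    return f"{base_url}/collections/{collection}/documents/search?{query_string}"


search_params = {
    "q": "Neutrogena Ultra Sheer Moisturizing Face Serum",
    "query_by": "brand,name",
    "query_by_weights": "3,2",
    # Omitting this parameter falls back to the default, "max_score".
    "text_match_type": "max_weight",
}

# "http://localhost:8108" is an assumed local server address.
url = build_search_url("http://localhost:8108", "test_collection", search_params)
print(url)
```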
Stefan
08:51 AM
That makes sense, thank you! That fixed it!

I have a smaller issue now though. Maybe I am too focused on this case, but lmk what you think. Now these 3 products rank like this:
• Neutrogena Ultra Sheer Dry-Touch Sunscreen SPF 100
• Neutrogena Ultra Sheer Oil-Free Face Serum With Vitamin E + SPF 60'
• Neutrogena Ultra Sheer Liquid Sunscreen SPF 70
They all have the same text_match_score: 1736146521082036226

But the second product matches one token more (I assume there is a penalty for "unmatched" tokens). Is there a way to prefer names that match more tokens?
Kishore Nallan
09:22 AM
Even if "Neutrogena" appears in both name and brand, we count it only once. So there's no way to prioritize on greater frequency of occurrence of a token.
Stefan
09:27 AM
Hmm, that's not the issue here I think:
the first one matches, neutrogena, ultra, sheer
the second one: neutrogena, ultra, sheer and serum
the third one: neutrogena, ultra, sheer

for query: Neutrogena Ultra Sheer Moisturizing Face Serum
Kishore Nallan
09:28 AM
Both have the same brand?
Stefan
09:28 AM
yes all three
Kishore Nallan
09:54 AM
That's because of the "drop tokens" behavior. Since not all tokens are found in the document, Typesense tries to drop tokens right to left until the query becomes "Neutrogena Ultra Sheer", so all of these records only match on 3 tokens (even though we later on highlight other tokens in the results). I think we should try to re-match the other tokens in the query to see if they exist.
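A toy sketch of the behavior described above, and of the re-matching idea, could look like the following. It is an illustration only: the function names are hypothetical, the matching is a plain token lookup, and the thread does not describe how the eventual fix was implemented.

```python
# Toy model of right-to-left token dropping; not Typesense's implementation.
QUERY = "Neutrogena Ultra Sheer Moisturizing Face Serum"
FIELDS = ["brand", "name"]
DOCS = [
    {"brand": "Neutrogena", "name": "Neutrogena Ultra Sheer Dry-Touch Sunscreen SPF 100"},
    {"brand": "Neutrogena", "name": "Neutrogena Ultra Sheer Oil-Free Face Serum With Vitamin E + SPF 60"},
    {"brand": "Neutrogena", "name": "Neutrogena Ultra Sheer Liquid Sunscreen SPF 70"},
]


def doc_tokens(doc, fields):
    """Collect the lowercased tokens of all queried fields of a record."""
    tokens = set()
    for field in fields:
        tokens |= {t.lower() for t in doc[field].split()}
    return tokens


def drop_tokens_right_to_left(query, docs, fields):
    """Drop trailing tokens until some record contains every remaining token."""
    tokens = query.lower().split()
    while tokens:
        if any(set(tokens) <= doc_tokens(d, fields) for d in docs):
            return tokens
        tokens = tokens[:-1]
    return []


def tokens_matched(doc, kept, fields):
    """Count matches against the shortened query only (the reported behavior)."""
    return len(set(kept) & doc_tokens(doc, fields))


def tokens_matched_with_rematch(doc, query, kept, fields):
    """Also count dropped tokens that do occur in the record (the proposed idea)."""
    dropped = set(query.lower().split()) - set(kept)
    return tokens_matched(doc, kept, fields) + len(dropped & doc_tokens(doc, fields))


kept = drop_tokens_right_to_left(QUERY, DOCS, FIELDS)
before = [tokens_matched(d, kept, FIELDS) for d in DOCS]
after = [tokens_matched_with_rematch(d, QUERY, kept, FIELDS) for d in DOCS]
print(kept, before, after)
```

In this model all three products tie on the shortened query, and only the second one gains extra matches ("face" and "serum") once the dropped tokens are checked again.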
Kishore Nallan
09:56 AM
I'll try to fix this sometime this week.
Stefan
10:02 AM
Thank you! So I don't need to create an issue?
Kishore Nallan
10:12 AM
Not needed. It's already on my list now 🙂
Apr 10, 2023 (5 months ago)
Kishore Nallan
08:43 AM
Stefan, this is fixed in typesense/typesense:0.25.0.rc21. Can you please give it a spin?
Stefan
08:46 AM
Seems to work! At least on my test bench. I will run a few more tests. Thank you!