#community-help

Issues with Text Match Info and Split Tokens

TLDR Dima reported odd ranking and was unsure how to read text_match_info. Kishore Nallan clarified that split tokens don't do prefix searches, but split matches rank higher because they match more words. He suggested creating a GitHub issue to track it.

Apr 17, 2023 (5 months ago)
Dima
02:15 PM
Hi team! Do you have any documentation for text_match_info? I got weird ranking and am trying to research the problem. This field looks like a good source of debug info, but I'm not sure how to read it.
Dima
02:17 PM
The weird ranking included <mark>T</mark>arget <mark>esti</mark>mation somewhere on the first page for the test keyword.
Kishore Nallan
02:23 PM
What client are you using? A typed client may not expose this field yet. The raw API response will have it.
Dima
02:23 PM
Raw API, yes
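For anyone reading along, here is a minimal sketch of pulling text_match_info out of a raw search response so the hits can be compared side by side. It assumes the response is a JSON object with a hits list whose entries carry document and text_match_info. The sample response, its document titles, and the key and values inside text_match_info are made up for illustration and are not taken from this thread.

```python
import json


def match_info_rows(response: dict) -> list[tuple[str, dict]]:
    """Return (document id, text_match_info) pairs in ranked order.

    Missing keys are tolerated so the helper also works on partial responses.
    """
    rows = []
    for hit in response.get("hits", []):
        doc_id = str(hit.get("document", {}).get("id", "?"))
        info = hit.get("text_match_info", {})
        rows.append((doc_id, info))
    return rows


# Hypothetical response: the titles and numbers below are illustrative only.
sample_response = {
    "hits": [
        {
            "document": {"id": "1", "title": "Target estimation"},
            "text_match_info": {"tokens_matched": 2},
        },
        {
            "document": {"id": "2", "title": "Test cards"},
            "text_match_info": {"tokens_matched": 1},
        },
    ]
}

for doc_id, info in match_info_rows(sample_response):
    # Print every key that is present rather than assuming a fixed schema.
    print(doc_id, json.dumps(info, sort_keys=True))
```

Printing whatever keys are present, instead of hard-coding field names, keeps the helper usable if the server version adds or renames fields.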
Kishore Nallan
03:02 PM
I don't follow, what is it about the text match info you are confused about?
Dima
03:06 PM
I will create a reproducible example a little later. But with split_join_tokens: always, I get T + esti in first place for q: test, while the full match (<mark>Test</mark> cards) only shows up in 4th or 5th place.
Dima
03:08 PM
So there are two problems here:
• I expect that split_join_tokens will not find T + estimates; maybe T + est, but without prefix search
• I expect that a full match will get a higher score than a split one
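A minimal sketch of how the two requests being compared could be built, so the same query can be run with and without forced splitting. The host, the collection name cards, and the field name title are hypothetical placeholders. The block only constructs URLs and sends nothing; a real request would also need an API key header.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, collection: str, q: str,
                     query_by: str, split_join_tokens: str) -> str:
    """Build a search URL that differs only in the split_join_tokens setting."""
    params = {
        "q": q,
        "query_by": query_by,
        "split_join_tokens": split_join_tokens,
    }
    return f"{base_url}/collections/{collection}/documents/search?{urlencode(params)}"


# Hypothetical host, collection, and field names.
BASE_URL = "http://localhost:8108"
url_always = build_search_url(BASE_URL, "cards", "test", "title", "always")
url_off = build_search_url(BASE_URL, "cards", "test", "title", "off")

print(url_always)
print(url_off)
```

Comparing the ranked hits from both URLs shows which results only appear, or only move up, when splitting is forced.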
Kishore Nallan
03:12 PM
Split join tokens probably should not be allowed to do prefix search

Dima
05:02 PM
I also got this weird highlight for the test keyword 🐸 Will add it to the example
[Image 1]
Dima
08:05 PM
I thought about it a little and found why hits from split tokens always end up in the first places. I think it's because basketball is one matched token, while basket ball is two matched tokens, so the second version will always get a higher match score if split_join_tokens is set to always:
https://gist.github.com/b0g3r/69a2268cc0965ce706a06b8d7ae108e1

From my point of view, basketball and basket ball should both have the same weight, but that looks very hard to implement once we add something like he *basket* his *ball*.
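The reasoning above can be sketched with a toy model in which a hit's score is simply the number of query tokens it matches. This is not Typesense's actual scoring, only an illustration of why counting matched tokens favors the split form; the function names and documents are made up.

```python
def tokens_matched(query_tokens: list[str], text: str) -> int:
    """Count how many query tokens appear as whole words in the text."""
    words = set(text.lower().split())
    return sum(1 for token in query_tokens if token in words)


def toy_score(text: str, query_variants: list[list[str]]) -> int:
    """Best token count across all query variants (original and split)."""
    return max(tokens_matched(variant, text) for variant in query_variants)


# The original query plus its split form, as if splitting were always on.
variants = [["basketball"], ["basket", "ball"]]
docs = ["basketball court", "basket ball court"]

ranked = sorted(docs, key=lambda doc: toy_score(doc, variants), reverse=True)
for doc in ranked:
    print(toy_score(doc, variants), doc)
```

In this model the exact match scores 1 and the split match scores 2, so the split hit is listed first even though both documents answer the query equally well.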
Apr 28, 2023 (5 months ago)
Kishore Nallan
11:14 AM
Dima

I just got a chance to look into this in more detail. Split tokens don't do prefix searches. However, while highlighting, we end up highlighting the split word if it is present as a prefix in the text. For example, test could be split as t + est, and if there is a word like estimation in the text, it gets highlighted as <mark>est</mark>imation. We don't "search" for estimation; we just highlight it if present.
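A toy sketch of the highlighting behavior described here, assuming the highlighter marks any word that starts with one of the split parts. This is an illustration only, not the engine's real highlighter, and the function name is hypothetical.

```python
def highlight_prefixes(text: str, tokens: list[str]) -> str:
    """Wrap the leading part of each word in <mark> if it starts with a token."""
    # Try longer tokens first so "est" wins over a shorter overlapping token.
    ordered = sorted(tokens, key=len, reverse=True)

    def mark(word: str) -> str:
        for token in ordered:
            if word.lower().startswith(token.lower()):
                cut = len(token)
                return f"<mark>{word[:cut]}</mark>{word[cut:]}"
        return word

    return " ".join(mark(word) for word in text.split())


# "test" split as t + est, applied to a title that was matched some other way.
print(highlight_prefixes("Target estimation", ["t", "est"]))
```

The point of the sketch is that highlighting is a separate pass over the text: a word can be marked as a prefix match without having been part of the search itself.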

Kishore Nallan
11:15 AM
However, your point about split searches being ranked higher is correct: because the split form has more words than the unsplit word, it matches more tokens and is ranked higher. Can you please create a GitHub issue for this for tracking? We will have to see how to handle it.
