#community-help

Question About Text_Match and Token Reoccurrence

TLDR Paul asked about the effect of token reoccurrence on _text_match. Kishore Nallan informed them it's not taken into account due to issues with "keyword stuffing". Jason suggested breaking long form content into multiple documents to improve search result relevance.

Solved
Oct 21, 2022 (12 months ago)
Paul
02:20 PM
Hi all, just wondering if anyone knows whether _text_match should be affected by the number of times a token is found within a field? Currently I’m seeing the same score for hits, regardless of how many times the token appears within the field. For example, when searching for a single word, any result that contains that token at least once is returned with the same score, even though one result has the token 10 times in the body field and another has only 1 occurrence.
Kishore Nallan
03:30 PM
We don't take the number of repeating tokens into account for match_score. We used to in earlier versions, but it caused so many false positives due to "keyword stuffing" that we decided not to use it anymore.
Paul
03:43 PM
Hmm, but on sites where there is no user-generated content, it doesn’t make sense that an article that refers to something 100 times would be equal to an article that only references it once. It means having to add a keyword field to articles rather than looking at the natural content of an article. Is there really no way to make this work?

There isn’t any tie-breaking I can apply, so is Typesense not really suitable for long-form content?
Jason
05:44 PM
For long-form content, you want to break it out into multiple documents by, say, paragraphs or lines to increase the granularity of search results, which in turn improves relevance.
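To illustrate the suggestion above, here is a minimal sketch of splitting one article into per-paragraph documents before indexing. It is not an official Typesense recipe. The function name `split_into_paragraph_docs` and the field names `article_id`, `paragraph_index`, `title`, and `body` are hypothetical, and the sketch assumes paragraphs are separated by blank lines.

```python
import re


def split_into_paragraph_docs(article_id, title, body):
    """Split one long article into one document per paragraph.

    Field names here are illustrative, not a required schema.
    """
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
    return [
        {
            # A unique id per paragraph, derived from the parent article.
            "id": f"{article_id}-{index}",
            # Keeps a link back to the parent article.
            "article_id": article_id,
            "title": title,
            "paragraph_index": index,
            "body": paragraph,
        }
        for index, paragraph in enumerate(paragraphs)
    ]


article_body = (
    "Typesense is a search engine.\n\n"
    "It ranks hits with _text_match.\n\n\n"
    "Short paragraphs give finer results."
)
docs = split_into_paragraph_docs("article-42", "Intro", article_body)
for doc in docs:
    print(doc["id"], "|", doc["body"])
```

Each resulting dictionary could then be indexed as its own document. With this layout, an article that mentions a term in many paragraphs produces many matching documents rather than a single hit. If one result per article is wanted, grouping hits on a field such as `article_id` (for example with Typesense's `group_by` search parameter, which needs that field to be a facet) may be worth trying.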