#community-help

Addressing `num_typos` Inconsistency in Document Search

TLDR: John saw results beyond the configured num_typos limit when prefix search was turned on. Kishore Nallan explained the relaxed matching rule that caused it, made the rule less aggressive in 0.24.0.rc20, and noted that num_typos can only be 0, 1 or 2.

Solved
Jul 13, 2022 (15 months ago)
John
12:22 PM
It doesn’t seem like num_typos is consistently respected: in our production use case we get results with edit distance 4 even though we have num_typos: 2. It only happens with prefix search turned on.

If I have two documents with storka and sparkling and search for starkbin (edit distance 4 and 3 respectively) I get both results on 0.24.0.rc16 and only sparkling on 0.23.0
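
The edit distances quoted above can be checked with a small sketch. It assumes plain Levenshtein distance (insert, delete, substitute) between whole words. The function name `edit_distance` is illustrative, and this is not how Typesense computes matches internally.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


print(edit_distance("starkbin", "storka"))     # 4
print(edit_distance("starkbin", "sparkling"))  # 3
```

Both words are more than 2 edits away from the query as whole words, which is why the storka match with num_typos: 2 looks surprising.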
Kishore Nallan
12:24 PM
I will check this out and get back to you. I've the example from the other thread on my list as well.
John
12:24 PM
Thank you Kishore, much appreciated
Jul 14, 2022 (15 months ago)
Kishore Nallan
11:20 AM
This shouldn't match, if you look at it purely from a technical perspective. However, we had a use case where people expected a query like q=strawberries to match a word like strawberry. Since the query word is longer than the indexed word, this is technically not a prefix search, but it "looks" correct and needed to be matched, especially in English, where searching for the plural form of a singular word is common.

So we had to add a special condition here: https://github.com/typesense/typesense/blob/main/src/art.cpp#L1319

The condition allows matching if all these criteria match:

a) The indexed word is longer than 5 characters (to reduce false positives)
b) The word in the query is longer than the indexed word (like the strawberry example above)
c) The difference in their lengths is within the maximum number of typos allowed (in this case, 2)

Unfortunately when we make some relaxations like that, some other non-obvious cases like storka/starkbin can match and look odd.
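
The three criteria above can be sketched as a simple length check. This is only a reading of the description in this message, not the code at the linked line in art.cpp. The helper name `relaxed_match_allowed` and its parameters are hypothetical.

```python
def relaxed_match_allowed(indexed_word: str, query_word: str, num_typos: int = 2) -> bool:
    """Length-based gate from criteria (a)-(c). Hypothetical helper, not Typesense code."""
    long_enough = len(indexed_word) > 5                                # (a)
    query_is_longer = len(query_word) > len(indexed_word)              # (b)
    within_budget = len(query_word) - len(indexed_word) <= num_typos   # (c)
    return long_enough and query_is_longer and within_budget


print(relaxed_match_allowed("strawberry", "strawberries"))  # True
print(relaxed_match_allowed("storka", "starkbin"))          # True
print(relaxed_match_allowed("berry", "berries"))            # False
```

Under this reading, storka/starkbin passes for the same reason strawberry/strawberries does: the indexed word has 6 characters, the query is longer, and the length difference is 2.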
John
11:26 AM
Thanks Kishore, I understand and we’ve had some issues with that before as well. It seems like some lemmatization would make sense, e.g. normalizing plural tokens to singular, but I guess that’s much harder to implement as well 🙂
Kishore Nallan
11:27 AM
Yes, lemmatization is a language-specific feature.
John
03:19 PM
Would it be hard to add a parameter to disable this feature? Would be very useful for us.
Jul 15, 2022 (15 months ago)
Kishore Nallan
06:04 AM
I will take a look to see if we can make it less aggressive, maybe by allowing only up to 1 typo difference in these cases, which will reduce the number of false positives. Will have a build in a few days to test this.
Jul 19, 2022 (15 months ago)
Kishore Nallan
11:33 AM
John, I've made this less aggressive in 0.24.0.rc20.
John
11:41 AM
Thanks! I tried it out and it seems to work well now. Strangely, “starkbin” never matches “storka” for me, even with num_typos=10, hmm (that’s not a problem since we never want it to match, but I would expect it to!)
Kishore Nallan
11:42 AM
num_typos can only be 0, 1 or 2. It's too expensive to do fuzzy matching above that.
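
Given that limit, a caller might check the value before sending a search. The sketch below is a client-side guard only. The thread does not say how the server treats larger values, so raising an error here is an assumption. `q`, `query_by`, `num_typos` and `prefix` are Typesense search parameters, while the helper name `build_search_params` and the field name `name` are hypothetical.

```python
ALLOWED_NUM_TYPOS = (0, 1, 2)  # limit stated in the message above


def build_search_params(q: str, query_by: str, num_typos: int = 2, prefix: bool = True) -> dict:
    """Build a search parameter dict, rejecting num_typos values outside 0..2.

    Hypothetical client-side helper; server behavior for larger values is not
    described in this thread.
    """
    if num_typos not in ALLOWED_NUM_TYPOS:
        raise ValueError(f"num_typos must be 0, 1 or 2, got {num_typos}")
    return {"q": q, "query_by": query_by, "num_typos": num_typos, "prefix": prefix}


params = build_search_params("starkbin", "name", num_typos=2)
print(params)
```

A guard like this would have surfaced the num_typos=10 attempt immediately, instead of leaving it to look like a matching quirk.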
John
11:42 AM
Makes sense, thanks for the clarification.