#community-help

Addressing `num_typos` Inconsistency in Document Search

TLDR: John reported that `num_typos` was not consistently respected when prefix search was turned on. Kishore Nallan explained the special-case matching condition behind this and made it less aggressive in 0.24.0.rc20, which resolved the issue. He also clarified that `num_typos` can only be 0, 1, or 2.

Jul 13, 2022 (18 months ago)
John
12:22 PM
It doesn’t seem like `num_typos` is consistently respected. In our production use case we get results with edit distance 4 even though we have `num_typos: 2`. It only happens with prefix search turned on.

If I have two documents with storka and sparkling and search for starkbin (edit distance 4 and 3 respectively), I get both results on 0.24.0.rc16 and only sparkling on 0.23.0.
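The edit distances quoted above can be checked with a plain Levenshtein computation. The sketch below is a generic textbook implementation, not Typesense's matcher, and the function name `levenshtein` is only illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    # prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # substitute ca with cb, or keep a match
            ))
        prev = curr
    return prev[-1]


query = "starkbin"
distances = {word: levenshtein(query, word) for word in ("storka", "sparkling")}
print(distances)
```

Run as written, this reports 4 for storka and 3 for sparkling, which agrees with the figures in the message. Both are whole-word distances, so they do not account for prefix matching.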
Kishore Nallan
12:24 PM
I will check this out and get back to you. I've the example from the other thread on my list as well.
John
12:24 PM
Thank you Kishore, much appreciated
Jul 14, 2022 (17 months ago)
Kishore Nallan
11:20 AM
This shouldn't match if you look at it purely from a technical perspective. However, we had a use case where people expected a query like q=strawberries to match against a word like strawberry. Since the query word is longer than the indexed word, this is technically not a prefix search, but it "looks" correct and needed to be matched, especially in English, where searching for the plural form of a singular word is common.

So we had to add a special condition here: https://github.com/typesense/typesense/blob/main/src/art.cpp#L1319

The condition allows matching if all of these criteria are met:

a) The indexed word is longer than 5 characters (to reduce false positives)
b) The word in the query is longer than the indexed word (like the strawberry example above)
c) The difference in their lengths is within the maximum number of typos allowed (in this case, 2)

Unfortunately when we make some relaxations like that, some other non-obvious cases like storka/starkbin can match and look odd.
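The three criteria above can be read as a simple gate on word lengths. What follows is a minimal sketch under that reading only. It is not the code at the linked art.cpp line, and the function name `relaxed_match_allowed` is hypothetical.

```python
def relaxed_match_allowed(indexed_word: str, query_word: str, num_typos: int) -> bool:
    """Hypothetical gate mirroring criteria a), b) and c) from the thread."""
    # a) indexed word is longer than 5 characters (reduces false positives)
    long_enough = len(indexed_word) > 5
    # b) query word is longer than the indexed word
    query_is_longer = len(query_word) > len(indexed_word)
    # c) the length difference fits within the allowed number of typos
    within_budget = len(query_word) - len(indexed_word) <= num_typos
    return long_enough and query_is_longer and within_budget


for indexed, query in [("strawberry", "strawberries"), ("storka", "starkbin")]:
    print(indexed, query, relaxed_match_allowed(indexed, query, num_typos=2))
```

Both pairs pass the gate with `num_typos` set to 2, which shows how a case like storka/starkbin can slip through alongside the intended strawberry/strawberries case. The thread does not describe how the remaining characters are compared once the gate passes, so that part is left out.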
John
11:26 AM
Thanks Kishore, I understand and we’ve had some issues with that before as well. It seems like some lemmatization would make sense, e.g. normalizing plural tokens to singular, but I guess that’s much harder to implement as well 🙂
Kishore Nallan
11:27 AM
Yes, lemmatization is a language-specific feature.
John
03:19 PM
Would it be hard to add a parameter to disable this feature? Would be very useful for us.
Jul 15, 2022 (17 months ago)
Kishore Nallan
06:04 AM
I will take a look to see if we can make it less aggressive. Maybe allowing only up to 1 typo difference in these cases, which would reduce the number of false positives. I will have a build in a few days to test this.
Jul 19, 2022 (17 months ago)
Kishore Nallan
11:33 AM
John, I've made this less aggressive in 0.24.0.rc20.
John
11:41 AM
Thanks! I tried it out and it seems to work well now. Strangely, “starkbin” never matches “storka” for me, even with `num_typos=10`, hmm (that’s not a problem since we never want it to match, but I would expect it to!)
Kishore Nallan
11:42 AM
`num_typos` can only be 0, 1, or 2. Fuzzy matching above that is too expensive.
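Given that limit, a client can keep its requests within the supported range before sending them. This is a minimal sketch that only builds the search parameters. The helper name `build_search_params` and the field name `name` are hypothetical, and the thread does not say whether the server clamps or rejects larger values, so the client-side clamp here is an assumption.

```python
MAX_NUM_TYPOS = 2  # upper bound stated in the thread


def build_search_params(q: str, query_by: str, num_typos: int = 2, prefix: bool = True) -> dict:
    """Build a search parameter dict with num_typos kept within 0..2."""
    # Assumption: clamp on the client instead of relying on server behavior.
    effective_typos = max(0, min(num_typos, MAX_NUM_TYPOS))
    return {
        "q": q,
        "query_by": query_by,
        "num_typos": effective_typos,
        "prefix": prefix,
    }


# "name" is a placeholder field; use whichever field your collection searches on.
params = build_search_params("starkbin", "name", num_typos=10)
print(params["num_typos"])
```

With this in place, a request such as `num_typos=10` is sent as 2, so the request always states the tolerance that is actually supported.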
John
11:42 AM
Makes sense, thanks for the clarification.