#community-help

Troubleshooting Typo Tolerance Issue with Typesense for Korean

TLDR Minyong informed Kishore Nallan about a typo tolerance issue in Typesense with Korean text. Kishore Nallan suggested adjusting the byte difference limit for Korean, but warned this could slow down the search function. Minyong approved testing the solution.

Oct 19, 2022 (14 months ago)
Minyong
08:36 AM
[Typo tolerance for Korean]
Hello! I am using Typesense for a search app. Our records have fields that mix English and Korean text. I am trying to make the search as lenient as possible to increase recall, so typo tolerance is very important. However, typo tolerance doesn't seem to work well for Korean text. Could you take a look at this reproducible example?

https://gist.github.com/minyonglee/d0129025d04192d8f09f236f4d11165b
Kishore Nallan
08:49 AM
👋 I don't know Korean. Can you please explain what you expect to see vs what you are seeing in the example above?


Minyong
08:52 AM
Yes! Summary:

q: '김철수님 부트캠프'
field: '김철수 부트캠프'
result: matched_tokens: [ '부트캠프' ]
expected: matched_tokens: [ '김철수님', '부트캠프' ]

English works well:
q: 'Kevin Jordan'
field: 'Kev Jordan'
result: matched_tokens: [ 'Kev', 'Jordan' ]
expected: matched_tokens: [ 'Kev', 'Jordan' ]
Kishore Nallan
09:14 AM
Okay so you are expecting 김철수님 to be a prefix match against 김철수?


Kishore Nallan
09:16 AM
I think the issue is that Typesense allows up to 2 character typos, but this is counted in bytes, so really only 2 bytes of difference are allowed. With Korean, each character can be several bytes in UTF-8, so the fuzzy matching does not work. We limit num_typos to 2 bytes because values greater than that are quite expensive.


Minyong
09:28 AM
That makes sense. Is there anything I can do to match those two words?
Kishore Nallan
09:31 AM
We can experimentally enable up to 6 bytes of difference for Korean.
Kishore Nallan
09:32 AM
But I've no idea how slow that will be. Since an average Hangul character is 3 bytes in UTF-8, 6 bytes will be roughly equal to 2 character typos.
Minyong
09:52 AM
Got it. We are willing to accept some speed tradeoff, as we are not using search-as-you-type. Please let me know when it's ready to try out.
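
The byte-versus-character point in this thread can be checked with a small sketch. It is not Typesense's implementation: `levenshtein` is a hypothetical helper written for illustration, and the strings are assumed to be precomposed (NFC) Hangul, where each syllable takes 3 bytes in UTF-8. It compares the edit distance counted in characters with the edit distance counted in UTF-8 bytes for the two examples above.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (str or bytes); illustrative helper."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def distances(query, field):
    """Return (character distance, UTF-8 byte distance) for a pair of tokens."""
    return (
        levenshtein(query, field),
        levenshtein(query.encode("utf-8"), field.encode("utf-8")),
    )


# Korean: one extra syllable is 1 character but 3 bytes.
korean = distances("김철수님", "김철수")
# English: ASCII letters are 1 byte each, so both counts agree.
english = distances("Kevin", "Kev")

print("Korean  (chars, bytes):", korean)   # (1, 3)
print("English (chars, bytes):", english)  # (2, 2)
```

Under a 2-byte budget, 'Kevin' vs 'Kev' fits, while '김철수님' vs '김철수' needs 3 bytes even though it differs by a single character. That is consistent with the suggestion that a 6-byte budget would cover roughly 2 Hangul characters.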

