#community-help

Troubleshooting Typo Tolerance Issue with Typesense for Korean

TLDR Minyong informed Kishore Nallan about a typo tolerance issue in Typesense with Korean text. Kishore Nallan suggested adjusting the byte difference limit for Korean, but warned this could slow down the search function. Minyong approved testing the solution.

Oct 19, 2022 (11 months ago)
Minyong
08:36 AM
[Typo tolerance for Korean]
Hello! I am using Typesense for a search app. Our records have fields that mix English and Korean text. I am trying to make the search as lenient as possible to increase recall, so typo tolerance is very important. However, typo tolerance doesn't seem to work well for Korean text. Could you take a look at this reproducible example?

https://gist.github.com/minyonglee/d0129025d04192d8f09f236f4d11165b
Kishore Nallan
08:49 AM
👋 I don't know Korean. Can you please explain what you expect to see vs what you are seeing in the example above?
Minyong
08:52 AM
yes! summary:

q: '김철수님 부트캠프'
field: '김철수 부트캠프'
result: matched_tokens: [ '부트캠프' ]
expected: matched_tokens: [ '김철수님', '부트캠프' ]

English works well:
q: 'Kevin Jordan'
field: 'Kev Jordan'
result: matched_tokens: [ 'Kev', 'Jordan' ]
expected: matched_tokens: [ 'Kev', 'Jordan' ]
Kishore Nallan
09:14 AM
Okay so you are expecting 김철수님 to be a prefix match against 김철수?
Kishore Nallan
09:16 AM
I think the issue is that Typesense allows up to 2 character typos, but this is based on bytes, so really only 2 bytes of difference are allowed. With Korean, each character can be several bytes in UTF-8, so the fuzzy matching is not working. We limit num_typos to 2 bytes because values greater than that are quite expensive.
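
To illustrate the byte-versus-character point above, here is a minimal sketch in Python. It uses a plain Levenshtein distance as a stand-in, not Typesense's actual fuzzy-matching code, and `edit_distance`, `distances`, and `TYPO_LIMIT` are names made up for this example. It compares the two token pairs from the summary, once as characters and once as UTF-8 bytes.

```python
import unicodedata

TYPO_LIMIT = 2  # the 2-typo cap mentioned above (hypothetical constant name)

def edit_distance(a, b):
    """Plain Levenshtein distance; works on str (characters) or bytes."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute x with y
        prev = cur
    return prev[-1]

def distances(query, field):
    """Return (character distance, UTF-8 byte distance) for two tokens."""
    # Assumption: text is compared in precomposed (NFC) form.
    q = unicodedata.normalize("NFC", query)
    f = unicodedata.normalize("NFC", field)
    return edit_distance(q, f), edit_distance(q.encode("utf-8"), f.encode("utf-8"))

for query, field in [("김철수님", "김철수"), ("Kevin", "Kev")]:
    chars, nbytes = distances(query, field)
    print(query, field, "chars:", chars, "bytes:", nbytes,
          "within byte limit:", nbytes <= TYPO_LIMIT)
```

With precomposed Hangul, the Korean pair is one character apart but three bytes apart, so it falls outside a 2-byte limit, while the English pair is two apart either way.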
Minyong
09:28 AM
that makes sense. is there anything I can do to match those two words?
Kishore Nallan
09:31 AM
We can experimentally enable up to 6 byte differences for Korean.
Kishore Nallan
09:32 AM
But I've no idea how slow that will be. Since an average Hangul is 3 bytes in UTF-8, 6 bytes will be roughly equal to 2 character typos.
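
The arithmetic here can be checked with a short Python sketch. It assumes precomposed (NFC) Hangul syllables, and `utf8_width` and `chars_covered` are helper names invented for this example, not Typesense parameters.

```python
import unicodedata

def utf8_width(ch):
    """Number of bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

def chars_covered(byte_budget, sample):
    """Roughly how many characters of `sample` a byte budget covers,
    using the widest character in the sample as the per-character cost."""
    text = unicodedata.normalize("NFC", sample)
    return byte_budget // max(utf8_width(ch) for ch in text)

hangul = unicodedata.normalize("NFC", "김철수님")
print([utf8_width(ch) for ch in hangul])  # per-syllable byte widths
print(chars_covered(2, hangul))           # 2-byte limit on Hangul
print(chars_covered(6, hangul))           # proposed 6-byte limit on Hangul
print(chars_covered(2, "Kevin"))          # 2-byte limit on ASCII text
```

Under these assumptions a 2-byte limit covers no whole Hangul syllable, while 6 bytes cover two, the same reach that 2 bytes give for ASCII letters.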
Minyong
09:52 AM
Got it. We are willing to accept some speed tradeoff, as we are not using search-as-you-type. Please let me know when it's ready to try out.