#community-help

Korean Language Classification and Support in Typesense

TLDR Pete asked whether Typesense classifies Korean as logographic. Jason clarified that Typesense works with any language that uses spaces between words, and that specialized support for Korean was added in recent versions. He advised Pete to use the latest RC build, set the locale for each field, and test with a Korean dataset.


Nov 09, 2022 (13 months ago)
Pete
09:09 PM
Does Typesense consider Korean logographic? I ask this because I am a fluent speaker of this language and it is a syllabic alphabet, quite different from Chinese, but am hearing conflicting information about how Typesense classifies it.
Jason
09:14 PM
Maybe we butchered the exact terminology with “logographic”, but we really meant to convey that Typesense works for any language that uses spaces between words. Korean doesn’t use spaces between words (and so we classified it as logographic), and we have had to add specialized support for it in recent versions.

Jason
09:15 PM
I’d recommend using the latest RC build, 0.24.0.rcn30, and setting the locale for each field so that it uses the improved Korean tokenizer.
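
For readers following along, here is a minimal sketch of what "setting the locale for each field" might look like in a collection schema. It assumes the Korean locale code is `ko` and that `locale` is set per field in the schema; the collection name `articles_ko`, the field names, and the helper `build_korean_schema` are hypothetical and only for illustration.

```python
import json


def build_korean_schema(collection_name, text_fields):
    """Build a collection schema dict in which every listed string
    field carries the Korean locale (assumed here to be "ko")."""
    return {
        "name": collection_name,
        "fields": [
            # "locale" is set on each field individually, as suggested above.
            {"name": field, "type": "string", "locale": "ko"}
            for field in text_fields
        ],
    }


# Hypothetical collection and field names, for illustration only.
schema = build_korean_schema("articles_ko", ["title", "body"])
print(json.dumps(schema, ensure_ascii=False, indent=2))
```

With the official Python client, a dict like this would typically be passed to `client.collections.create(schema)`; check the documentation for your Typesense version before relying on that.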
Pete
09:17 PM
Cool, I'll try that. Thanks.

There are definitely spaces between words in Korean. I can see how it could seem different from English and the Romance languages, though.
Jason
09:19 PM
I see! Would love to get your feedback on how it works with your Korean dataset. We don’t have native Korean speakers on the core team, so we rely entirely on community feedback to improve support.
