#community-help

Issues and Improvements in Typesense for Searching Japanese Text

TLDR sandbox co ltd . ueda described problems with Typesense when handling Japanese text, such as incorrect highlighting and matches that ignore word order. Kishore Nallan acknowledged these problems, suggested that the upcoming 0.22 version might fix the highlighting, and said that enforcing token order is on the roadmap.

Nov 29, 2021
sandbox co ltd . ueda
01:37 AM
Hello. I am Japanese.
Since Japanese is written without spaces between words, I insert a space between every character so that I can use Typesense.
For example, the person's name "おおぬき" is stored as "お" "お" "ぬ" "き".
This reduces the accuracy of the search, but some errors are acceptable.

Of course, when searching, the word entered by the user is also split character by character before the query is executed.
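The character-by-character splitting described above could look roughly like the sketch below. It is only an illustration: `toSpacedCharacters` is a hypothetical helper name, not part of Typesense, and it assumes the same function is applied both to the stored value and to the user's query.

```javascript
// Hypothetical helper (not a Typesense API): turns "おおぬき" into "お お ぬ き".
function toSpacedCharacters(text) {
  return Array.from(text)              // iterate by Unicode code point, not UTF-16 unit
    .filter((ch) => ch.trim() !== '')  // drop any whitespace already in the input
    .join(' ');                        // one space between every remaining character
}

// Usage: apply it to the query before sending it to Typesense.
const searchParameters = { q: toSpacedCharacters('おおぬき') };
console.log(searchParameters.q); // お お ぬ き
```

Using `Array.from` rather than `split('')` keeps characters outside the Basic Multilingual Plane (some rare kanji, emoji) in one piece.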

If I search for a word in which the same character is repeated twice, the character immediately following the repetition is not highlighted.

example:
{q:'お お ぬ き'}

* There is a space between all the characters.

The result is:

matched_tokens: ['お','お','き']
snippet: "<mark>お</mark> <mark>お</mark> ぬ <mark>き</mark>"

* ぬ is not marked *

The "ぬ" that comes after the repeated "お" is not marked.
The same happens with other words. For example, if you search for i i t c, only c is not marked.

Is there a solution? Thank you very much.
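One possible stopgap, sketched here purely as an assumption and not as anything Typesense provides, is to ignore the returned snippet and build the highlight on the client from the stored, space-separated value. `markMatchedCharacters` and `escapeHtml` are hypothetical helper names.

```javascript
// Escape characters that are special in HTML so the result is safe for v-html.
function escapeHtml(text) {
  const entities = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };
  return text.replace(/[&<>"']/g, (ch) => entities[ch]);
}

// Hypothetical client-side highlighter: wraps every token of the stored value
// that also occurs anywhere in the query. It does not check token positions.
function markMatchedCharacters(spacedValue, spacedQuery) {
  const queryTokens = new Set(spacedQuery.split(' ').filter(Boolean));
  return spacedValue
    .split(' ')
    .filter(Boolean)
    .map((token) =>
      queryTokens.has(token) ? `<mark>${escapeHtml(token)}</mark>` : escapeHtml(token))
    .join(' ');
}

console.log(markMatchedCharacters('お お ぬ き', 'お お ぬ き'));
// <mark>お</mark> <mark>お</mark> <mark>ぬ</mark> <mark>き</mark>
```

The trade-off is that this marks every occurrence of a queried character, including ones that did not contribute to the match, so it is closer to a visual hint than an exact reproduction of the engine's matching.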
Kishore Nallan
01:40 AM
👋 Thank you for raising this issue. May I know which version of Typesense (TS) you are using? We have improved the highlighting in the next version, 0.22 (under RC now). Can you check against this Docker build? typesense/typesense:0.22.0.rcs40
sandbox co ltd . ueda
01:46 AM
Kishore Nallan The version I'm using is V0.21.0.
I'm using Typesense Cloud.
sandbox co ltd . ueda
01:53 AM
Kishore Nallan
I'm not sure about the server. I'm developing a front end using Vue.js.
I don't know how to use Docker either.
I am looking forward to the next version.

Because every word is split into single characters before it is stored, the search is less accurate than it would be with normal usage.

Is there an option that allows you to specify the order of words when searching?

For example, when I search for "た な か", "な か た" also matches even though the characters are in a different order, which is a problem for me.
Kishore Nallan
02:21 AM
We don't have a way to enforce ordering, since we have so far focused on English, where this is not as big a problem. But this is on our roadmap to address.
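Until ordering is supported by the engine, one workaround is to post-filter the results on the client and keep only hits in which the query characters appear as one contiguous run in the same order. The sketch below is an assumption-laden illustration: `containsInOrder` and `filterHitsByOrder` are hypothetical helpers, the field name `name` is made up, and each hit is assumed to expose the stored record as `hit.document`.

```javascript
// True when the query tokens occur in the value as one contiguous run, in order.
function containsInOrder(spacedValue, spacedQuery) {
  const valueTokens = spacedValue.split(' ').filter(Boolean);
  const queryTokens = spacedQuery.split(' ').filter(Boolean);
  if (queryTokens.length === 0) return true; // an empty query matches everything
  for (let start = 0; start + queryTokens.length <= valueTokens.length; start++) {
    if (queryTokens.every((token, i) => valueTokens[start + i] === token)) {
      return true;
    }
  }
  return false;
}

// Hypothetical post-filter over a search response's hits array.
function filterHitsByOrder(hits, fieldName, spacedQuery) {
  return hits.filter((hit) => containsInOrder(hit.document[fieldName], spacedQuery));
}

// Usage with made-up hits: only the record with the characters in order survives.
const exampleHits = [
  { document: { name: 'た な か' } },
  { document: { name: 'な か た' } },
];
console.log(filterHitsByOrder(exampleHits, 'name', 'た な か').length); // 1
```

This exact comparison also discards hits that matched only through typo tolerance, and it filters a single page of results, so pagination counts reported by the server will no longer line up with what is displayed.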

If you can tell me your TS Cluster ID, we can upgrade the cluster to the latest stable RC build to see if that helps with highlighting.
Kishore Nallan
02:21 AM
You can also email us: [email protected]
sandbox co ltd . ueda
03:21 AM
Kishore Nallan
Thank you very much. Because Japanese, Chinese, and Korean (abbreviated as CJK) are unique languages, it will be a difficult road, but I'm looking forward to future developments.
https://en.wikipedia.org/wiki/CJK_characters

The cluster I'm currently using is already in production, so I can't upgrade it right away.
There are many problems with searching Japanese, and I am also looking for various solutions here.

Typesense, which provides full-text search without requiring a server engineer, is a very useful tool for small-scale engineers like me.
I will continue to support you, and I will keep doing my best as well.