Issues and Improvements in Typesense for Searching Japanese Text
TLDR sandbox co ltd . ueda detailed issues with Typesense when handling Japanese text, such as incorrect highlighting and results that match regardless of word order. Kishore Nallan acknowledged these problems, suggested that an upcoming version might address some of them, and said that order-sensitive matching is on the roadmap.
Nov 29, 2021
sandbox co ltd . ueda
01:37 AM
Since Japanese does not put spaces between words, I insert a space between every character so that I can use Typesense.
For example, the person's name "おおぬき" becomes "お" "お" "ぬ" "き".
This reduces search accuracy, but some errors are acceptable.
Of course, when searching, the text entered by the user is also split character by character before the query is executed.
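The character-by-character splitting described above could be sketched roughly as follows. This is only an illustration, not the poster's actual code: `toSpacedCharacters` and `buildSearchParams` are hypothetical helper names, and `name` is an assumed field name in an assumed collection schema.

```javascript
// Hypothetical helper: not part of Typesense or its client libraries.
function toSpacedCharacters(text) {
  // Array.from iterates by Unicode code point, so surrogate pairs stay intact.
  return Array.from(text)
    .filter((ch) => ch.trim() !== '') // drop whitespace that is already present
    .join(' ');
}

// The same transformation is applied to the user's input before searching.
// `name` is an assumed field name; replace it with the real schema field.
function buildSearchParams(userInput) {
  return { q: toSpacedCharacters(userInput), query_by: 'name' };
}
```

Applying the same function at both index time and query time keeps the stored tokens and the query tokens consistent.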
If I search for a word in which the same character appears twice in a row, the character immediately following the repeated pair is not highlighted.
example:
{q:'お お ぬ き'}
* There is a space between all the characters.
The result is:
matched_tokens: ['お','お','き']
snippet: "<mark>お</mark> <mark>お</mark> ぬ <mark>き</mark>"
* ぬ is not marked *
The "ぬ" that comes after the twice-repeated "お" is not marked.
The same was true for other words. For example, if you search for i i t c, only c will not be marked.
Is there a solution? Thank you very much.
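One possible client-side fallback, sketched here under assumptions, is to build the highlighted snippet in the front end instead of relying on the returned `snippet`. It does not change what Typesense returns. `markMatchedCharacters` is a hypothetical helper, and it assumes both the stored field and the query are already space-separated single characters.

```javascript
// Hypothetical client-side fallback; the server response is left untouched.
// Assumes the field value is plain text that is escaped elsewhere before rendering.
function markMatchedCharacters(spacedField, spacedQuery) {
  // Collect the distinct query characters once for fast lookup.
  const queryChars = new Set(spacedQuery.split(' ').filter((t) => t !== ''));
  return spacedField
    .split(' ')
    .map((token) => (queryChars.has(token) ? `<mark>${token}</mark>` : token))
    .join(' ');
}
```

Note that this marks every occurrence of a query character in the field, which may be broader than the server-side highlighting would be.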
Kishore Nallan
01:40 AM
typesense/typesense:0.22.0.rcs40
sandbox co ltd . ueda
01:46 AM
I'm using Typesense Cloud.
sandbox co ltd . ueda
01:53 AM
I'm not sure about the server. I'm developing a front end using Vue.js.
I don't know how to use Docker either.
I am looking forward to the next version.
Because every word is split into characters before it is saved, search accuracy is lower than with normal usage.
Is there an option that lets me specify the order of words when searching?
For example, when I search for "た な か", it is a bit of a problem that "な か た", which has the same characters in a different order, also matches.
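Until order-sensitive matching is available on the server side, one workaround is to filter the returned hits in the front end. The sketch below is an assumption-laden illustration, not a Typesense feature: `keepInOrderHits` and `stripSpaces` are hypothetical helpers, and the field name is supplied by the caller. It relies only on each hit exposing its stored record under `document`.

```javascript
// Hypothetical helper: removes all whitespace, including full-width spaces.
function stripSpaces(s) {
  return s.replace(/\s+/g, '');
}

// Hypothetical post-filter: keeps only hits whose field contains the query
// characters contiguously and in the same order as typed.
function keepInOrderHits(hits, spacedQuery, fieldName) {
  const needle = stripSpaces(spacedQuery);
  return hits.filter((hit) =>
    stripSpaces(String(hit.document[fieldName] ?? '')).includes(needle)
  );
}
```

Because the filter runs after the search, result counts and pagination reported by the server will no longer match what is displayed.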
Kishore Nallan
02:21 AM
If you can tell me your TS Cluster ID, we can upgrade the cluster to the latest stable RC build to see if that helps with highlighting.
sandbox co ltd . ueda
03:21 AM
Thank you very much. Japanese, Chinese, and Korean (abbreviated as CJK) are unique languages, so it's a difficult road, but I'm looking forward to future developments.
https://en.wikipedia.org/wiki/CJK_characters
The cluster I'm currently using is already in production, so I can't upgrade it right away.
There are many problems with searching Japanese text, and I am also looking into various solutions on my side.
Typesense, which does not require a server engineer and still provides full-text search, is a very useful tool for small-scale engineers like me.
I will keep supporting you, and I will keep doing my best as well.