TLDR Arad asks how to make Typesense treat the Persian zero-width non-joiner (ZWNJ) as equivalent to both a regular space and no space. Jason suggests adding the Unicode character to `token_separators`, but notes that only single-byte characters are currently supported there. A GitHub issue was created.
Interesting! TIL... Could you try adding the unicode character for that to `token_separators` when creating the collection schema?
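For reference, a minimal sketch of what that suggestion could look like with the Python client. The collection name `articles_fa` and the field `title` are hypothetical, and the connection details in the commented-out part are placeholders. Note that, as mentioned further down in this thread, multi-byte separators were not supported at the time, so this schema may not behave as intended on such versions.

```python
# ZWNJ (zero-width non-joiner) is the Unicode code point U+200C.
ZWNJ = "\u200c"

# Hypothetical collection schema; only `token_separators` is the point here.
schema = {
    "name": "articles_fa",           # assumed collection name
    "fields": [
        {"name": "title", "type": "string"},  # assumed field
    ],
    # Ask Typesense to split tokens on ZWNJ, in addition to whitespace.
    "token_separators": [ZWNJ],
}

# Creating the collection needs a running server, so it is left commented out:
# import typesense
# client = typesense.Client({
#     "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
#     "api_key": "xyz",              # placeholder key
#     "connection_timeout_seconds": 2,
# })
# client.collections.create(schema)
```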
Jason Yeah that's actually pretty common in Persian :sweat_smile: I think it exists in Arabic too but I'm not totally sure. Compound words usually have this characteristic, so a word like "مهم‌تر" (meaning "more important") could be written like that, or like "مهمتر" or "مهم تر", and all three would be correct/common. The "most correct" one is the one that uses a ZWNJ (well-edited books use ZWNJ, for example), but that doesn't stop people from putting a regular space or no space between the two parts of the word. That's still all-too-common in day-to-day typing, although some people do use ZWNJs in regular typing as well, but that's because of our OCD! I did actually think about adding ZWNJ to `token_separators`, but would that also take care of the fact that it should be considered equivalent to "nothing" too?
Yeah token separators won't be matched on...
But looks like we currently only support single byte characters in token separators...
Could you create a GitHub issue for this specifying the Unicode codes for the characters we should split on, just like space?
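For context on the single-byte limitation: ZWNJ is U+200C, which takes three bytes in UTF-8, whereas a regular space (U+0020) takes one. A quick way to check this in Python:

```python
SPACE = " "
ZWNJ = "\u200c"  # zero-width non-joiner

for name, ch in [("space", SPACE), ("ZWNJ", ZWNJ)]:
    encoded = ch.encode("utf-8")
    # Print the code point, the UTF-8 byte count, and the raw bytes in hex.
    print(f"{name}: U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```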
Done:
Thank you! Could you also add an example that we can use to validate a potential solution? "Eg: searching for X currently does not return XYZ"
Sure, will do.
Arad
Mon, 09 Oct 2023 16:36:17 UTC
My dataset is in Persian, and there's a commonly-used character in Persian called a "zero-width non-joiner" (ZWNJ), which Typesense should actually consider equivalent to both a regular whitespace AND no whitespace. For example, "می کنم" (with a regular whitespace) is equivalent to both "میکنم" (with no whitespace) and "می‌کنم" (with a zero-width non-joiner).
Is there any way to express this sort of thing in Typesense? Thanks.
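To make the requested equivalence concrete, here is a rough client-side sketch that expands a string containing ZWNJs into the three spellings described above. This is not a Typesense feature and not something proposed in the thread; `zwnj_variants` is a hypothetical helper. It only covers text that already contains a ZWNJ, so it cannot recover the ZWNJ form from a spelling with a space or with no space.

```python
ZWNJ = "\u200c"  # zero-width non-joiner, U+200C


def zwnj_variants(text: str) -> list[str]:
    """Return the text as-is, with ZWNJ replaced by a space, and with ZWNJ removed."""
    variants = [
        text,                      # original spelling (with ZWNJ)
        text.replace(ZWNJ, " "),   # ZWNJ treated as a regular space
        text.replace(ZWNJ, ""),    # ZWNJ treated as "nothing"
    ]
    # Drop duplicates while keeping order (e.g. when the text has no ZWNJ).
    return list(dict.fromkeys(variants))


if __name__ == "__main__":
    for variant in zwnj_variants("می" + ZWNJ + "کنم"):
        print(repr(variant))
```

One possible use is to index or query all returned variants, at the cost of extra storage or extra queries.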