Handling Zero-width Non-joiner in Typesense

TLDR Arad asks how to make Typesense treat the Persian zero-width non-joiner as equivalent to both a space and no space. Jason suggests adding the Unicode character to `token_separators`, but notes that only single-byte characters are currently supported there. A GitHub issue was created.

Arad
Mon, 09 Oct 2023 16:36:17 UTC

My dataset is in Persian, and there's a commonly used character in Persian called a "zero-width non-joiner" (ZWNJ), which Typesense should treat as equivalent to both a regular whitespace and no whitespace at all. For example, "می کنم" (with a regular whitespace) is equivalent to both "میکنم" (with no whitespace) and "می‌کنم" (with a zero-width non-joiner). Is there any way to express this sort of thing in Typesense? Thanks.
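
To make the equivalence above concrete, here is a minimal Python sketch. It is not a Typesense feature; `spelling_key` is a hypothetical helper that only illustrates how the three spellings of one compound word could be reduced to the same comparison key on the client side.

```python
ZWNJ = "\u200c"  # U+200C ZERO WIDTH NON-JOINER

def spelling_key(word: str) -> str:
    """Drop ZWNJ and whitespace so spelling variants of one word compare equal.

    Hypothetical helper for illustration; intended for a single compound
    word, not for whole sentences (it would glue separate words together).
    """
    return "".join(ch for ch in word if ch != ZWNJ and not ch.isspace())

# The two halves of the example word from the message above.
first, second = "می", "کنم"
variants = [
    first + " " + second,   # regular space
    first + second,         # no space
    first + ZWNJ + second,  # zero-width non-joiner
]

keys = {spelling_key(v) for v in variants}
print(len(keys))  # prints 1: all three variants share one key
```

A key like this could be stored in an extra field or applied to queries before searching, but whether that fits a given dataset is a judgment call.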

Jason
Mon, 09 Oct 2023 16:38:25 UTC

Interesting! TIL... Could you try adding the unicode character for that to `token_separators` when creating the collection schema?
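
A rough sketch of what this suggestion could look like as a collection schema, written as a plain Python dict. The collection name `persian_docs` and the field `title` are made up for illustration. As noted later in this thread, multi-byte separators were not supported at the time, so this schema may not have the intended effect.

```python
import json

ZWNJ = "\u200c"  # U+200C ZERO WIDTH NON-JOINER

# Hypothetical schema: the collection and field names are assumptions.
schema = {
    "name": "persian_docs",
    "fields": [
        {"name": "title", "type": "string"},
    ],
    # Ask Typesense to split tokens on ZWNJ, in addition to whitespace.
    "token_separators": [ZWNJ],
}

# JSON body that would be sent when creating the collection, e.g. via
# client.collections.create(schema) in the Typesense Python client.
payload = json.dumps(schema)
print(payload)
```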

Arad
Mon, 09 Oct 2023 16:45:18 UTC

Jason Yeah that's actually pretty common in Persian :sweat_smile: I think it exists in Arabic too but I'm not totally sure. Compound words usually have this characteristic, so a word like "مهم‌تر" (meaning "more important") could be written like that, or like "مهمتر" or "مهم تر", and all three would be correct/common. The "most correct" one is the one that uses a ZWNJ (well-edited books use ZWNJ, for example), but that doesn't stop people from putting a regular space or no space between the two parts of the word. That's still all too common in day-to-day typing, although some people do use ZWNJs in regular typing as well, but that's because of our OCD! I did actually think about adding ZWNJ to `token_separators`, but would that also take care of the fact that it should be considered equivalent to "nothing" too?

Jason
Mon, 09 Oct 2023 16:48:18 UTC

Yeah token separators won't be matched on...

Jason
Mon, 09 Oct 2023 16:48:37 UTC

But looks like we currently only support single byte characters in token separators...
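
For context on why ZWNJ runs into this limit, assuming the restriction refers to the UTF-8 encoded length of the separator: a regular space encodes to one byte, while ZWNJ (U+200C) encodes to three. A quick Python check:

```python
ZWNJ = "\u200c"  # U+200C ZERO WIDTH NON-JOINER

# Compare the UTF-8 encoded size of a regular space and ZWNJ.
for name, ch in [("space", " "), ("ZWNJ", ZWNJ)]:
    encoded = ch.encode("utf-8")
    print(name, len(encoded), encoded.hex())
# space 1 20
# ZWNJ 3 e2808c
```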

Jason
Mon, 09 Oct 2023 16:49:31 UTC

Could you create a GitHub issue for this specifying the Unicode codes for the characters we should split on, just like space?

Arad
Mon, 09 Oct 2023 17:21:40 UTC

Done:

Jason
Mon, 09 Oct 2023 17:29:59 UTC

Thank you! Could you also add an example that we can use to validate a potential solution? E.g.: "searching for X currently does not return XYZ"

Arad
Mon, 09 Oct 2023 17:47:45 UTC

Sure, will do.