# community-help
y
Hello, at my company we intend to use Typesense for searching content in a LOT of languages, and I mean including ancient languages that have not been in use for centuries. My question is how Typesense performs in that regard. I'm fine with exact matching for the rarer languages, as long as it searches them at all
j
Do these languages have Unicode character representations and do they have spaces between words?
y
Yes, and no not necessarily
Well I mean for some characters we may be inventing our own font that will use some of the user-defined Unicode regions or whatever, but it's still some unicode codepoint
In fact CJK languages don't have spaces between words either
j
Right, so if a language has Unicode characters and has spaces between words, then it will work out of the box with Typesense. For CJK we’ve had to add a special tokenizer to handle them. So other languages that don’t have spaces between words would also need special tokenizers and won’t work out of the box
y
Well CJK's tokenizer does what, approximate word splitting locations?
Like a tokenizer's purpose is to convert text into an array of words, right?
j
Yup
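(A minimal sketch of that idea, assuming a hypothetical `tokenize` helper: split on whitespace when the script has word boundaries, and fall back to one token per codepoint when it doesn't, which amounts to exact character-level matching.)

```python
# Sketch of what a tokenizer does: turn raw text into an array of words.
# For space-delimited languages a whitespace split suffices; for scripts
# without word boundaries, fall back to one token per Unicode codepoint.

def tokenize(text: str) -> list[str]:
    if " " in text:
        return text.split()
    # No spaces: treat each codepoint as its own token (exact matching).
    return list(text)

print(tokenize("ancient sacred text"))  # → ['ancient', 'sacred', 'text']
print(tokenize("古文書"))                # → ['古', '文', '書']
```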
y
Let's say my tokenizer is only correct 60% of the time about where it splits words, would that be sufficient for basic searching capabilities?
j
You would have to define what “sufficient” means though…
Side note: we don’t support completely user defined tokenizers at the moment
y
Basically for those exotic languages I'm fine with just exact matching
Yeah I can fork typesense that's fine
j
Just remembered this - You might want to also consider using the pre_segmented_query parameter
And then apply your tokenizer in your application side, and index pre-tokenized data into Typesense
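(A sketch of that flow, where `my_tokenizer` is a placeholder for whatever per-language segmenter you build: join your tokens with spaces when indexing, and pass `pre_segmented_query: true` so Typesense splits the query only on spaces instead of re-tokenizing it.)

```python
# Hypothetical segmenter: codepoint-level splitting for a language
# without word boundaries.
def my_tokenizer(text: str) -> list[str]:
    return list(text)

# Index pre-tokenized data: store the space-joined tokens.
document = {"id": "1", "content": " ".join(my_tokenizer("古文書"))}

# Search with pre_segmented_query so Typesense treats q as already segmented.
search_params = {
    "q": " ".join(my_tokenizer("古文")),
    "query_by": "content",
    "pre_segmented_query": True,
}

print(document["content"])   # → 古 文 書
print(search_params["q"])    # → 古 文
```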
y
oh that sounds much easier
I mean, generally speaking, what would happen if I insert a non-tokenized document? To start we'll probably have no custom tokenizers and will add them gradually depending on language popularity, so would those documents be completely unsearchable or what?
Like a giant wall of unicode codepoints with no spaces between them
j
By default we’ll try using the English tokenizer
And if it’s outside the English character space, I think we leave the text as is, except for accented characters which we normalize
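(A rough illustration of the normalization described here, using Python's `unicodedata`; this is an approximation for intuition, not Typesense's actual implementation: accented Latin characters fold to their base letters, while text outside that range passes through unchanged.)

```python
import unicodedata

def normalize_accents(text: str) -> str:
    # Decompose accented characters into base letter + combining mark,
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(normalize_accents("résumé"))  # → resume
print(normalize_accents("古文書"))   # → 古文書 (left as is)
```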
y
So the outcome would be what, unsearchable document? Or would Typesense at least try exact matching?
j
Yeah it should try exact matching Unicode character by character
y
Sounds pretty good, thank you very much