Hi everybody, with the last update, Typesense also...
# community-help
s
Hi everybody, with the last update, Typesense also supports multi-lingual tokenization on a field level – which is great. I, unfortunately, have datasets that contain mixed languages. So I cannot specify the language in the collection schema beforehand. Is there a workaround for that? Or is it planned that the standard tokenizer also supports logographic languages? Thanks 🙏
k
Do you have documents that are each in a different language, or text that can contain multiple languages within the same paragraph?
s
It could be both, but documents in different languages are more likely. The biggest problem: I don’t know the language of the document. Any clever ideas? 😄
k
This needs language detection. If you only have a handful of languages, it should be easy to do by looking at the Unicode values of the first few characters of the field. This can be done client-side and used to choose which field the text is indexed into (assuming each field in Typesense has its own locale).
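A minimal client-side sketch of that idea, assuming a collection with one text field per locale. The field names (`text_ja`, `text_zh`, `text_ko`, `text_th`, `text`) and the helper names are hypothetical and not part of Typesense; the heuristic only inspects Unicode character names, so for example Japanese written purely in kanji would be routed to the Chinese field.

```python
import unicodedata

# Hypothetical mapping from a Unicode character-name prefix to the
# locale-specific field that should receive the text.
SCRIPT_TO_FIELD = {
    "HIRAGANA": "text_ja",
    "KATAKANA": "text_ja",
    "CJK": "text_zh",
    "HANGUL": "text_ko",
    "THAI": "text_th",
}
DEFAULT_FIELD = "text"  # assumed field using the default tokenizer


def pick_field(text, sample_size=20):
    """Choose a field by counting scripts among the first few letters."""
    counts = {}
    seen = 0
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        name = unicodedata.name(ch, "")
        for prefix, field in SCRIPT_TO_FIELD.items():
            if name.startswith(prefix):
                counts[field] = counts.get(field, 0) + 1
                break
        seen += 1
        if seen >= sample_size:
            break
    if not counts:
        return DEFAULT_FIELD
    return max(counts, key=counts.get)


def to_document(doc_id, text):
    """Build a document whose text lives in the field chosen for its script."""
    return {"id": doc_id, pick_field(text): text}
```

For example, `to_document("1", "안녕하세요")` places the text in `text_ko`, while plain Latin-script text falls back to `text`.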
s
Thanks for your ideas. All our documents are user generated and we don’t know the language of these documents (they could be any language possible, or even mixed languages). I’ll think about it, but probably it’s best to wait for a universal tokenizer. Where can I submit that as a feature request?
k
I don't think there can be a universal tokenizer. Languages require context for tokenization, so at best we could auto-detect the language so that no locale has to be specified for the field. You can submit this as a feature request on our GitHub issues.
s
Yeah, I think that would be a great abstraction layer: detect the language of a field, and if it matches a language-specific tokenizer, use that tokenizer, otherwise use a fallback tokenizer. I’ll submit it 👍 thanks!!
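The detect-then-fall-back behaviour described here could look roughly like the following. This is purely illustrative and not how Typesense works internally: the detector, the tokenizer registry, and both tokenizers are hypothetical stand-ins.

```python
def whitespace_tokenize(text):
    """Fallback tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def char_tokenize(text):
    """Crude stand-in for a logographic tokenizer: one token per character."""
    return [ch for ch in text if not ch.isspace()]


# Hypothetical registry of language-specific tokenizers.
TOKENIZERS = {"zh": char_tokenize}


def detect_language(text):
    """Placeholder detector: 'zh' if a CJK unified ideograph appears, else None."""
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":
            return "zh"
    return None


def tokenize(text):
    """Use the tokenizer for the detected language, or the fallback."""
    tokenizer = TOKENIZERS.get(detect_language(text), whitespace_tokenize)
    return tokenizer(text)
```

A real detector would need to handle mixed-language text, which this placeholder does not.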
@Kishore Nallan: Btw, really impressive progress with v0.24.0. Keep up the great work 🚀
k
Thank you!