Hi everybody with the last update Typesense also supports mu typesense #community-help

Steffen Bleher

03/12/2023, 9:26 AM

Hi everybody, with the last update, Typesense also supports multi-lingual tokenization on a field level – which is great. I, unfortunately, have datasets that contain mixed languages. So I cannot specify the language in the collection schema beforehand. Is there a workaround for that? Or is it planned that the standard tokenizer also supports logographic languages? Thanks 🙏

Kishore Nallan

03/12/2023, 9:53 AM

Do you have documents that each belong to a different language, or text that can contain multiple languages within the same paragraph?

Steffen Bleher

03/12/2023, 4:12 PM

It could be both but documents in different languages is more likely. The biggest problem: I don’t know the language of the document. Any clever ideas? 😄

Kishore Nallan

03/12/2023, 4:20 PM

This needs language detection. If you have a handful of languages, it should be easy to do by looking at the Unicode values of the first few characters of the field. This can be done client-side and used to choose the field to be indexed (assuming each field in Typesense has a separate locale).
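
A minimal sketch of the client-side approach described above, in Python. It is an illustration under assumptions, not Typesense's own detection logic: the field names (`text`, `text_ja`, `text_ko`, `text_zh`, `text_th`), the helper names, and the choice of Unicode ranges are all hypothetical. It samples the first few letters of the text, guesses a locale from their code points, and picks the matching per-locale field, falling back to a default field when nothing matches.

```python
# Hypothetical schema fragment: one optional field per locale, plus a default field.
# The field names are assumptions; "locale" and "optional" are Typesense field settings.
SCHEMA_FIELDS = [
    {"name": "text", "type": "string", "optional": True},
    {"name": "text_ja", "type": "string", "locale": "ja", "optional": True},
    {"name": "text_ko", "type": "string", "locale": "ko", "optional": True},
    {"name": "text_zh", "type": "string", "locale": "zh", "optional": True},
    {"name": "text_th", "type": "string", "locale": "th", "optional": True},
]

DEFAULT_FIELD = "text"


def detect_locale(text, sample_size=50):
    """Guess a locale from the code points of the first few letters.

    Returns None when no sampled letter falls in a known range.
    """
    counts = {}
    seen = 0
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace, combining marks
        seen += 1
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:      # Hiragana and Katakana
            locale = "ja"
        elif 0xAC00 <= cp <= 0xD7A3:    # Hangul syllables
            locale = "ko"
        elif 0x0E00 <= cp <= 0x0E7F:    # Thai
            locale = "th"
        elif 0x4E00 <= cp <= 0x9FFF:    # CJK Unified Ideographs
            locale = "zh"
        else:
            locale = None
        if locale is not None:
            counts[locale] = counts.get(locale, 0) + 1
        if seen >= sample_size:
            break
    if not counts:
        return None
    # Japanese text mixes kana with ideographs, so any kana wins over "zh".
    if "ja" in counts:
        return "ja"
    return max(counts, key=counts.get)


def build_document(doc_id, text):
    """Place the text in the field whose locale matches the detected script."""
    locale = detect_locale(text)
    field = f"text_{locale}" if locale else DEFAULT_FIELD
    return {"id": doc_id, field: text}
```

For example, `build_document("1", "こんにちは世界")` puts the text in `text_ja`, while `build_document("2", "Hello world")` falls back to `text`. A script-range check like this cannot tell apart languages that share a script (for instance English and German) and handles mixed-language text only crudely, which is the limitation raised in the rest of this thread.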

Steffen Bleher

03/13/2023, 10:17 AM

Thanks for your ideas. All our documents are user generated and we don’t know the language of these documents (they could be any language possible, or even mixed languages). I’ll think about it, but probably it’s best to wait for a universal tokenizer. Where can I submit that as a feature request?

Kishore Nallan

03/13/2023, 10:20 AM

I don't think there can be a universal tokenizer. Languages require context for tokenization, so at best we could auto-detect the language so that no specific locale needs to be set on the field. You can submit this as a feature request on our GitHub issues.

Steffen Bleher

03/13/2023, 10:24 AM

Yeah, I think that would be a great abstraction layer: detect the language of a field, and if it matches a language-specific tokenizer, use that tokenizer; otherwise use the fallback tokenizer. I’ll submit it 👍 thanks!!

👍 1

Steffen Bleher

03/13/2023, 10:31 AM

@Kishore Nallan: Btw, really impressive progress with v0.24.0. Keep up the great work 🚀

Kishore Nallan

03/13/2023, 10:41 AM

Thank you!
