Hi. What's the best practice to implement bilingual text?
# community-help
j
Hi. What's the best practice to implement bilingual text? E.g. I have a Quebec blog platform, and people write their blog posts (title + text) in English and in French. While reading, users can always switch between the languages. What's the best practice for indexing this? E.g. the user has two input fields, one for the English title and one for the French title. How should I index that in Typesense? Should I, in the backend, just join the titles together and put them into Typesense as one string? E.g. if the titles are "10 best parks in quebec city" and "10 meilleurs parcs à québec", I just join them into "10 best parks in quebec city 10 meilleurs parcs à québec" and insert that into Typesense.

My problem with this approach: what about bilingual titles in Chinese and English? Since these are no longer two Latin languages but one Cantonese and one Latin, I wonder if that could cause any problems.

Google Search, also within Google Maps, has the behaviour I want. You can search for a restaurant in Japanese, but also in English/Latin script.
k
Best to create a field for each language, so title_en and title_fr etc. While searching, you will likely know the user's language preference, so you can choose the appropriate fields to query.
❤️ 1
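A minimal sketch of what that could look like, assuming the typesense-js client and a hypothetical posts collection (the connection details and collection name are made up):

```ts
import Typesense from 'typesense';

// Hypothetical connection details; adjust for your deployment.
const client = new Typesense.Client({
  nodes: [{ host: 'localhost', port: 8108, protocol: 'http' }],
  apiKey: 'xyz',
});

// One string field per language, so the app can query only the
// fields that match the reader's language preference.
await client.collections().create({
  name: 'posts',
  fields: [
    { name: 'title_en', type: 'string' },
    { name: 'title_fr', type: 'string' },
  ],
});
```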
j
@Kishore Nallan this was my idea too, since it's the easiest solution, but I really don't like this approach, because I won't have the magic Google multi-language search anymore. There are many regions on earth where people speak multiple languages: in Switzerland German + French + Italian, in Canada often English + French, in India often English + Hindi + Bengali + Urdu + many more. It is a bad experience for the user to always have to switch between languages manually. :/ Will the first idea (joining up bilingual text) cause problems? If I have a title "Best pizzas in Tokyo 東京で最高のピザ", will that cause problems? I have a friend who has lived in Tokyo for 10 years, and he often searches in Japanese if it's easy, but still quite often in English, because Japanese is a hard language.
k
That works provided the text to be indexed is "tokenized" by spaces, i.e. words must be split into tokens. Some languages like Japanese/Chinese don't have segmentation in their natural written form, so it has to be done as a pre-processing step before indexing.
❤️ 1
Joining up all locales would also mean that you have to disable typo tolerance entirely, because if you try to do typo tolerance on Unicode bytes, it won't be accurate for non-English locales.
❤️ 1
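A sketch of that pre-processing step using the standard Intl.Segmenter API built into modern JavaScript runtimes (this is app-side code, not something Typesense does for you here; the example string comes from the thread):

```ts
// Japanese has no spaces between words, so segment it before indexing.
const segmenter = new Intl.Segmenter('ja', { granularity: 'word' });

function preSegment(text: string): string {
  return Array.from(segmenter.segment(text))
    .filter((s) => s.isWordLike) // drop punctuation/whitespace segments
    .map((s) => s.segment)
    .join(' '); // space-delimited tokens, which Typesense can index as-is
}

// "東京で最高のピザ" becomes something like "東京 で 最高 の ピザ"
const indexedTitle = preSegment('東京で最高のピザ');
```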
j
Thank you very much!! What do you think about a third, hybrid solution that mixes the first and second solutions: all languages that do not require tokenization get joined/mixed, for better UX, and languages that require tokenization (or other preprocessing) have bad luck, won't get mixed, and will have to exist as a separate field of their own.
- Will that work?
- Do you maybe have a list somewhere where I can find all Typesense languages that require tokenization/preprocessing?
Edit: If I mix English, German and French, do I still have to disable typo tolerance?
k
I think you can mix English/German/French etc. We only have experimental support for Japanese/Chinese etc. If you just index those as space-separated tokens and don't enable typos, it should work fine.
❤️ 1
🙏 1
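Under those assumptions, the hybrid document could look like this (a sketch reusing the client and the preSegment helper from the sketches above; the joined title field and the title_ja field are made up, and the collection schema is assumed to define both):

```ts
// European languages share one joined, space-separated field
// (typo tolerance can stay on), while Japanese gets its own
// pre-segmented field that will be queried with typos disabled.
await client.collections('posts').documents().create({
  title: 'Best pizzas in Tokyo Meilleures pizzas à Tokyo',
  title_ja: preSegment('東京で最高のピザ'),
});
```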
j
Thanks!! Sorry for being so dumb, but I don't understand the typo part now 🤣 I thought I had to disable typo tolerance only if I mixed English with Japanese. If I mix English with German, do I have to disable typo tolerance too?
k
No, not with English/German. I meant the typo disabling for the Japanese/Chinese locales.
❤️ 1
j
Ohh, but I don't mix Japanese with Chinese. Do you mean that I generally have to disable typos for Japanese because it is still experimental?
BTW, is there a list somewhere where we can see which languages are stable, which are unstable/experimental, and which are not supported yet?
k
What's supported right now is in flux and has a lot of nuances (added for people who wanted something specific). My recommendation is to pre-segment (i.e. tokenize) non-space-delimited languages and store them in separate fields, and query them like you normally would, but with num_typos: 0 on those fields.
❤️ 1
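A sketch of that recommended query, assuming num_typos accepts a comma-separated list that lines up with the query_by fields (as recent Typesense versions do), and reusing the hypothetical fields from above:

```ts
// Typos stay enabled on the joined European field but are disabled
// on the pre-segmented Japanese field; the num_typos entries line up
// with the query_by fields, left to right. The query text itself is
// pre-segmented the same way as the indexed text.
const results = await client.collections('posts').documents().search({
  q: preSegment('東京のピザ'),
  query_by: 'title,title_ja',
  num_typos: '2,0',
});
```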
We plan to tackle multi-language features later this year.
❤️ 1
Where we can handle segmentation ourselves.
❤️ 1
j
Hi @Kishore Nallan, by any chance, did you guys build some multi-language features already?
k
You still need to mark fields explicitly with a locale. But we have fixed a lot of issues around non-English highlighting, etc.
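For reference, a sketch of marking fields with a locale in the collection schema (the locale codes and field names here are illustrative):

```ts
await client.collections().create({
  name: 'posts',
  fields: [
    // The locale tells Typesense how to tokenize and highlight the field.
    { name: 'title_en', type: 'string', locale: 'en' },
    { name: 'title_ja', type: 'string', locale: 'ja' },
  ],
});
```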
j
Thank you for the quick response. @Kishore Nallan In my app I have a ton of different languages, and the user could be searching in any of them. Does that mean I have to add a title field for each language (e.g. title_en, title_fr, title_se, title_es, title_pl and many more) and then just search through ALL of them with query_by: 'title_en,title_fr,title_se,title_es,title_pl'? Or is there a better approach than this?
k
Yes, correct. Though if all of them are European languages, just using the en locale should work fine 90-95% of the time.
j
And what to do with non-European languages?
Regarding "just using the en locale should work fine 90-95% of the time": do you mean that the title_ prefix is not needed, and that this would work fine: query_by: 'en,fr,se,es,pl'?
k
What I meant is that you could put them all in a single aggregated field for searching. However, you will lose the ability to do proper highlighting.
👍 1
```
query_by: 'title_en,title_fr,title_se,title_es,title_pl'
```
is the recommended approach.
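Putting it together, a sketch of that search with one field per language (field names from the example above; the weights line is an optional, illustrative assumption):

```ts
const results = await client.collections('posts').documents().search({
  q: 'parks',
  query_by: 'title_en,title_fr,title_se,title_es,title_pl',
  // Optionally weight the user's preferred language higher; the list
  // lines up with the query_by fields, left to right.
  query_by_weights: '2,1,1,1,1',
});
```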
j
Thanks a lot!