Hi. What's the best practice to implement bilingual text?
# community-help
j
Hi. What's the best practice to implement bilingual text? E.g. I have a Quebec blog platform, and people write their blog posts (title + text) in English and in French. While reading, users can always switch between the languages. What's the best practice for indexing this? E.g. the user has two input fields, one for the English title and one for the French title. How should I index that in Typesense? Should I, in the backend, just join the titles together and put them into Typesense as one string? E.g. if the titles are "10 best parks in quebec city" and "10 meilleurs parcs à québec", I just join them into "10 best parks in quebec city 10 meilleurs parcs à québec" and insert that into Typesense.

My problem with this approach: what about bilingual titles in Chinese and English? Since these are no longer two Latin languages but one Cantonese and one Latin, I wonder if that could cause any problems.

Google Search, also within Google Maps, has the behaviour I want. You can search for a restaurant in Japanese, but also in English/Latin script.
k
Best to create a field for each language, so title_en and title_fr etc. While searching, you will likely know the user's language preference, so you can choose the appropriate fields to query.
❤️ 1
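A minimal sketch of what that could look like, assuming the typesense-js client and a hypothetical posts collection (the connection details and collection name are made up):

```ts
import Typesense from 'typesense';

// Hypothetical connection details; adjust for your deployment.
const client = new Typesense.Client({
  nodes: [{ host: 'localhost', port: 8108, protocol: 'http' }],
  apiKey: 'xyz',
});

// One string field per language, so the app can query only the
// fields that match the reader's language preference.
await client.collections().create({
  name: 'posts',
  fields: [
    { name: 'title_en', type: 'string' },
    { name: 'title_fr', type: 'string' },
  ],
});
```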
j
@Kishore Nallan this was my idea too, since it's the easiest solution, but I really don't like this approach, because I won't have the magic Google multi-language search anymore. There are many regions on earth where people speak multiple languages: in Switzerland German + French + Italian, in Canada often English + French, in India often English + Hindi + Bengali + Urdu + many more. It is a bad experience for the user to always have to switch between languages manually. :/ Will the first idea (joining up bilingual text) cause problems? If I have a title "Best pizzas in Tokyo 東京で最高のピザ", will that cause problems? I have a friend who has lived in Tokyo for 10 years, and he often searches in Japanese if it's easy, but still quite often in English, because Japanese is a hard language.
k
That works provided the text to be indexed is "tokenized" by spaces, i.e. words must be split into tokens. Some languages like Japanese/Chinese don't have segmentation in their natural written form, so it has to be done as a pre-processing step before indexing.
❤️ 1
Joining up all locales would also mean that you have to disable typo tolerance entirely, because if you try to do typo tolerance on Unicode bytes, it won't be accurate for non-English locales.
❤️ 1
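A sketch of that pre-processing step using the standard Intl.Segmenter API built into modern JavaScript runtimes (this is app-side code, not something Typesense does for you here; the example string comes from the thread):

```ts
// Japanese has no spaces between words, so segment it before indexing.
const segmenter = new Intl.Segmenter('ja', { granularity: 'word' });

function preSegment(text: string): string {
  return Array.from(segmenter.segment(text))
    .filter((s) => s.isWordLike) // drop punctuation/whitespace segments
    .map((s) => s.segment)
    .join(' '); // space-delimited tokens, which Typesense can index as-is
}

// "東京で最高のピザ" becomes something like "東京 で 最高 の ピザ"
const indexedTitle = preSegment('東京で最高のピザ');
```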
j
Thank you very much!! What do you think about a third, hybrid solution that mixes the first and second solutions: all languages that do not require tokenization get joined/mixed, for better UX, and languages that require tokenization (or other preprocessing) have bad luck, won't get mixed, and will have to exist as a separate field of their own.
- Will that work?
- Do you maybe have a list somewhere where I can find all Typesense languages that require tokenization/preprocessing?
Edit: If I mix English, German and French, do I still have to disable typo tolerance?
k
I think you can mix English/German/French etc. We only have experimental support for Japanese/Chinese etc. If you just index those as space-separated tokens and don't enable typos, it should work fine.
❤️ 1
🙏 1
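Under those assumptions, the hybrid document could look like this (a sketch reusing the client and the preSegment helper from the sketches above; the joined title field and the title_ja field are made up, and the collection schema is assumed to define both):

```ts
// European languages share one joined, space-separated field
// (typo tolerance can stay on), while Japanese gets its own
// pre-segmented field that will be queried with typos disabled.
await client.collections('posts').documents().create({
  title: 'Best pizzas in Tokyo Meilleures pizzas à Tokyo',
  title_ja: preSegment('東京で最高のピザ'),
});
```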
j
Thanks!! Sorry for being so dumb, but I don't understand the typo part now 🤣 I thought I had to disable typo tolerance only if I mixed English with Japanese. If I mix English with German, do I have to disable typo tolerance too?
k
No, not with English/German. I meant the typo disabling for the Japanese/Chinese locales.
❤️ 1
j
Ohh, but I don't mix Japanese with Chinese. Do you mean that I generally have to disable typos for Japanese because it is still experimental?
BTW, is there a list somewhere where we can see which languages are stable, which are unstable/experimental, and which are not supported yet?
k
What's supported right now is in flux and has a lot of nuances (added for people who wanted something specific). My recommendation is to pre-segment (i.e. tokenize) non-space-delimited languages and store them in separate fields, and query them like you normally would, but with num_typos: 0 on those fields.
❤️ 1
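A sketch of that recommended query, assuming num_typos accepts a comma-separated list that lines up with the query_by fields (as recent Typesense versions do), and reusing the hypothetical fields from above:

```ts
// Typos stay enabled on the joined European field but are disabled
// on the pre-segmented Japanese field; the num_typos entries line up
// with the query_by fields, left to right. The query text itself is
// pre-segmented the same way as the indexed text.
const results = await client.collections('posts').documents().search({
  q: preSegment('東京のピザ'),
  query_by: 'title,title_ja',
  num_typos: '2,0',
});
```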
We plan to tackle multi-language features later this year.
❤️ 1
Where we can handle segmentation ourselves.
❤️ 1
j
Hi @Kishore Nallan, by any chance, did you guys build some multi-language features already?
k
You still need to mark fields explicitly with a locale. But we have fixed a lot of issues around non-English highlighting, etc.
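For reference, a sketch of marking fields with a locale in the collection schema (the locale codes and field names here are illustrative):

```ts
await client.collections().create({
  name: 'posts',
  fields: [
    // The locale tells Typesense how to tokenize and highlight the field.
    { name: 'title_en', type: 'string', locale: 'en' },
    { name: 'title_ja', type: 'string', locale: 'ja' },
  ],
});
```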
j
Thank you for the quick response. @Kishore Nallan In my app I have a ton of different languages, and the user could be searching in any of them. Does that mean I have to add a title field for each language (e.g. title_en, title_fr, title_se, title_es, title_pl and many more) and then just search through ALL of them with query_by: 'title_en,title_fr,title_se,title_es,title_pl'? Or is there a better approach than this?
k
Yes, correct. Though if all of them are European languages, just using the en locale should work fine 90-95% of the time.
j
And what to do with non-European languages?
Regarding "just using the en locale should work fine 90-95% of the time": do you mean that the title_ prefix is not needed, and that this would work fine: query_by: 'en,fr,se,es,pl'?
k
What I meant is that you could put them all in a single aggregated field for searching. However, you will lose the ability to do proper highlighting.
👍 1
```
query_by: 'title_en,title_fr,title_se,title_es,title_pl'
```
is the recommended approach.
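Putting it together, a sketch of that search with one field per language (field names from the example above; the weights line is an optional, illustrative assumption):

```ts
const results = await client.collections('posts').documents().search({
  q: 'parks',
  query_by: 'title_en,title_fr,title_se,title_es,title_pl',
  // Optionally weight the user's preferred language higher; the list
  // lines up with the query_by fields, left to right.
  query_by_weights: '2,1,1,1,1',
});
```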
j
Thanks a lot!