#random

Typesense Multilingual Document Search

TLDR: Mridul needed to search across source documents and their translations. Jason and Sergio suggested storing all translations in the same document, using regex (wildcard) field names plus explicitly defined fields for locales that need their own tokenizer, and rebuilding the collection when adding fields.

May 18, 2023 (4 months ago)
Mridul
01:26 AM
Hi all, our documents are structured as source documents and translated documents. When someone searches, we need to search across the source and its translations.
However, when results are returned, we need each hit to include the source and all of its translations (even the ones that did not match), and to count that set as 1 document.
Is this possible within Typesense?
All the related documents have a relationId field with the same value.
Jason
02:05 AM
I’d recommend putting all translations inside the same document in Typesense, when indexing
Eg: { fieldA_en, fieldA_fr, fieldA_de, fieldB_en, fieldB_fr, fieldB_de }
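A minimal sketch of that merge step, assuming each DB row carries `relationId`, `locale`, `title` and `body` (only `relationId` comes from the thread; the other names and the `merge_translations` helper are hypothetical). It collapses every group of related rows into one document with per-locale fields:

```python
from collections import defaultdict


def merge_translations(rows, translated_fields=("title", "body")):
    """Collapse rows that share a relationId into one document per group."""
    merged = defaultdict(dict)
    for row in rows:
        doc = merged[row["relationId"]]
        # Typesense document ids are strings; reuse the shared relationId.
        doc["id"] = str(row["relationId"])
        for field in translated_fields:
            if field in row:
                # e.g. title + "fr" -> title_fr
                doc[f"{field}_{row['locale']}"] = row[field]
    return list(merged.values())


# Hypothetical rows, as they might come out of the source database.
rows = [
    {"relationId": 7, "locale": "en", "title": "Hello", "body": "Good morning"},
    {"relationId": 7, "locale": "fr", "title": "Bonjour", "body": "Bon matin"},
    {"relationId": 8, "locale": "en", "title": "Goodbye", "body": "See you"},
]
documents = merge_translations(rows)
```

Because each group becomes a single document, a match in any locale field returns the source and all its translations together, and the hit counts once.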
Mridul
03:55 AM
Thanks Jason…I was trying to avoid that because the data sync from the DB would no longer be a single-row update…but maybe that’s the only way
May 19, 2023 (4 months ago)
Sergio
09:08 AM
We rolled out this implementation too, using a regex field name: title_.*
By using wildcards we can expand to future locales without major collection changes 🙂
Mridul
09:09 AM
Oh this is great…makes things so much better…can I index it as such?
Sergio
09:11 AM
You then index documents with title_en or title_es, and the new field is generated automatically.
Sergio
09:12 AM
And we have some logic to query Typesense by the locale the user requires
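One way such locale logic could look, as a sketch only: the helper name `build_search_params` and the fallback to English are assumptions, not something Sergio described. `q` and `query_by` are standard Typesense search parameters.

```python
def build_search_params(query, locale, base_fields=("title",), fallback_locale="en"):
    """Build Typesense search parameters that target the user's locale fields."""
    fields = [f"{base}_{locale}" for base in base_fields]
    if locale != fallback_locale:
        # Assumption: also search the fallback locale so untranslated
        # documents can still match.
        fields += [f"{base}_{fallback_locale}" for base in base_fields]
    return {"q": query, "query_by": ",".join(fields)}


params = build_search_params("bonjour", "fr")
```

The resulting dictionary can be passed as the search parameters of a Typesense search request.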
Mridul
09:15 AM
How can I pass different locales when I define fields with regex? E.g. text_chinese needs the zh tokenizer, but text_en needs a different one
Mridul
09:17 AM
We can always define a new collection and start using it when we add a new language; however, that would mean a complete reindexing
Sergio
09:17 AM
Those fields you would need to define explicitly
Order matters, so you could do:
title_ja -> ja
title_zh -> zh
title_.* -> generic
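The ordering above could be written as a collection schema roughly like this. The collection name `documents` and the `optional` flags are assumptions; `name`, `type`, `locale` and `optional` are regular Typesense field properties. The `resolve_field` helper is not part of Typesense: it only emulates the "first matching definition wins" behaviour described in the thread, to show why the specific fields go before the wildcard.

```python
import re

# Explicit locale-specific fields first, generic wildcard last.
schema = {
    "name": "documents",  # hypothetical collection name
    "fields": [
        {"name": "title_ja", "type": "string", "locale": "ja", "optional": True},
        {"name": "title_zh", "type": "string", "locale": "zh", "optional": True},
        {"name": "title_.*", "type": "string", "optional": True},
    ],
}


def resolve_field(field_name, fields):
    """Illustration only: return the first definition whose name pattern matches."""
    for field in fields:
        if re.fullmatch(field["name"], field_name):
            return field
    return None


ja_field = resolve_field("title_ja", schema["fields"])
fr_field = resolve_field("title_fr", schema["fields"])
```

Here `title_ja` resolves to the definition carrying the `ja` locale, while `title_fr` falls through to the generic `title_.*` definition with the default tokenizer.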
Mridul
09:18 AM
Okay…seeing that there are only a handful of tokenizers currently, we can do a comprehensive one without much overhead
Mridul
09:19 AM
Thanks a tonne @Sergio
Sergio
09:21 AM
Currently we rebuild the whole collection when adding fields, and reindex the whole database.
There is an option to add a field to an existing collection, but it still requires indexing all the data.
Since there is no "collection migration management", we just avoid conflicts by recreating everything and then moving the alias.
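A rough sketch of that rebuild-and-swap flow, assuming the official `typesense` Python client is passed in as `client`. The function name, the timestamp-based collection naming and the error handling are assumptions, not Sergio's actual setup.

```python
import time


def rebuild_and_swap(client, alias, schema, documents):
    """Create a fresh collection, reindex everything, then repoint the alias."""
    # Assumption: version collections by suffixing the alias with a timestamp.
    new_name = f"{alias}_{int(time.time())}"
    client.collections.create({**schema, "name": new_name})

    results = client.collections[new_name].documents.import_(
        documents, {"action": "create"}
    )
    failed = [r for r in results if not r.get("success")]
    if failed:
        # Leave the alias on the old collection if anything failed to import.
        raise RuntimeError(f"{len(failed)} documents failed to import")

    # Searches that go through the alias now hit the new collection.
    client.aliases.upsert(alias, {"collection_name": new_name})
    return new_name
```

The application always queries the alias, so the swap is a single step from its point of view; the previous collection can be deleted once the new one is confirmed healthy.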