# community-help
y
Hello, at my company we intend to use Typesense for searching content in a LOT of languages, and I mean including ancient languages that have not been in use for centuries. My question is how Typesense performs in that regard. I'm fine with exact matching for the rarer languages, as long as it searches them at all
j
Do these languages have Unicode character representations and do they have spaces between words?
y
Yes, and no not necessarily
Well I mean for some characters we may be inventing our own font that will use some of the user-defined Unicode regions or whatever, but it's still some unicode codepoint
In fact CJK languages don't have spaces between words either
j
Right, so if a language has Unicode characters and has spaces between words, then it will work out of the box with Typesense. For CJK we’ve had to add a special tokenizer to handle them. So other languages that don’t have spaces between words would also need special tokenizers and won’t work out of the box
y
Well CJK's tokenizer does what, approximate word splitting locations?
Like a tokenizer's purpose is to convert text into an array of words, right?
j
Yup
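(A minimal sketch of that idea, assuming a hypothetical `tokenize` helper: split on whitespace when the script has word boundaries, and fall back to one token per codepoint when it doesn't, which amounts to exact character-level matching.)

```python
# Sketch of what a tokenizer does: turn raw text into an array of words.
# For space-delimited languages a whitespace split suffices; for scripts
# without word boundaries, fall back to one token per Unicode codepoint.

def tokenize(text: str) -> list[str]:
    if " " in text:
        return text.split()
    # No spaces: treat each codepoint as its own token (exact matching).
    return list(text)

print(tokenize("ancient sacred text"))  # → ['ancient', 'sacred', 'text']
print(tokenize("古文書"))                # → ['古', '文', '書']
```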
y
Let's say my tokenizer is only correct 60% of the time about where it splits words, would that be sufficient for basic searching capabilities?
j
You would have to define what “sufficient” means though…
Side note: we don’t support completely user defined tokenizers at the moment
y
Basically for those exotic languages I'm fine with just exact matching
Yeah I can fork typesense that's fine
j
Just remembered this - You might want to also consider using the pre_segmented_query parameter
And then apply your tokenizer in your application side, and index pre-tokenized data into Typesense
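(A sketch of that flow, where `my_tokenizer` is a placeholder for whatever per-language segmenter you build: join your tokens with spaces when indexing, and pass `pre_segmented_query: true` so Typesense splits the query only on spaces instead of re-tokenizing it.)

```python
# Hypothetical segmenter: codepoint-level splitting for a language
# without word boundaries.
def my_tokenizer(text: str) -> list[str]:
    return list(text)

# Index pre-tokenized data: store the space-joined tokens.
document = {"id": "1", "content": " ".join(my_tokenizer("古文書"))}

# Search with pre_segmented_query so Typesense treats q as already segmented.
search_params = {
    "q": " ".join(my_tokenizer("古文")),
    "query_by": "content",
    "pre_segmented_query": True,
}

print(document["content"])   # → 古 文 書
print(search_params["q"])    # → 古 文
```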
y
oh that sounds much easier
I mean, generally speaking, what would happen if I insert a non-tokenized document? To start we'll probably have no custom tokenizers and will add them gradually depending on language popularity, so would those documents be completely unsearchable or what?
Like a giant wall of unicode codepoints with no spaces between them
j
By default we’ll try using the English tokenizer
And if it’s outside the English character space, I think we leave the text as is, except for accented characters which we normalize
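(A rough illustration of the normalization described here, using Python's `unicodedata`; this is an approximation for intuition, not Typesense's actual implementation: accented Latin characters fold to their base letters, while text outside that range passes through unchanged.)

```python
import unicodedata

def normalize_accents(text: str) -> str:
    # Decompose accented characters into base letter + combining mark,
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(normalize_accents("résumé"))  # → resume
print(normalize_accents("古文書"))   # → 古文書 (left as is)
```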
y
So the outcome would be what, unsearchable document? Or would Typesense at least try exact matching?
j
Yeah it should try exact matching Unicode character by character
y
Sounds pretty good, thank you very much