Hello guys, thank you for such a wonderful tool that Typesense is! (typesense #community-help)

Juan Vera

06/23/2023, 8:56 PM

Hello guys, thank you for such a wonderful tool that Typesense is! I was wondering if there is any way to define a custom token separator in my collection so that numbers are separated from words. My use case is the following: I have a really big collection (1.6M records) with 10 fields that contain text info about some buy orders. In some of those string fields, there are sentences that contain medicine specifications such as dose amount, for example "omeprazole 40mg". The thing is, I can't control how it is written in the data source and it is not consistent, e.g. sometimes it says "omeprazole 40mg", "omeprazole 40 mg" or "omeprazole40mg". When I search for "q=omeprazole 40" I want all of those records to be returned, but I am only getting those that have "40 mg" separated, due to how indexing works. So the question is: can I have a custom token separator that indexes the word "40mg" as 2 tokens, "40" and "mg"? Or any number and unit of measure, like "25ml", so that I can query just the number and get the result?

Juan Vera

06/23/2023, 8:59 PM

So far I have been able to achieve this by querying for "omeprazole 40mg" and using the split_join_tokens param (even though I don't want to split the word omeprazole, only the 40mg), or with the filter_by param, doing stuff like "filter_by=characteristics: [40, 40mg] || name: [40, 40mg]". But I feel this is kind of cumbersome and can lead to issues. Any recommendations?

Gustavo

06/23/2023, 9:24 PM

What if you do that sanitization when you're indexing?

Jason Bosco

06/23/2023, 9:29 PM

Typesense only supports customizing the tokenization based on special characters (and spaces by default). If these are standard units of measure, I’d also recommend what Gustavo mentioned: you can do this tokenization before indexing the data into Typesense. Look for a number followed by a unit of measure (mg, ml, tab, etc.), add a space before the unit, and index that in Typesense.
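To make the pre-indexing suggestion concrete, here is a minimal Python sketch of that sanitization step. It is only an illustration under assumptions: the `UNITS` list and the `sanitize` helper are hypothetical names, not part of Typesense, and the unit list would need to be extended to match the real data.

```python
import re

# Hypothetical unit list (an assumption); extend it to cover the units in your data.
UNITS = ("mg", "mcg", "ml", "g", "tab")

# A number (optionally with a decimal part) followed, with or without
# whitespace, by one of the known units as a whole word.
_NUMBER_UNIT = re.compile(
    r"(\d+(?:[.,]\d+)?)\s*(" + "|".join(re.escape(u) for u in UNITS) + r")\b",
    re.IGNORECASE,
)


def sanitize(text: str) -> str:
    """Put spaces around every number+unit pair, then collapse whitespace."""
    spaced = _NUMBER_UNIT.sub(r" \1 \2 ", text)
    return " ".join(spaced.split())


for raw in ("omeprazole 40mg", "omeprazole 40 mg", "omeprazole40mg"):
    print(sanitize(raw))  # each prints: omeprazole 40 mg
```

Running this over the relevant string fields before the data is sent to Typesense would make all three spellings index as the same tokens. Numbers that are not followed by a known unit (for example "b12") are left untouched.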

Jason Bosco

06/23/2023, 9:30 PM

When i search for “q=omeprazole 40” i want all of those records to be returned, but i am only getting those that have “40 mg” separated due to how indexing works.

By default, "omeprazole 40" should return "omeprazole 40mg" (since it’s a prefix match) and "omeprazole 40 mg".
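A rough way to picture the prefix-match behavior is the toy model below. It is not Typesense's actual implementation: it assumes plain whitespace tokenization, treats only the last query token as a prefix, and ignores typo tolerance, token dropping and ranking. The `tokenize` and `matches` helpers are hypothetical names used only for this illustration.

```python
def tokenize(text: str) -> list:
    """Simplified tokenizer (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


def matches(query: str, document: str) -> bool:
    """Toy model: every query token must equal a document token,
    except the last one, which only needs to be a prefix of one."""
    query_tokens = tokenize(query)
    doc_tokens = tokenize(document)
    if not query_tokens:
        return False
    *full_tokens, last_token = query_tokens
    return all(t in doc_tokens for t in full_tokens) and any(
        d.startswith(last_token) for d in doc_tokens
    )


print(matches("omeprazole 40", "omeprazole 40mg"))   # True: "40" is a prefix of "40mg"
print(matches("omeprazole 40", "omeprazole 40 mg"))  # True: "40" is its own token
print(matches("omeprazole 40", "omeprazole40mg"))    # False: the text is a single token
```

Under this model, only the fully joined spelling "omeprazole40mg" is missed, which is the case that needs cleanup before indexing.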

Juan Vera

06/23/2023, 9:52 PM

Thank you guys, I will look into how I can do that sanitization beforehand; I keep the data in a Postgres materialized view and sync it with Airbyte. And @Jason Bosco, you are right, I made a mistake in my original question, because the query I am using is actually

omeprazole vial 40 -"open contract"

(vial was interfering with the search and that's why I wasn't getting the correct results)

Juan Vera

06/23/2023, 9:54 PM

btw, is there a way to exclude a substring (not an exact match with the full string) from a specific field, instead of excluding it from the whole query?

Jason Bosco

06/23/2023, 9:55 PM

Another thing to consider for your use case could be to run it through an entity detection ML algorithm to extract key pieces of information and index them in separate fields: https://developers.google.com/ml-kit/language/entity-extraction

Edit: looks like the Google one only supports a limited set of entities. Maybe there are other APIs like this for units of measure.
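As a much simpler stand-in for an ML entity extractor, the idea of indexing key pieces of information in separate fields can be sketched with a plain regular expression. This swaps pattern matching in for entity detection, so it only covers the number-plus-unit case. The field names `dose_amount` and `dose_unit`, the unit list, and the `extract_dose` helper are all hypothetical.

```python
import re

# Hypothetical unit list (an assumption); a real extractor would need many more.
_DOSE = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|mcg|ml|g)\b", re.IGNORECASE)


def extract_dose(text: str) -> dict:
    """Return the first number+unit pair found in text as separate fields,
    or an empty dict when nothing matches."""
    match = _DOSE.search(text)
    if match is None:
        return {}
    return {
        "dose_amount": float(match.group(1)),
        "dose_unit": match.group(2).lower(),
    }


# Enrich a record before indexing; the original text field is kept as-is.
record = {"name": "omeprazole40mg vial"}
record.update(extract_dose(record["name"]))
print(record)
# {'name': 'omeprazole40mg vial', 'dose_amount': 40.0, 'dose_unit': 'mg'}
```

With the dose stored in its own numeric field, it could be filtered on directly instead of relying on how the free-text field was tokenized.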

Jason Bosco

06/23/2023, 9:56 PM

https://typesense-community.slack.com/archives/C01P749MET0/p1687557265042089?thread_ts=1687553818.709219&cid=C01P749MET0

No, it’s not possible to do this (excluding a substring from a specific field only) in Typesense, for performance reasons.

👍 1

Juan Vera

06/23/2023, 9:57 PM

Thank you so much! I will look into that 😄

Gustavo

06/23/2023, 11:00 PM

Pretty sure GPT-4 can be used reliably for entity extraction in this case.
