#community-help

Custom Token Separator for Typesense Number and Unit Queries

TLDR Juan asked about custom token separators in Typesense for numbers and units (e.g. "40mg"). Gustavo and Jason suggested sanitizing the data before indexing; Jason also suggested ML entity extraction, and Gustavo noted GPT-4 could handle the extraction.


Solved
Jun 23, 2023 (3 months ago)
Juan
08:56 PM
Hello guys, thank you for such a wonderful tool as Typesense! I was wondering if there is any way to set a custom token separator on my collection so that numbers are separated from words. My use case is the following:

I have a really big collection (1.6M records) with 10 fields that contain text info about some buy orders. In some of those string fields there are sentences containing medicine specifications such as dose amounts, for example "omeprazole 40mg". The thing is, I can't control how it is written at the data source and it is not consistent: sometimes it says "omeprazole 40mg", sometimes "omeprazole 40 mg", sometimes "omeprazole40mg". When I search for q=omeprazole 40 I want all of those records to be returned, but I am only getting the ones that have "40 mg" separated, due to how indexing works.

So the question is: can I have a custom token separator that indexes the word "40mg" as two tokens, "40" and "mg"? Or any number plus unit of measure, like "25ml", so that I can query just the number and get the result?
Juan
08:59 PM
So far I have been able to achieve this by querying for "omeprazole 40mg" and using the split_join_tokens param (even though I don't want to split the word "omeprazole", only "40mg"),

or the filter_by field, doing things like filter_by=characteristics: [40, 40mg] || name: [40, 40mg].
But I feel this is kind of cumbersome and can lead to issues.
Any recommendations?
Gustavo
09:24 PM
What if you do that sanitization when you're indexing?
Jason
09:29 PM
Typesense only supports customizing the tokenization based on special characters (and spaces by default).

If these are standard units of measure, I’d also recommend what Gustavo mentioned - you can do this tokenization before indexing the data into Typesense: look for a number followed by a unit of measure (mg, ml, tab, etc.), add a space before the unit, and then index that in Typesense.
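A minimal sketch of that pre-processing step, assuming an illustrative list of units. (Typesense's token_separators collection setting only accepts literal special characters, so a digit-to-letter boundary like this has to be split before indexing, as Jason says.)

```python
import re

# Illustrative unit list; extend to whatever units actually appear in the data.
UNITS = r"(?:mg|mcg|g|ml|l|iu|tab)"

def normalize_doses(text: str) -> str:
    # "omeprazole40mg" -> "omeprazole 40mg": break a letter-to-digit boundary.
    # (Caveat: this also splits codes like "B12"; restrict it if that matters.)
    text = re.sub(r"(?<=[A-Za-z])(?=\d)", " ", text)
    # "40mg" -> "40 mg": separate a number from a trailing unit of measure.
    return re.sub(rf"(\d+)\s*({UNITS})\b", r"\1 \2", text, flags=re.IGNORECASE)

# All three inconsistent spellings from the thread normalize the same way.
for s in ("omeprazole 40mg", "omeprazole 40 mg", "omeprazole40mg"):
    assert normalize_doses(s) == "omeprazole 40 mg"
```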
Jason
09:30 PM
> When i search for “q=omeprazole 40” i want all of those records to be returned, but i am only getting those that have “40 mg” separated due to how indexing works.
By default, omeprazole 40 should return both omeprazole 40mg (since it’s a prefix match) and omeprazole 40 mg
Juan
09:52 PM
Thank you guys, I will look into how I can do that sanitization beforehand; I keep the data in a Postgres materialized view and sync it with Airbyte. And Jason, you are right, I made a mistake in my original question: the query I am actually using is omeprazole vial 40 -"open contract" ("vial" was interfering with the search, and that's why I wasn't getting the correct results)
Juan
09:54 PM
btw, is there a way to exclude a substring (not an exact match against the full string) from a specific field, instead of excluding it from the whole query?
Jason
09:55 PM
Another thing to consider for your use-case could be to run it through an entity detection ML algorithm to extract key pieces of information and index them in separate fields: https://developers.google.com/ml-kit/language/entity-extraction

Edit: looks like the Google one only supports a limited set of entities. Maybe there are other APIs like this for units of measure
Juan
09:57 PM
Thank you so much! I will look into that 😄
Gustavo
11:00 PM
Pretty sure GPT-4 can be used reliably for entity extraction in this case.
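For anyone curious, a hedged sketch of that approach with the OpenAI Python SDK. The prompt, model choice, and JSON shape are assumptions, not something specified in the thread, and the model's output should be validated before indexing:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_entities(description: str) -> dict:
    # Ask the model for structured fields; the prompt here is illustrative.
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": (
                "Extract the drug name, dose amount, and dose unit from the text. "
                'Reply with JSON only, like {"drug": ..., "amount": ..., "unit": ...}.'
            )},
            {"role": "user", "content": description},
        ],
    )
    # Sketch only: real code should handle non-JSON replies and bad values.
    return json.loads(resp.choices[0].message.content)

# e.g. extract_entities("omeprazole40mg")
#   -> {"drug": "omeprazole", "amount": 40, "unit": "mg"}
```

The extracted fields could then be indexed as separate Typesense fields, which also makes Juan's filter_by queries straightforward.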