# community-help
j
Hello guys, thank you for such a wonderful tool that Typesense is! I was wondering if there is any way to define a custom token separator in my collection so that numbers are separated from words. My use case is the following: I have a really big collection (1.6M records) with 10 fields that contain text info about some buy orders. In some of those string fields, there are sentences that contain medicine specifications such as dose amount, for example "omeprazole 40mg". The thing is, I can't control how it is written in the data source and it is not consistent, e.g. sometimes it can say "omeprazole 40mg", "omeprazole 40 mg" or "omeprazole40mg". When I search for "q=omeprazole 40" I want all of those records to be returned, but I am only getting those that have "40 mg" separated, due to how indexing works. So the question is: can I have a custom token separator that indexes the word "40mg" as 2 tokens, "40" and "mg"? Or any number and unit of measure, like "25ml", so that I can query just the number and get the result?
So far I have been able to achieve this by querying for "omeprazole 40mg" and using the split_join_tokens param (even though I don't want to split the word omeprazole, only the 40mg) or the filter_by param, doing stuff like "filter_by=characteristics: [40, 40mg] || name: [40, 40mg]". But I feel this is kind of cumbersome and can lead to issues. Any recommendations?
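For reference, the filter_by workaround described above could be generated rather than hand-written. This is only a sketch: the helper name `build_search_params` is hypothetical, and using `name` and `characteristics` as the `query_by` fields is an assumption based on the filter in the message. It only builds a parameter dict and does not call Typesense.

```python
def build_search_params(query, number, unit, fields=("characteristics", "name")):
    """Build search parameters that filter on both "40" and "40mg" style values.

    Hypothetical helper: the field names are assumptions taken from the
    filter_by example in the thread.
    """
    # Both spellings of the dose: the bare number and number+unit joined.
    variants = [number, f"{number}{unit}"]
    value_list = ", ".join(variants)
    # One "field: [a, b]" clause per field, OR-ed together with "||".
    filter_by = " || ".join(f"{field}: [{value_list}]" for field in fields)
    return {
        "q": query,
        "query_by": ",".join(fields),  # assumption: search the same fields
        "filter_by": filter_by,
    }


params = build_search_params("omeprazole", "40", "mg")
print(params["filter_by"])
```

This reproduces the filter string from the message, which also shows why it gets cumbersome: every new unit or spelling needs another variant in the list.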
g
What if you do that sanitization when you're indexing?
j
Typesense only supports customizing the tokenization based on special characters (and spaces by default). If these are standard units of measure, I’d also recommend what Gustavo mentioned: you can do this tokenization before indexing the data into Typesense. Look for a number followed by a unit of measure (mg, ml, tab, etc.), add a space before the unit, and index that in Typesense.
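A minimal sketch of that pre-indexing sanitization, assuming Python and a regular expression. The unit list and the `sanitize` name are hypothetical and should be adapted to the data. Splitting a letter from a following digit (to handle "omeprazole40mg") is an extra assumption beyond the suggestion above, and it would also split codes like "B12".

```python
import re

# Hypothetical unit list; extend it to match the units in your data.
UNITS = ("mg", "ml", "mcg", "g", "kg", "l", "tab")

# Try longer units first so that "mg" wins over "g".
_UNIT_ALTERNATION = "|".join(sorted(map(re.escape, UNITS), key=len, reverse=True))

# Zero-width position between a letter and a digit, e.g. "omeprazole|40".
_LETTER_DIGIT = re.compile(r"(?<=[A-Za-z])(?=\d)")

# A digit, optional whitespace, then a known unit ending at a word boundary.
_NUMBER_UNIT = re.compile(rf"(\d)\s*({_UNIT_ALTERNATION})\b", re.IGNORECASE)


def sanitize(text):
    """Return text with exactly one space between a number and its unit."""
    text = _LETTER_DIGIT.sub(" ", text)
    return _NUMBER_UNIT.sub(r"\1 \2", text)


for raw in ("omeprazole 40mg", "omeprazole 40 mg", "omeprazole40mg"):
    print(sanitize(raw))
```

All three spellings from the original question normalize to the same string, so the number and the unit end up as separate tokens with the default space-based tokenization.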
> When I search for “q=omeprazole 40” I want all of those records to be returned, but I am only getting those that have “40 mg” separated due to how indexing works.

By default `omeprazole 40` should return `omeprazole 40mg` (since it’s a prefix match) and `omeprazole 40 mg`.
j
Thank you guys, I will look into how I can do that sanitization beforehand; I keep the data in a Postgres materialized view and sync it with Airbyte. And @Jason Bosco you are right, I made a mistake in my original question, because the query I am using is actually `omeprazole vial 40 -"open contract"` (vial was interfering with the search and that's why I wasn't getting the correct results).
Btw, is there a way to exclude a substring (not an exact match with the full string) from a specific field, instead of excluding it from the whole query?
j
Another thing to consider for your use case could be to run it through an entity detection ML algorithm to extract key pieces of information and index them in separate fields: https://developers.google.com/ml-kit/language/entity-extraction Edit: looks like the Google one only supports a limited set of entities. Maybe there are other APIs like this for units of measure.
j
Thank you so much! I will look into that 😄
g
Pretty sure GPT-4 can be used reliably for entity extraction in this case.