Custom Token Separator for Typesense Number and Unit Queries
TLDR: Juan asked about custom token separators in Typesense for numbers and units. Jason suggested pre-processing the data or using ML entity extraction, while Gustavo mentioned GPT-4 for entity extraction.
Jun 23, 2023 (5 months ago)
Juan
08:56 PM I have a really big collection (1.6M records) with 10 fields that contain text about buy orders. Some of those string fields contain sentences with medicine specifications such as the dose amount, for example "omeprazole 40mg". I can't control how this is written in the data source, and it is not consistent: sometimes it says "omeprazole 40mg", sometimes "omeprazole 40 mg", and sometimes "omeprazole40mg". When I search for "q=omeprazole 40" I want all of those records to be returned, but I am only getting the ones with "40 mg" separated, because of how indexing works.
So the question is: can I have a custom token separator that indexes the word "40mg" as two tokens, "40" and "mg"? The same would apply to any number and unit of measure, like "25ml", so that I can query just the number and get the result.
Juan
08:59 PM Or the filter_by field, doing stuff like "filter_by=characteristics: [40, 40mg] || name: [40, 40mg]"
But I feel this is kind of cumbersome and can lead to issues.
Any recommendations?
Gustavo
09:24 PM
Jason
09:29 PM If these are standard units of measure, I’d also recommend what Gustavo mentioned: you can do this tokenization before indexing the data into Typesense. Look for a number followed by a unit of measure (mg, ml, tab, etc.), add a space before the unit, and index that in Typesense.
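A minimal sketch of that pre-processing step, assuming Python and a hand-maintained unit list. The `UNITS` tuple and the `normalize_units` helper are hypothetical names for illustration and are not part of Typesense. The idea is to run this on each text field before sending documents to the indexing API.

```python
import re

# Hypothetical unit list; extend it to cover the units that appear in your data.
UNITS = ("mg", "ml", "mcg", "tab", "g", "l")

# Try longer units first so that "mg" is preferred over "g".
_UNIT_ALTERNATION = "|".join(
    re.escape(u) for u in sorted(UNITS, key=len, reverse=True)
)

# A number (optionally with a decimal part), optional whitespace, then a unit
# that ends at a word boundary (so "40 grams" is left untouched).
_NUMBER_UNIT = re.compile(
    rf"(\d+(?:\.\d+)?)\s*({_UNIT_ALTERNATION})\b", re.IGNORECASE
)


def normalize_units(text: str) -> str:
    """Put spaces around a number and its unit: 'omeprazole40mg' -> 'omeprazole 40 mg'."""
    spaced = _NUMBER_UNIT.sub(r" \1 \2", text)
    # Collapse the extra whitespace introduced by the substitution.
    return " ".join(spaced.split())


if __name__ == "__main__":
    for raw in ("omeprazole 40mg", "omeprazole 40 mg", "omeprazole40mg"):
        print(normalize_units(raw))
```

With this sketch, all three spellings from the question are indexed as the same three tokens, so a query for the bare number can match each of them. Names that merely contain digits without a listed unit after them are left unchanged.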
Jason
09:30 PM By default, omeprazole 40 should return omeprazole 40mg (since it’s a prefix match) and omeprazole 40 mg.
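To make the prefix-match point concrete, here is a toy model in Python. It is not Typesense's implementation: it assumes whitespace tokenization, requires every query word except the last to match a document token exactly, and treats the last query word as a prefix. Typo tolerance, token dropping, and ranking are ignored, and the function names are made up for illustration.

```python
def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a simplification of real tokenization)."""
    return text.lower().split()


def toy_prefix_match(query: str, document: str) -> bool:
    """Return True if the document matches the query under the toy rules above."""
    query_tokens = tokenize(query)
    doc_tokens = tokenize(document)
    if not query_tokens:
        return False
    *exact_words, last_word = query_tokens
    # Every word but the last must appear as a whole token in the document.
    if not all(word in doc_tokens for word in exact_words):
        return False
    # The last word only needs to be the start of some document token.
    return any(token.startswith(last_word) for token in doc_tokens)


if __name__ == "__main__":
    for doc in ("omeprazole 40mg", "omeprazole 40 mg", "omeprazole40mg"):
        print(doc, "->", toy_prefix_match("omeprazole 40", doc))
```

Under these assumptions the first two spellings match, while "omeprazole40mg" does not, because it is a single token that never matches the word "omeprazole" exactly. That remaining case is the one the pre-indexing normalization is meant to cover.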
Juan
09:52 PM omeprazole vial 40 -"open contract"
(vial was interfering with the search, and that's why I wasn't getting the correct results)
Juan
09:54 PM
Jason
09:55 PM Edit: looks like the Google one only supports a limited set of entities. Maybe there are other APIs like this for units of measure.
Jason
09:56 PM No, it’s not possible to do this in Typesense for performance reasons.
Juan
09:57 PM
Gustavo
11:00 PM