Custom Token Separator for Typesense Number and Unit Queries
TLDR: Juan asked whether Typesense supports a custom token separator that splits numbers from units (for example, "40mg" into "40" and "mg"). Jason suggested pre-processing the data before indexing or using ML entity extraction, while Gustavo mentioned GPT-4 for entity extraction.
Jun 23, 2023 (3 months ago)
Juan
08:56 PM I have a really big collection (1.6M records) with 10 fields that contain text info about some buy orders. Some of those string fields contain sentences with medicine specifications such as the dose amount, for example "omeprazole 40mg". The thing is, I can't control how it is written at the data source, and it is not consistent: sometimes it says "omeprazole 40mg", sometimes "omeprazole 40 mg", and sometimes "omeprazole40mg". When I search for "q=omeprazole 40", I want all of those records to be returned, but I am only getting those that have "40 mg" separated, due to how indexing works.
So the question is: can I have a custom token separator that indexes the word "40mg" as two tokens, "40" and "mg"? Or, more generally, any number and unit of measure, like "25ml", so that I can query just the number and get the result?
Juan
08:59 PM Or I could use the filter_by field, doing stuff like "filter_by=characteristics: [40, 40mg] || name: [40, 40mg]".
But I feel this is kind of cumbersome and can lead to issues.
Any recommendations?
Gustavo
09:24 PM
Jason
09:29 PM If these are standard units of measure, I’d also recommend what Gustavo mentioned: you can do this tokenization before indexing the data into Typesense. Look for a number followed by a unit of measure (mg, ml, tab, etc.), add a space before the unit, and index that in Typesense.
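The pre-processing step Jason describes could look roughly like the following Python sketch. It is only an illustration, not something posted in the thread: the `UNITS` list and the `normalize_dose` helper are hypothetical names, and the unit list would need to be extended to match the real data. It only splits a letter from a digit when a known unit follows, so codes like "B12" are left alone.

```python
import re

# Hypothetical unit list (assumption): extend it to cover the units in your data.
UNITS = ("mg", "ml", "tab", "mcg", "g")

# An optional letter glued to the number, the number itself, optional
# whitespace, then a known unit that ends at a word boundary.
# Longer units come first so "mcg" is tried before "g".
_DOSE = re.compile(
    r"(?P<lead>[A-Za-z]?)(?P<number>\d+(?:\.\d+)?)\s*(?P<unit>"
    + "|".join(sorted(UNITS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def _split(match):
    lead, number, unit = match.group("lead", "number", "unit")
    # Separate a letter that was glued to the number ("omeprazole40mg").
    prefix = lead + " " if lead else ""
    return f"{prefix}{number} {unit}"


def normalize_dose(text):
    """Rewrite '40mg', '40 mg' and 'omeprazole40mg' to a spaced '40 mg' form."""
    return _DOSE.sub(_split, text)


for raw in ("omeprazole 40mg", "omeprazole 40 mg", "omeprazole40mg"):
    print(normalize_dose(raw))
```

Running each record's text fields through a function like this before indexing means all three spellings end up with "40" as its own token, so a query for the bare number can match them.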
Jason
09:30 PM By default, "omeprazole 40" should return "omeprazole 40mg" (since it’s a prefix match) and "omeprazole 40 mg".
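To see why prefix matching covers "40mg" but not "omeprazole40mg", here is a toy Python model. It is a deliberate simplification and not Typesense's actual implementation: it assumes whitespace tokenization, exact matching for every query word except the last, and prefix matching for the last word, and it ignores typo tolerance and ranking. The `matches` function is a hypothetical name.

```python
def matches(query, document):
    """Toy model: all query words but the last must equal a document token;
    the last query word only needs to be a prefix of some document token."""
    doc_tokens = document.lower().split()
    query_tokens = query.lower().split()
    if not query_tokens:
        return False
    *exact, last = query_tokens
    if any(token not in doc_tokens for token in exact):
        return False
    return any(token.startswith(last) for token in doc_tokens)


for doc in ("omeprazole 40mg", "omeprazole 40 mg", "omeprazole40mg"):
    print(doc, "->", matches("omeprazole 40", doc))
```

In this model the first two records match, while "omeprazole40mg" does not, because the whole string is indexed as a single token and "omeprazole" never appears on its own.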
Juan
09:52 PM omeprazole vial 40 -"open contract"
("vial" was interfering with the search, and that's why I wasn't getting the correct results)
Juan
09:54 PM
Jason
09:55 PM Edit: looks like the Google one only supports a limited set of entities. Maybe there are other APIs like this for units of measure.
Jason
09:56 PM No, it’s not possible to do this in Typesense for performance reasons.
Juan
09:57 PM
Gustavo
11:00 PM