Best Practices for Multisearch Across Collections and Removing Non-important Words

TLDR: robert asked for best practices on multisearching across collections and deduping the results. He later asked how to lessen the importance of trivial words in the search results. Kishore Nallan suggested treating those words as stop words and using a proper Q&A model to handle semantic queries.

Nov 12, 2022 (13 months ago)
Photo of md5-0ca37054c6c9042aa04fcfb92cc7d99c
01:51 PM
What's the best practice for doing multiple keyword searches in multiple collections? How can I dedupe the results across the multisearch?

For example, I have the keywords "programming", "organization", and "files". I have two collections in which I want to search these keywords individually. Then I want to share the results by grouping the keywords together and deduping their results within a collection. I have two different UI views sharing the results from the two different collections.

Is this all client side manipulation?
Photo of md5-8813087cccc512313602b6d9f9ece19f
01:58 PM
When you say dedupe, does that mean the same document exists in multiple collections?
Photo of md5-0ca37054c6c9042aa04fcfb92cc7d99c
01:58 PM
To add to this, is there a feature where commas could do this server side?
01:59 PM
If I do a multisearch on the collection and it returns the same document for multiple searches, how do I group the multisearches together? I guess I would just insert into a set keyed on id. That's easy to do client side. Ignore that part.
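The client-side dedupe described above could look roughly like this. It is a minimal sketch, assuming every search in the response targets the same collection (document ids are only unique within a collection) and that the response follows the multi-search shape of a top-level `results` list whose entries each carry `hits` with a `document`. The function name and the sample data are made up for illustration.

```python
from typing import Any


def dedupe_hits(multi_search_response: dict[str, Any]) -> list[dict[str, Any]]:
    """Merge the hits of every search in one response, keeping one entry per document id."""
    seen_ids: set[str] = set()
    unique_docs: list[dict[str, Any]] = []
    for result in multi_search_response.get("results", []):
        for hit in result.get("hits", []):
            doc = hit["document"]
            if doc["id"] not in seen_ids:
                seen_ids.add(doc["id"])
                unique_docs.append(doc)
    return unique_docs


# Hand-written sample shaped like a multi-search response (not real server output).
sample_response = {
    "results": [
        {"hits": [{"document": {"id": "1", "text": "programming efforts"}}]},
        {
            "hits": [
                {"document": {"id": "1", "text": "programming efforts"}},
                {"document": {"id": "2", "text": "file organization"}},
            ]
        },
    ]
}
unique = dedupe_hits(sample_response)
```

The first occurrence of each id wins, so the order of the searches decides which copy of a repeated document is kept.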
02:02 PM
Let me make a clearer example:

Documents in collection contain paragraphs of text. I'm trying to search paragraphs for keywords.

For example, a document might have a paragraph: "My organization services the poor and unneeded. We do that by providing clothes & shelter. We also support programming efforts by xyz."

The user then has a question:

"What is your organization's mission?"

On the client side I've explored breaking down the question into semantic keywords like: organization, mission.

I then want to search "organization, mission" and match individually on the keywords to retrieve the above paragraph.

There are multiple collections that contain different "paragraph-like" snippets as the one above but serve different UI purposes.

How do I best utilize Typesense to solve this particular use case?
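One way to express "search each keyword individually in each collection" is a single multi-search request with one entry per (collection, keyword) pair. The sketch below only builds the request body. The collection names `faq_paragraphs` and `profile_paragraphs` and the field name `text` are hypothetical, and sending the request is left out.

```python
from typing import Any


def build_multi_search(
    keywords: list[str], collections: list[str], query_by: str
) -> dict[str, Any]:
    """Build one search entry per (collection, keyword) pair."""
    searches = [
        {"collection": collection, "q": keyword, "query_by": query_by}
        for collection in collections
        for keyword in keywords
    ]
    return {"searches": searches}


# Hypothetical collection and field names, for illustration only.
payload = build_multi_search(
    keywords=["organization", "mission"],
    collections=["faq_paragraphs", "profile_paragraphs"],
    query_by="text",
)
```

The dictionary is meant to be sent as the JSON body of a `POST /multi_search` request. As far as I know, the entries in the response's `results` list come back in the same order as `searches`, which would make it possible to map each result back to its collection and keyword, but that is worth confirming against the docs for your version.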
02:09 PM
Matches on words like is, yours, the, etc. aren't really valuable. It's more of a semantic search combined with the text search of Typesense. Just curious if y'all have use case examples of that. I'd imagine I'm not alone in leveraging Typesense in a similar way 🙂
Kishore Nallan
02:40 PM
This requires a question-answering machine learning model. See this tutorial for reference: https://medium.com/analytics-vidhya/question-answering-system-with-bert-ebe1130f8def
Photo of md5-0ca37054c6c9042aa04fcfb92cc7d99c
02:43 PM
Yeah, I think a simpler model right now is what I'm doing:
1. User has question
2. Use OpenAI to parse a question like "What is your mission statement" to get the output "mission statement".
3. Take the keyword and search Typesense across multiple collections
4. Show results
The problem with the above is when the question is something like "What is the programming of your organization. How do you ensure equal results? What is the real answer to god?"

And we parse that into "programming, ensure equal results, answer to god" and we want to search all of those keywords equally against multiple collections.

My questions are:
1. How to lessen the weight of non-important words (the, and, your, etc.) in results?
2. Is there a best practice for multisearch in the above scenario?
02:45 PM
If "answer to god" is the phrase to search, and the only thing it can find is "to" in some of my paragraphs, I don't want to show that result. It doesn't contain the key words in the phrase. Does that make sense?
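The requirement just described, hiding a result that only matched on a word like "to", can be approximated with a client-side check after the search returns. This is a sketch under stated assumptions: the stop-word list is an illustrative sample, the helper names are invented, and exact word matching ignores the typo tolerance the search engine itself applies.

```python
import re

# Illustrative sample only, not an exhaustive or official stop-word list.
STOP_WORDS = frozenset({"a", "an", "and", "is", "of", "the", "to", "your", "yours"})


def significant_terms(phrase: str) -> list[str]:
    """Lowercase the phrase, split it into words, and drop the stop words."""
    return [w for w in re.findall(r"[a-z0-9']+", phrase.lower()) if w not in STOP_WORDS]


def is_meaningful_match(phrase: str, paragraph: str) -> bool:
    """True when the paragraph contains at least one non-stop-word from the phrase."""
    words = set(re.findall(r"[a-z0-9']+", paragraph.lower()))
    return any(term in words for term in significant_terms(phrase))


keep = is_meaningful_match("answer to god", "The answer is simple.")
drop = is_meaningful_match("answer to god", "We also support programming efforts to help.")
```

Swapping `any` for `all` gives the stricter rule that every key word must be present. On the server side, Typesense has a `drop_tokens_threshold` search parameter, and setting it to 0 is documented as disabling the dropping of query tokens when too few results are found, so that may be worth trying first.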
Kishore Nallan
02:46 PM
You have to treat those words as stop words and remove them from the query. However, whatever we try to do here, you will run into various edge cases and limitations. Highly semantic queries like these require a proper Q&A model.
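Removing stop words from the query before it is sent could be sketched as below. The word list is a small illustrative sample rather than a recommended set, and `strip_stop_words` is a hypothetical helper, not part of any Typesense client.

```python
import re

# Illustrative sample only; a real list depends on the language and the content.
STOP_WORDS = frozenset(
    {"a", "an", "and", "is", "of", "the", "to", "what", "your", "yours"}
)


def strip_stop_words(query: str) -> str:
    """Return the query with stop words and punctuation removed, preserving word order."""
    words = re.findall(r"[A-Za-z0-9']+", query)
    return " ".join(w for w in words if w.lower() not in STOP_WORDS)


cleaned = strip_stop_words("What is the programming of your organization?")
```

If every word in a query is a stop word the function returns an empty string, so the caller would need to decide whether to skip the search or fall back to the original query.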
Photo of md5-0ca37054c6c9042aa04fcfb92cc7d99c
02:47 PM
Good to know. Thanks Kishore
02:48 PM
Can you briefly explain what kind of edge cases I can expect, Kishore Nallan?
Kishore Nallan
03:20 PM
Things like a word which is otherwise a stopword being actually relevant in some contexts. You can't really encode a question into a bunch of keywords easily. When/where/why are all very different even if they might share the same keywords.

