#community-help

Best Practices for Multisearch Across Collections and Removing Non-important Words

TLDR robert asked for best practices on multisearching across collections and deduping results. He later asked about lessening the importance of trivial words in the search results. Kishore Nallan suggested implementing stop words and a proper Q&A model to tackle semantic queries.

Solved
Nov 12, 2022 (11 months ago)
robert
01:51 PM
What's the best practice for doing multiple keyword searches in multiple collections? How can I dedupe the results across the multisearch?

For example, I have the keywords "programming", "organization", and "files". I have two collections I want to search these keywords in uniquely. Then I want to share the results by grouping the keywords together and deduping their results within a collection. I have two different UI views sharing the results from the two different collections.

Is this all client side manipulation?
Jason
01:58 PM
When you say dedupe, does that mean the same document exists in multiple collections?
robert
01:58 PM
To add to this, is there a feature where commas could do this server side?
robert
01:59 PM
If I do a multisearch on the collection and it returns the same document in multiple searches, how do I group the multisearches together? I guess I would just insert into a set keyed on id. That's easy to do client side. Ignore that part.
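
The client-side dedupe on id described here could look roughly like the sketch below. It assumes each per-search result carries a `hits` list whose entries have a `document` with an `id` and a `text_match` score; the function name `dedupe_hits` is made up for illustration. Sorting by `text_match` across different queries is only a rough heuristic.

```python
def dedupe_hits(results: list) -> list:
    """Merge hits from several searches, keeping one hit per document id."""
    best_by_id = {}
    for result in results:
        for hit in result.get("hits", []):
            doc_id = hit["document"]["id"]
            current = best_by_id.get(doc_id)
            # When the same document shows up twice, keep the higher-scoring hit.
            if current is None or hit.get("text_match", 0) > current.get("text_match", 0):
                best_by_id[doc_id] = hit
    # Highest-scoring documents first.
    return sorted(best_by_id.values(), key=lambda h: h.get("text_match", 0), reverse=True)


# Two searches against the same collection that both return document "2".
sample_results = [
    {"hits": [{"document": {"id": "1"}, "text_match": 10},
              {"document": {"id": "2"}, "text_match": 5}]},
    {"hits": [{"document": {"id": "2"}, "text_match": 8},
              {"document": {"id": "3"}, "text_match": 1}]},
]
merged = dedupe_hits(sample_results)
merged_ids = [hit["document"]["id"] for hit in merged]
```

To dedupe within a collection, pass in only the results of the searches that targeted that collection.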
robert
02:02 PM
Let me make a clearer example:

Documents in collection contain paragraphs of text. I'm trying to search paragraphs for keywords.

For example, a document might have a paragraph: "My organization services the poor and unneeded. We do that by providing clothes & shelter. We also support programming efforts by xyz."

The user then has a question:

"What is your organization's mission?"

On the client side I've explored breaking down the question into semantic keywords like: organization, mission.

I then want to search "organization, mission" and match individually on the keywords to retrieve the above paragraph.

There are multiple collections that contain different "paragraph-like" snippets like the one above but serve different UI purposes.

How do I best utilize Typesense to solve this particular use case?
robert
02:09 PM
Matches on words like is, yours, the, etc. aren't really valuable. It's more of a semantic search combined with the text search of Typesense. Just curious if y'all have use case examples of that. I'd imagine I'm not alone in leveraging Typesense in a similar way 🙂
Kishore Nallan
02:40 PM
This requires a question-answering machine learning model. See this tutorial for reference: https://medium.com/analytics-vidhya/question-answering-system-with-bert-ebe1130f8def
robert
02:43 PM
Yeah, I think a simpler model right now is what I'm doing:
1. User has question
2. Use OpenAI to parse a question like "What is your mission statement" to get the output "mission statement".
3. Take the keyword and search Typesense across multiple collections
4. Show results
The problem with the above is when the question is something like "What is the programming of your organization? How do you ensure equal results? What is the real answer to god?"

And we parse that into "programming, ensure equal results, answer to god" and we want to search all of those keywords equally against multiple collections.

My questions are:
1. How to lessen the weight of unimportant words (the, and, your, etc.) in results?
2. Is there a best practice for multisearch in the above scenario?
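
One way to set up the scenario above is to send a single multi_search request with one entry per (collection, keyword) pair. The sketch below only builds the request body; the collection names `faq_paragraphs` and `about_paragraphs`, the field name `text`, and the helper `build_multi_search` are all made up for illustration.

```python
import json


def build_multi_search(keywords, collections, query_by):
    """Build a multi_search body with one search per (collection, keyword) pair."""
    searches = []
    for collection in collections:
        for keyword in keywords:
            searches.append({
                "collection": collection,
                "q": keyword,
                "query_by": query_by,
            })
    return {"searches": searches}


payload = build_multi_search(
    keywords=["programming", "ensure equal results", "answer to god"],
    collections=["faq_paragraphs", "about_paragraphs"],  # hypothetical collection names
    query_by="text",  # hypothetical field holding the paragraph
)
# This body would be POSTed to the /multi_search endpoint, or passed to
# client.multi_search.perform(payload, {}) in the Python client (not run here).
print(json.dumps(payload, indent=2))
```

Results come back in the same order as the searches, so the first three results here belong to the first collection and the last three to the second, which makes grouping per collection on the client straightforward.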
robert
02:45 PM
If "answer to god" is the phrase to search, and the only thing it can find is "to" in some of my paragraphs, I don't want to show that result. It doesn't contain the key words in the phrase. Does that make sense?
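
A rough client-side filter for this case might drop any hit whose matched words are all trivial. The sketch assumes each hit has a `highlights` list whose entries include `matched_tokens`; the stop word list and the helper names are illustrative, not part of Typesense.

```python
# Illustrative list only; a real one would be longer and tuned to the content.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to", "your", "yours"}


def matched_tokens(hit):
    """Collect the lowercase tokens that matched in a hit's highlights."""
    tokens = set()
    for highlight in hit.get("highlights", []):
        for token in highlight.get("matched_tokens", []):
            # Array fields may report a list of token lists; flatten that case.
            if isinstance(token, list):
                tokens.update(t.lower() for t in token)
            else:
                tokens.add(token.lower())
    return tokens


def has_meaningful_match(hit):
    """True if at least one matched token is not a stop word."""
    return bool(matched_tokens(hit) - STOP_WORDS)


only_to = {"highlights": [{"field": "text", "matched_tokens": ["to"]}]}
answer_and_to = {"highlights": [{"field": "text", "matched_tokens": ["Answer", "to"]}]}
kept = [hit for hit in (only_to, answer_and_to) if has_meaningful_match(hit)]
```

Here only the second hit survives, because the first one matched nothing but "to".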
Kishore Nallan
02:46 PM
You have to treat those words as stop words and remove them from the query. However, whatever we try to do here, you will run into various edge cases and limitations. These kinds of highly semantic queries require a proper Q&A model.
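
Removing stop words from the query before it is sent can be done in a few lines on the client. This is a minimal sketch with a tiny, made-up stop word list and a hypothetical helper `strip_stop_words`; it does not address the edge cases mentioned above.

```python
import re

# Illustrative list only; extend it for real use.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to", "what", "your"}


def strip_stop_words(phrase: str) -> str:
    """Lowercase the phrase and drop every token found in STOP_WORDS."""
    tokens = re.findall(r"[a-z0-9']+", phrase.lower())
    return " ".join(token for token in tokens if token not in STOP_WORDS)


cleaned_phrase = strip_stop_words("answer to god")
cleaned_question = strip_stop_words("What is your organization's mission?")
only_stop_words = strip_stop_words("the and your")
```

A phrase made up entirely of stop words comes back empty, so the caller should skip the search for it rather than send an empty query.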
robert
02:47 PM
Good to know. Thanks Kishore
robert
02:48 PM
Can you briefly explain what kind of edge cases I can expect, Kishore Nallan?
Kishore Nallan
03:20 PM
Things like a word which is otherwise a stopword being actually relevant in some contexts. You can't really encode a question into a bunch of keywords easily. When/where/why are all very different even if they might share the same keywords.
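
The when/where/why point can be seen with a toy example: once question words are treated as stop words, three different questions collapse to the same keyword set. The stop word list, the sample questions, and the helper `keyword_key` are made up for illustration.

```python
from collections import defaultdict

# Question words are often on stop word lists, which is what causes the collision.
STOP_WORDS = {"when", "where", "why", "is", "the"}


def keyword_key(question):
    """Reduce a question to the set of its non-stop-word tokens."""
    words = question.lower().rstrip("?").split()
    return frozenset(word for word in words if word not in STOP_WORDS)


questions = [
    "When is the fundraiser?",
    "Where is the fundraiser?",
    "Why is the fundraiser?",
]
groups = defaultdict(list)
for question in questions:
    groups[keyword_key(question)].append(question)
```

All three questions end up under the single key `{"fundraiser"}`, so a keyword search cannot tell them apart even though they need different answers.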