TLDR Viktor asks how Typesense calculates relevance and Jason suggests using vector search, specifically S-BERT embeddings, to better match low information queries to relevant documents.
We noticed that the ranking of search results doesn’t weight uncommon terms more heavily than common terms. This results in less-relevant results being ranked higher than more-relevant ones, in contrast to how a TF-IDF-based approach would behave. Are we mistaken in this observation? How do you recommend dealing with this — is there some tuning we can do?
Could you share an example of a search query (along with all the search parameters), and a few sample documents that show this issue?
Hey Jason, thanks!
Here is an example of two queries we made:
*Report information security incident*
For this query we get relevant hits
*How do I report an information security incident*
For this query we get a hit on one document that matches all the words except “incident” (likely a typo-corrected match on “incipient”). But the document is not relevant to the query.
In this query “how”, “do”, “I”, and “an” are very low-information tokens, while “security”, or a bigram like “information security”, is very high-information and would receive a high TF-IDF weight.
The document is matching the query mainly because the document is long and the query contains many low-information tokens.
Does the text match score in Typesense depend on the relative frequency of tokens in the corpus?
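To illustrate the point being made here: under a TF-IDF scheme, a term’s weight grows as it becomes rarer in the corpus, so stop-word-like tokens contribute little to the score. A minimal sketch with a toy corpus (hypothetical documents, not Typesense’s actual scoring):

```python
import math

# Toy corpus: hypothetical documents, purely for illustration.
docs = [
    "how do i reset my password",
    "how do i contact support",
    "report an information security incident to the team",
    "how do i request vacation days",
]

def idf(term, docs):
    # Smoothed inverse document frequency: rare terms score higher.
    df = sum(term in d.split() for d in docs)
    return math.log((1 + len(docs)) / (1 + df)) + 1

# "how" appears in most documents -> low IDF;
# "incident" appears in only one -> high IDF.
assert idf("incident", docs) > idf("how", docs)
```

A keyword scorer without such corpus-frequency weighting treats a match on “how” the same as a match on “incident”, which is consistent with the behavior described above.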
The query params:
Ah, I see. Typesense does not use TF-IDF.
One way you could handle queries like `How do I report an information security incident` is to use vector search.
You’d use something like, say, S-BERT to generate embeddings for your documents; then, when users type in a query, generate an embedding for it and do a nearest-neighbor search in Typesense.
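The flow described above — embed documents, embed the query, rank by nearest neighbor — can be sketched as follows. This uses tiny hand-made 3-d vectors standing in for real S-BERT embeddings (which are typically 384–768 dimensions), and plain cosine similarity in place of Typesense’s vector search; the document names are made up for the example:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Stand-ins for S-BERT document embeddings (hypothetical docs).
doc_embeddings = {
    "incident-reporting-guide": [0.9, 0.1, 0.2],
    "vacation-policy": [0.1, 0.8, 0.3],
}

# Stand-in for the embedding of "How do I report an information
# security incident" — close in direction to the relevant document.
query_embedding = [0.85, 0.15, 0.25]

# Nearest neighbor: rank documents by similarity to the query.
best = max(doc_embeddings, key=lambda d: cosine(query_embedding, doc_embeddings[d]))
print(best)  # -> "incident-reporting-guide"
```

Because the whole query is compared semantically, the low-information tokens (“how”, “do”, “I”) no longer dominate the match the way they can in a pure keyword search.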
ah ok, thanks!