Improving Search Relevance with Typesense

TLDR Viktor asks how Typesense calculates relevance, and Jason suggests using vector search, specifically S-BERT embeddings, to better match low-information queries to relevant documents.

Viktor
Mon, 13 Feb 2023 17:06:02 UTC

We noticed that the ranking of search results doesn’t weight uncommon terms more heavily than common terms, unlike a TF-IDF-based approach. As a result, less-relevant results are ranked higher than more-relevant ones. Are we mistaken in this observation? How do you recommend dealing with this? Is there some tuning we can do?

Jason
Mon, 13 Feb 2023 22:43:00 UTC

Could you share an example of a search query (along with all the search parameters), and a few sample documents that show this issue?

Dadi
Tue, 14 Feb 2023 09:01:32 UTC

Hey Jason, thanks! Here is an example of two queries we made. First query: *Report information security incident*. For this query we get relevant hits. Second query: *How do I report an information security incident*. For this query we get a hit on one document that matches all the words except “incident” (likely a typo-corrected match with “incipient”), but the document is not relevant to the query. In this query, “how”, “do”, “I”, and “an” are very low-information tokens, while “security”, or a bigram like “information security”, is very high-information and would get a high TF-IDF weight. The document matches the query mainly because the document is long and the query contains many low-information tokens. Does the text match score in Typesense depend on the relative frequency of tokens in the corpus?
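To make the point about low- and high-information tokens concrete, here is a minimal sketch of inverse document frequency (IDF) weighting. The three-document corpus and the `idf_weights` helper are invented for illustration, and the smoothed formula is just one common IDF variant. It is not how Typesense scores text matches.

```python
import math
from collections import Counter

# Hypothetical toy corpus, made up for this illustration only.
corpus = [
    "how do i report an information security incident",
    "how do i book a meeting room",
    "how do i request an incipient stage review of information",
]


def idf_weights(docs):
    """Return a smoothed IDF weight for every token in the corpus."""
    n = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each token once per document, not once per occurrence.
        doc_freq.update(set(doc.split()))
    return {
        token: math.log((1 + n) / (1 + count)) + 1
        for token, count in doc_freq.items()
    }


weights = idf_weights(corpus)
query = "how do i report an information security incident"
for token in query.split():
    print(f"{token:12s} {weights[token]:.3f}")
```

In this toy corpus, “how”, “do”, and “i” appear in every document and get the minimum weight, while “security” and “incident” appear in only one document and get the highest weight. A TF-IDF-style scorer would therefore let the rare tokens dominate the match.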

Dadi
Tue, 14 Feb 2023 09:03:28 UTC

The query params: ```''```

Jason
Wed, 15 Feb 2023 17:22:44 UTC

Ah I see. Typesense does not use TF-IDF.

Jason
Wed, 15 Feb 2023 17:24:08 UTC

One way you could handle queries like `How do I report an information security incident` is to use vector search. You’d use something like S-BERT to generate embeddings for your documents and index them in Typesense. Then, when a user types in a query, you generate an embedding for that query and do a nearest-neighbor search in Typesense.
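As a rough sketch of that flow, the snippet below builds a collection schema with a `float[]` embedding field and a `multi_search` request body that uses the `vector_query` parameter. The collection name `docs`, the field names, the `build_vector_search` helper, and the toy 4-dimensional vectors are all assumptions made for illustration. Generating the embeddings themselves (for example with an S-BERT model) and sending the request to a Typesense server are left out.

```python
import json

# Toy dimension for illustration; a real S-BERT model produces far larger
# vectors, and num_dim must match whatever model you actually use.
EMBEDDING_DIM = 4

# Hypothetical collection schema: one text field plus one vector field.
schema = {
    "name": "docs",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "embedding", "type": "float[]", "num_dim": EMBEDDING_DIM},
    ],
}


def build_vector_search(query_embedding, k=10, collection="docs"):
    """Build a multi_search body for a nearest-neighbor query.

    query_embedding is assumed to come from the same model that was
    used to embed the indexed documents.
    """
    if len(query_embedding) != EMBEDDING_DIM:
        raise ValueError("query embedding has the wrong number of dimensions")
    vector = ", ".join(f"{x:.6f}" for x in query_embedding)
    return {
        "searches": [
            {
                "collection": collection,
                "q": "*",
                "vector_query": f"embedding:([{vector}], k:{k})",
            }
        ]
    }


# Placeholder vector standing in for the embedding of the user's query.
body = build_vector_search([0.1, 0.2, 0.3, 0.4], k=5)
print(json.dumps(body, indent=2))
```

The resulting body would be sent as a POST to the server's `/multi_search` endpoint, which is generally preferable for vector queries because the embedding can be too long to fit in a URL query string.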

Dadi
Thu, 16 Feb 2023 10:28:44 UTC

ah ok, thanks!