We noticed that the ranking of search results doesn’t take u typesense #community-help

Viktor Qvarfordt

02/13/2023, 5:06 PM

We noticed that the ranking of search results doesn’t weight uncommon terms more heavily than common terms. As a result, less-relevant results are ranked higher than more-relevant ones, in contrast to how a TF-IDF-based approach would behave. Are we mistaken in this observation? How do you recommend dealing with this? Is there some tuning we can do?

Jason Bosco

02/13/2023, 10:43 PM

Could you share an example of a search query (along with all the search parameters), and a few sample documents that show this issue?

Dadi Bjarnason

02/14/2023, 9:01 AM

Hey Jason, thanks! Here is an example of two queries we made:

Report information security incident

For this query we get relevant hits.

How do I report an information security incident

For this query we get a hit on one document that matches all the words except “incident” (likely a typo-corrected match with “incipient”), but the document is not relevant to the query. In this query, “how”, “do”, “I”, and “an” are very low-information tokens, while “security”, or a bigram like “information security”, is very high-information and would get a high TF-IDF weight. The document matches the query mainly because the document is long and the query contains many low-information tokens. Does the text match score in Typesense depend on the relative frequency of tokens in the corpus?
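To make the point about low-information tokens concrete, here is a minimal sketch of inverse document frequency on a toy corpus. The four documents are made up for illustration, and the smoothed formula is one common IDF variant, not anything Typesense computes.

```python
import math


def idf(term, docs):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    # df = number of documents that contain the term as a whole token
    df = sum(1 for doc in docs if term in doc.lower().split())
    return math.log((1 + len(docs)) / (1 + df)) + 1


# Toy corpus, invented for this example.
docs = [
    "how do i reset my password",
    "how do i book an annual leave day",
    "report an information security incident to the it team",
    "how do i submit an expense report",
]

# Tokens that appear in most documents get a lower weight than rare ones.
for term in ["how", "do", "i", "an", "report", "security", "incident"]:
    print(f"{term:10s} {idf(term, docs):.2f}")
```

In this toy corpus, “how”, “do”, “i”, and “an” each appear in three of four documents, so they score lower than “security” and “incident”, which appear in only one.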

Dadi Bjarnason

02/14/2023, 9:03 AM

The query params:

http://localhost:8108/collections/entities/documents/search?q=how+do+I+report+an+information+security+incident&query_by=title%2Ctext&sort_by=_text_match%3Adesc%2CupdatedAtUnixMs%3Adesc&highlight_full_fields=title&drop_tokens_threshold=3
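For readability, the URL-encoded parameters above can be decoded with Python's standard library. This sketch only parses the URL from the message; it does not contact a Typesense server.

```python
from urllib.parse import parse_qs, urlsplit

# The search URL from the message, split across lines for readability.
url = (
    "http://localhost:8108/collections/entities/documents/search"
    "?q=how+do+I+report+an+information+security+incident"
    "&query_by=title%2Ctext"
    "&sort_by=_text_match%3Adesc%2CupdatedAtUnixMs%3Adesc"
    "&highlight_full_fields=title"
    "&drop_tokens_threshold=3"
)

# parse_qs returns a list per key; each key occurs once here, so take the first.
params = {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}

for name, value in params.items():
    print(f"{name} = {value}")
```

Decoded, the query searches the `title` and `text` fields and sorts by `_text_match` descending, then `updatedAtUnixMs` descending.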

Jason Bosco

02/15/2023, 5:22 PM

Ah I see, Typesense does not use TF-IDF

Jason Bosco

02/15/2023, 5:24 PM

One way you could handle queries like

How do I report an information security incident

is to use vector search. You’d use something like S-BERT to generate embeddings for your documents, then when users type in a query, generate an embedding for it and do a nearest-neighbor search in Typesense.
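A rough sketch of the query side of that approach, under several assumptions: the collection is `entities` (taken from the URL earlier in the thread), the documents carry a hypothetical `float[]` field named `embedding` declared with a `num_dim` in the schema, and the query embedding has already been computed by a model such as S-BERT. The helper name `build_vector_search` is made up for this example; the `vector_query` string follows the format described in Typesense's vector search documentation, so check it against the version you run.

```python
import json


def build_vector_search(collection, field, embedding, k=10):
    """Build a multi_search request body for a nearest-neighbor query.

    `field` is the (hypothetical) name of the float[] field holding the
    document embeddings; `embedding` is the query vector as a list of floats.
    """
    vector = ", ".join(f"{x:.6f}" for x in embedding)
    return {
        "searches": [
            {
                "collection": collection,
                "q": "*",  # match all documents; ranking comes from the vector
                "vector_query": f"{field}:([{vector}], k:{k})",
            }
        ]
    }


# A tiny made-up vector; a real S-BERT embedding has hundreds of dimensions.
body = build_vector_search("entities", "embedding", [0.1, 0.2, 0.3], k=5)
print(json.dumps(body, indent=2))
```

The resulting body would be sent as a POST to the `/multi_search` endpoint rather than as a GET query string, since real embeddings are too long to fit comfortably in a URL.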

Dadi Bjarnason

02/16/2023, 10:28 AM

ah ok, thanks!
