#community-help

Solving Conflicts in Searching and Ordering Data with Typesense

TLDR SamHendley faced an issue with search result order in Typesense. Kishore Nallan explained two behaviors that affected the ranking and pledged to change these, while also considering an additional suggestion from SamHendley. These changes were implemented in version 0.24.0.rcn39.

Dec 03, 2022 (10 months ago)
SamHendley
12:40 AM
TL;DR: As I started writing this up I realized I wasn’t 100% sure I knew how it was actually behaving. I prepared a test file which is in the thread that shows some behavior I don’t understand.

I have two conflicting requirements. One is “Bucket things by relevancy and then sort by date”; the other is “An exact match in the title should always show first”. I don’t see an obvious way to do this with the current bucketing primitive. Imagine that I have 100 documents: 1 has the title “Potatoes” and the other 99 have titles like “Report about Potatoes” and secondary text like “Potatoes are a type of food” or “Aloo is a South Asian term for Potatoes”. That first document, the food profile, is what I really want returned first for a search of “Potato”. The food profile is updated relatively rarely but is still the most important document; the reports are published as interesting things occur and may be spread over a long time range. I have arranged the data and query_by order so the titles are ranked higher than the secondary text, so the raw _text_match score for the food profile is higher than that of any of the reports (let’s imagine it’s much higher).
If I use the simplest sort option, _text_match:desc,recent_activity:desc, I get the food profile first, but then I get the reports in strict order of their text match, which might mean some recent interesting reports fall off the first page because they have slightly worse text match scores.
So let’s take advantage of the bucketing feature. If I change my sort to _text_match(buckets: 100):desc,recent_activity:desc, my results are now heavily biased in favor of showing me recent things, which is what I wanted. The only problem is that I think this would push my food profile doc down the list, since it now has the same effective score as other documents (all those sharing the highest-ranked bucket).
I was going to ask if it would be possible to have the first few highest scores kept out of the bucketing so the exact match keeps its very high score. It could be something like _text_match(buckets: 10, exclude_top:1).
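
To make the concern above concrete, here is a minimal Python sketch of bucketed sorting. It is not Typesense's implementation: it assumes buckets are equal-sized slices of the results ranked by text match, and the documents, scores, and the bucketed_sort helper are all made up for illustration.

```python
def bucketed_sort(docs, buckets):
    """Rank by text match, tie everything inside a bucket, then sort by recency."""
    ranked = sorted(docs, key=lambda d: d["text_match"], reverse=True)
    # Assumption: equal-sized buckets by rank position (ceiling division).
    size = max(1, -(-len(ranked) // buckets))
    keyed = [(i // size, d) for i, d in enumerate(ranked)]
    # Lower bucket index first; inside a bucket, most recent activity first.
    keyed.sort(key=lambda p: (p[0], -p[1]["recent_activity"]))
    return [d["title"] for _, d in keyed]


# Hypothetical data: one exact title match and three reports.
docs = [
    {"title": "Potatoes", "text_match": 100, "recent_activity": 1},
    {"title": "Report about Potatoes A", "text_match": 60, "recent_activity": 5},
    {"title": "Report about Potatoes B", "text_match": 59, "recent_activity": 9},
    {"title": "Aloo report", "text_match": 30, "recent_activity": 7},
]

coarse = bucketed_sort(docs, buckets=2)  # exact match ties with report A
fine = bucketed_sort(docs, buckets=4)    # one doc per bucket: strict text match order
print(coarse)
print(fine)
```

With two buckets, the exact match shares the top bucket with a more recent report and drops to second place, which is the effect described in the message above.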
12:40
SamHendley
12:40 AM
Test file
12:42
SamHendley
12:42 AM
The default behavior is basically what I would expect. My ‘exact match’ document is the first one when using the default sorting of _text_match, huzzah.
12:44
SamHendley
12:44 AM
The first surprise was that the text match scores for reports 2 and 3 were the same even though they should have been slightly different due to prioritize_token_position. They are sorted correctly, so I’m guessing there is some precision that isn’t reported in the _text_match field returned to the client. That’s not a problem, just a surprise.
12:49
SamHendley
12:49 AM
If you then look at the results as we change the bucket count, they get really odd. First, all documents now appear to have an identical reported _text_match score, but the sorting is not strictly based on recent_activity, so something else must be changing the sort order. That makes me think the values are still there; they just aren’t being reported.
12:50
SamHendley
12:50 AM
buckets:8 gets me pretty close to the result I want: the 3 documents with “Potatoes” in the title are at the top, then sorted by recent_activity.
12:52
SamHendley
12:52 AM
What’s odd is that fewer or more buckets seem to give worse results. I would have thought buckets:100 would give me the most granularity, but it seems to do the opposite.
Kishore Nallan
10:34 AM
May I know what version of Typesense you are using? There is a precision problem with JSON in representing large int64 values used by Typesense for text match score. Because of this, in 0.24 RC builds, we return the score as a string field inside the text_match_info object in the response. This should clear up the confusion with the text match scores looking the same.
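
The precision problem mentioned here can be reproduced in a few lines. This is an illustrative sketch only: the two scores are hypothetical int64-sized values, and the parse_like_double_client helper emulates a client that stores every JSON number as an IEEE 754 double (as JavaScript does), since Python's own json module would keep the integers exact.

```python
import json

# Hypothetical int64-sized scores that differ only in their lowest bits.
score_a = 2**60 + 1
score_b = 2**60 + 2


def parse_like_double_client(payload):
    # Force every JSON integer through float to emulate a double-only client.
    return json.loads(payload, parse_int=float)


# Sent as JSON numbers: both values round to the same double.
num_a = parse_like_double_client(json.dumps({"text_match": score_a}))
num_b = parse_like_double_client(json.dumps({"text_match": score_b}))
numbers_collide = num_a["text_match"] == num_b["text_match"]

# Sent as JSON strings: the digits survive and can be parsed back exactly.
str_a = parse_like_double_client(json.dumps({"score": str(score_a)}))
str_b = parse_like_double_client(json.dumps({"score": str(score_b)}))
strings_collide = str_a["score"] == str_b["score"]

print(numbers_collide, strings_collide)  # True False
```

This matches the symptom in the thread: documents can be ordered correctly on the server while the numeric score shown to the client looks identical.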
SamHendley
01:06 PM
Yes, using 0.24.rcn37 shows the more accurate text match values. It still doesn’t help me understand why the bucketing is giving unexpected rankings.
01:06
SamHendley
01:06 PM
updated script
Dec 05, 2022 (10 months ago)
Kishore Nallan
02:53 PM
Couldn't get around to this today. I will take a closer look tomorrow.
Dec 06, 2022 (10 months ago)
Kishore Nallan
10:59 AM
SamHendley I went through the script. There are a couple of things that we are doing that might explain the behavior you are seeing:

1. We disable the prioritize exact match flag when bucketing is enabled. I'm not quite sure why we do this anymore. If there is a strong case for not doing it, I can remove this behavior.
2. When there are more buckets than the number of results (e.g. num_buckets: 100 but only 17 records are found), we put all the records into a single bucket. Instead I wonder if we should not bucket at all, i.e. they retain their original match scores.
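
The second point can be sketched as follows. This is a simplified model, not Typesense's code: it assumes each result's effective score is the top score of its bucket, and the effective_scores helper and its keep_scores_when_too_few_results flag are hypothetical names that switch between the behavior described above (a single bucket) and the suggested alternative (no bucketing).

```python
def effective_scores(scores, buckets, keep_scores_when_too_few_results):
    """Return the score each result is effectively sorted on, best first."""
    ranked = sorted(scores, reverse=True)
    if not ranked:
        return []
    if buckets > len(ranked):
        if keep_scores_when_too_few_results:
            return ranked                     # suggested: do not bucket at all
        return [ranked[0]] * len(ranked)      # described: one bucket, all tied
    size = -(-len(ranked) // buckets)         # ceiling division
    # Every result takes the score of the first (best) result in its bucket.
    return [ranked[(i // size) * size] for i in range(len(ranked))]


few_results = [90, 70, 50]  # hypothetical text match scores
single_bucket = effective_scores(few_results, 100, keep_scores_when_too_few_results=False)
unbucketed = effective_scores(few_results, 100, keep_scores_when_too_few_results=True)
regular = effective_scores([90, 70, 50, 30], 2, keep_scores_when_too_few_results=False)
print(single_bucket, unbucketed, regular)
```

In this model, asking for 100 buckets over 3 results ties everything under the single-bucket behavior, so the secondary sort field alone decides the order, which would explain why a larger bucket count appeared to give less granularity.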
SamHendley
01:16 PM
Ah, that starts to make sense; those two things combined explain all the effects I see. I think 2) is a very confusing behavior, and disabling bucketing would be a better approach when you have more buckets than records. For 1), it is also very surprising that enabling bucketed sorting would disable an unrelated scoring feature. I guess the idea might have been that introducing that sort of granularity just to then bucket it away seems like a waste of effort?

There was a lot in this thread; I had another suggestion you might have missed:
> I was going to ask if it would be possible to have the first few highest scores be kept out of the bucketing so the exact match keeps its very high score. Could be something like _text_match(buckets: 10, exclude_top:1).
This would allow hitting both “exact match first” and “show interesting results near top” requirements at same time.
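
A rough sketch of what the suggested exclude_top option might do is shown below. The parameter is only a proposal in this thread, so everything here is hypothetical: the function name, the equal-sized bucketing, and the sample documents are assumptions for illustration.

```python
def bucketed_sort_exclude_top(docs, buckets, exclude_top):
    """Keep the top `exclude_top` results in text match order, bucket the rest."""
    ranked = sorted(docs, key=lambda d: d["text_match"], reverse=True)
    head, tail = ranked[:exclude_top], ranked[exclude_top:]
    # Assumption: the remaining results are split into equal-sized buckets.
    size = max(1, -(-len(tail) // buckets))
    keyed = [(i // size, d) for i, d in enumerate(tail)]
    # Inside a bucket, the most recent activity wins.
    keyed.sort(key=lambda p: (p[0], -p[1]["recent_activity"]))
    return [d["title"] for d in head] + [d["title"] for _, d in keyed]


# Hypothetical data: one exact title match and three reports.
docs = [
    {"title": "Potatoes", "text_match": 100, "recent_activity": 1},
    {"title": "Report about Potatoes A", "text_match": 60, "recent_activity": 5},
    {"title": "Report about Potatoes B", "text_match": 59, "recent_activity": 9},
    {"title": "Aloo report", "text_match": 30, "recent_activity": 7},
]

result = bucketed_sort_exclude_top(docs, buckets=1, exclude_top=1)
print(result)
```

Here the exact match stays first even though it has the oldest activity, while the remaining reports are ordered by recency within their shared bucket.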
Dec 07, 2022 (10 months ago)
Kishore Nallan
08:32 AM
I'll make both those changes.
08:33
Kishore Nallan
08:33 AM
The exclude_top option is a larger change, so it needs to be prioritised against the backlog. I'll create a GitHub issue to track it.
03:12
Kishore Nallan
03:12 PM
Changes available in 0.24.0.rcn39