Hi, I would like to clarify new sorting param `bucke...
# community-help
l
Hi, I would like to clarify the new sorting param `bucket_size` in v28. I'm not able to get the expected results. I understand that with `_text_match(bucket_size:3)`, buckets of size 3 are created and each one is then sorted by the next criterion. I have two search configs that differ only in sort_by:
1) _text_match:desc,recent_popularity:desc
2) _text_match(bucket_size:3):desc,recent_popularity:desc
From 1) I receive more than 3 results with the same text_score, and yet in 2) I receive an item with a lower text_score in first position. For a better explanation I'm also enclosing a picture. Do I understand the behaviour of the bucket_size param correctly?
message has been deleted
j
CC: @Krunal Gandhi
l
Thanks for noticing. For now I would just like to know whether my understanding of the `bucket_size` functionality is correct; then I can dig deeper into this issue and try to synthesize an example.
k
Hi Lukas. With the `bucket_size` param, the `text_match_score` of all hits inside a bucket will be tied, and the tie-break will happen on the secondary sort param.
k
With `bucket_size: 3` we divide results into groups of 3 records and all the items in the group are deemed to have the same text match score. The secondary sorting condition is then used to sort the items within the group. So the behavior seen in the screenshot is correct, as you can see that the items are sorted on popularity.
One thing that's odd though is the presence of that shoe record. 🤔
We would probably need to see the actual query to check what's happening. If you are on Typesense cloud, please DM me your cluster ID and the actual query you are running.
l
Well, notice that the text_score of the shoe is actually lower than the text_score of the sticks, so it should not be in the first bucket of size 3.
Let me draw this example:
bucket_size = 3

text_match_score [10,10,10,10,5,5,5,5,1,1]

buckets = {[10,10,10],[10,5,5],[5,5,1],[1]} → within these buckets, the sort is applied according to the second sort param
-> meaning that an item with text_match_score 5 cannot jump into the first bucket
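The expectation drawn above can be written as a minimal Python sketch. It is only an illustration of the expected behaviour, not Typesense's implementation; the function name, the tuple layout and the sample hits (four sticks, one shoe, made-up popularity values) are assumptions.

```python
def expected_bucketed_order(hits, bucket_size):
    """Order hits the way the thread expects bucket_size to behave.

    hits: list of (doc_id, text_match_score, popularity) tuples (hypothetical layout).
    """
    # Rank by text match score first (descending); Python's sort is stable.
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    ordered = []
    for start in range(0, len(ranked), bucket_size):
        # Each consecutive slice is one bucket.
        bucket = ranked[start:start + bucket_size]
        # Inside a bucket only the secondary criterion (popularity) decides.
        ordered.extend(sorted(bucket, key=lambda h: h[2], reverse=True))
    return ordered


# Hypothetical sample: four sticks with a high score, one very popular shoe with a low score.
hits = [
    ("stick-1", 10, 1),
    ("stick-2", 10, 2),
    ("stick-3", 10, 3),
    ("stick-4", 10, 4),
    ("shoe", 5, 100),
]
order = [doc_id for doc_id, _, _ in expected_bucketed_order(hits, 3)]
print(order)  # ['stick-3', 'stick-2', 'stick-1', 'shoe', 'stick-4']
```

Under this reading the shoe can lead the second bucket, but it never enters the first three positions.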
k
You are correct.
We have to then investigate why this happens.
l
That's exactly what is happening in the screenshot: there are more sticks with the same, higher text_match score than the shoe, but the shoe still jumps to the top only because it has the highest secondary sorting criterion (popularity).
The important thing is that I understood the functionality correctly; that's what I primarily wanted to confirm.
We're running on a private TS cluster and upgraded to v28 two days ago, so I'm wondering whether this could somehow be related as well.
k
`bucket_size` has been introduced only in v28.
So there might be a bug. If you can produce a small sample dataset where the issue is present, we will be happy to debug and provide a patched build.
l
Hello @Kishore Nallan, I've created a sample dataset, enclosed here.
The schema is pretty straightforward:
schema = {
    "name": collection,  # Replace with your desired collection name
    "fields": [
        {"name": "short_description", "type": "string", "facet": False},
        {"name": "categories", "type": "string", "facet": False},
        {"name": "recent_popularity", "type": "int32", "facet": False},
    ],
    "default_sorting_field": "recent_popularity",
}
For this search query I receive different results than I would expect:
search_parameters = {
    "q": query,
    "query_by": "short_description,categories",
    "query_by_weights": "5,3",
    "prefix": "true",
    "sort_by": "_text_match(buckets:2):desc,recent_popularity:desc",
    "limit": "20",
}
I see, I get the same result. Please try exactly what I reported in the examples above, i.e. `bucket_size:3`.
f
Huh, yeah. It seems as though it's ignoring the weights for the query_by parameters. Could be something else as well while sorting. Using `_text_match(buckets: 3):desc` will net you the same results as I said about `bucket_size: 2`, since it's 6 / 3 = 2.
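The 6 / 3 = 2 arithmetic can be made explicit with a tiny helper. This is only a sketch of the relationship described in the message; the helper name is hypothetical, and rounding up for uneven divisions is an assumption, not something the thread confirms about Typesense.

```python
import math


def bucket_size_for(num_hits, num_buckets):
    """Approximate bucket size when `buckets: N` splits num_hits results evenly.

    Assumption: uneven divisions round up, so the last bucket may be smaller.
    """
    return math.ceil(num_hits / num_buckets)


# Six hits split into three buckets gives two hits per bucket.
print(bucket_size_for(6, 3))  # 2
```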
l
Well, I'm not sure I understood. So you confirm it's a bug, though?
"ignoring the weights for the query_by parameters"
So the potential issue could be that text_match_score respects the `query_by weights`, but the sort does not?
f
It may be something else, I'll have to debug. From what I see here, this seems to be a bug, as lower-scored documents appear further up in the results.
l
OK, thank you very much. It is quite critical for our sorting now, so I would appreciate it if you let me know how the debugging goes. Thanks
k
Without any bucketing, text match scores look like this:
0 -> "score": "578730123365711913",
1 -> "score": "578730123365711913",
2 -> "score": "578730123365711913",
3 -> "score": "578730123365711913",
4 -> "score": "578730123365711897",
5 -> "score": "578730123365711897",
With `bucket_size: 3`, scores are grouped into the `[0, 1, 2]` and `[3, 4, 5]` indices. Within each group, we pick the first record's text match score as the score for the entire group. Since indices 0 and 3 have the same score of `578730123365711913`, both buckets end up with the same text match score. After assigning the anchor score, we sort all the documents on this anchor text match score; because the anchors are tied, the documents end up sorted by popularity. This is the intended bucketing logic: the goal is to fuzz the text match scores so that there is a gradual transition in text match ranking. However, with small result sets like this one with a sharp change in score, it can lead to behavior like this.
But I can see why this can be a problem. Thinking of a way we can address it.
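To make the anchor-score behaviour concrete, here is a minimal Python sketch of the logic as described in this message. It is not Typesense's code: the function name and dict layout are hypothetical, and the popularity values are invented, since the thread only shows the text match scores.

```python
def rank_with_anchor_scores(hits, bucket_size):
    """Rank hits using the first record's score in each bucket as the bucket's anchor."""
    # Start from the plain text match ranking (descending, stable).
    ranked = sorted(hits, key=lambda h: h["score"], reverse=True)
    keyed = []
    for start in range(0, len(ranked), bucket_size):
        bucket = ranked[start:start + bucket_size]
        anchor = bucket[0]["score"]  # first record's score stands in for the whole bucket
        keyed.extend((anchor, h) for h in bucket)
    # Sort everything by anchor score, then by popularity, both descending.
    keyed.sort(key=lambda pair: (pair[0], pair[1]["popularity"]), reverse=True)
    return [h["id"] for _, h in keyed]


# Text match scores from the thread; popularity values are made up for illustration.
scores = [578730123365711913] * 4 + [578730123365711897] * 2
popularity = [10, 20, 30, 40, 90, 80]
hits = [
    {"id": i, "score": s, "popularity": p}
    for i, (s, p) in enumerate(zip(scores, popularity))
]
print(rank_with_anchor_scores(hits, 3))  # [4, 5, 3, 2, 1, 0]
```

Both buckets get the anchor `578730123365711913`, so the two lower-scored but popular records (ids 4 and 5) end up in first and second position.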
l
@Kishore Nallan Thank you for the great explanation (it might be worth adding to the docs as well). I've checked a real-world case and can confirm this behaviour. However, I think this is also an issue for real-world datasets. Let me explain: we have a large product catalogue where many items match `query=shoe` with the same `_text_match_score`, as the catalogue has many shoes (let's say X hundred). After those hundreds of results come other products with a lower score, and with `bucket_size:3` they can effectively jump (if a secondary criterion like popularity is high enough) from 150th position to 4th, because the anchor score of the 50th bucket (where the decline in score happens) is the same as the anchor score of the 2nd bucket. I think this is not an edge case but the typical situation for us. Maybe a quick solution would be to check whether all scores in a bucket are the same and, if they are not, not allow items from that bucket to jump higher in the ranking.
f
The bucketing has just been tweaked to address this. It will soon be available in an RC build
🙌 2
l
@Fanis Tharropoulos wow, great, can you please share how exactly you tweaked it?
f
k
Instead of using the first record's text match score as the anchor score within a bucket, we are just picking a number that's sequentially increasing for each bucket. This way, no document in a given bucket can jump ahead of an earlier bucket.
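A rough Python sketch of this tweak, for comparison with the anchor-score approach. It is an illustration under assumptions, not the actual fix: the function name and data layout are hypothetical, the popularity values are invented, and using an ascending bucket index as the primary sort key is just one way to model "a number that's sequentially increasing for each bucket".

```python
def rank_with_sequential_buckets(hits, bucket_size):
    """Rank hits using the bucket's position as its key instead of an anchor score."""
    # Start from the plain text match ranking (descending, stable).
    ranked = sorted(hits, key=lambda h: h["score"], reverse=True)
    keyed = []
    for position, h in enumerate(ranked):
        bucket_index = position // bucket_size  # 0 for the first bucket, 1 for the next, ...
        keyed.append((bucket_index, -h["popularity"], h["id"]))
    # Earlier buckets always come first; popularity only reorders within a bucket.
    keyed.sort(key=lambda k: (k[0], k[1]))
    return [doc_id for _, _, doc_id in keyed]


# Text match scores from the thread; popularity values are made up for illustration.
scores = [578730123365711913] * 4 + [578730123365711897] * 2
popularity = [10, 20, 30, 40, 90, 80]
hits = [
    {"id": i, "score": s, "popularity": p}
    for i, (s, p) in enumerate(zip(scores, popularity))
]
print(rank_with_sequential_buckets(hits, 3))  # [2, 1, 0, 4, 5, 3]
```

The popular lower-scored records (ids 4 and 5) can still lead the second bucket, but they can no longer overtake anything in the first one.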
l
I see, this is a clear, expected and completely fine solution. (I was just wondering whether, for search and ranking purposes, it would be better to check if all scores in a bucket are the same and, if so, treat it as in your original implementation, i.e. allow items to jump across buckets since the scores are equal, and not allow that when a bucket's scores are not all the same. The price of this solution is probably more complexity and more compute time...)
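The alternative floated here could look roughly like the Python sketch below. It is one possible interpretation of the idea, not something Typesense implements: consecutive buckets are merged into one group only while they are uniform and share the same score, and any mixed bucket starts a new group. All names and sample values are hypothetical.

```python
def rank_with_uniformity_check(hits, bucket_size):
    """Let items move across buckets only while those buckets share one uniform score."""
    ranked = sorted(hits, key=lambda h: h["score"], reverse=True)
    keyed = []
    group = 0
    prev_anchor, prev_uniform = None, True
    for start in range(0, len(ranked), bucket_size):
        bucket = ranked[start:start + bucket_size]
        anchor = bucket[0]["score"]
        uniform = all(h["score"] == anchor for h in bucket)
        # Stay in the same group only if this bucket and the previous one are
        # both uniform and have the same score; otherwise open a new group.
        if start > 0 and not (uniform and prev_uniform and anchor == prev_anchor):
            group += 1
        keyed.extend((group, -h["popularity"], h["id"]) for h in bucket)
        prev_anchor, prev_uniform = anchor, uniform
    keyed.sort(key=lambda k: (k[0], k[1]))
    return [doc_id for _, _, doc_id in keyed]


# Hypothetical data: six equally scored hits followed by two popular, lower-scored ones.
scores = [10] * 6 + [5] * 2
popularity = [1, 2, 3, 4, 5, 6, 100, 90]
hits = [
    {"id": i, "score": s, "popularity": p}
    for i, (s, p) in enumerate(zip(scores, popularity))
]
print(rank_with_uniformity_check(hits, 3))  # [5, 4, 3, 2, 1, 0, 6, 7]
```

The first two buckets are uniform with the same score, so their six items are reordered together by popularity, while the lower-scored items stay behind them. The extra pass over each bucket is the added cost mentioned in the message.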
k
For now, I wanted to keep this simpler to reason about. I've published `29.0.rc3`, which contains the fix.
👍 1
l
Thanks, we're going to deploy it to our develop env today and check it.
👍 1