Hi, I would like to clarify the new sorting param `bucket_size` in Typesense #community-help

Lukas Matejka

02/20/2025, 8:01 PM

Hi, I would like to clarify the new sorting param `bucket_size` in v28. I'm not able to get the expected results. I understand that with `_text_match(bucket_size:3)`, buckets of size 3 are created and then sorted by the next criterion. I have two search configs that differ only in `sort_by`:

1) _text_match:desc,recent_popularity:desc
2) _text_match(bucket_size:3):desc,recent_popularity:desc

From 1) I receive more than 3 results with the same text_score, and yet in 2) I receive in first position an item with a lower text_score. For a better explanation I'm also enclosing a picture. Do I understand the behaviour of the `bucket_size` param correctly?

Lukas Matejka

02/20/2025, 8:07 PM

message has been deleted

Jason Bosco

02/21/2025, 3:43 AM

CC: @Krunal Gandhi

Lukas Matejka

02/21/2025, 8:57 AM

Thanks for noticing. For now I would just like to know whether my understanding of the `bucket_size` functionality is correct; I can then dig deeper into this issue and try to synthesise an example.

Krunal Gandhi

02/21/2025, 9:20 AM

Hi Lukas, for the `bucket_size` param, the `text_match_score` of all hits inside a bucket will be tied, and the tie-breaker will happen on the secondary sort param.

Kishore Nallan

02/21/2025, 12:49 PM

With `bucket_size: 3` we divide results into groups of 3 records, and all the items in a group are deemed to have the same text match score. The secondary sorting condition is then used to sort the items within the group. So the behavior seen in the screenshot is correct, as you can see that the items are sorted on popularity.

Kishore Nallan

02/21/2025, 12:50 PM

One thing that's odd though is the presence of that shoe record. 🤔

Kishore Nallan

02/21/2025, 12:51 PM

We would probably need to see the actual query to check what's happening. If you are on Typesense cloud, please DM me your cluster ID and the actual query you are running.

Lukas Matejka

02/21/2025, 1:06 PM

Well, notice that the text_score of the shoe is actually lower than the text_score of the stick -> it should not be in the first bucket of size 3

Lukas Matejka

02/21/2025, 1:06 PM

let me draw this example

bucket_size = 3

text_match_score [10,10,10,10,5,5,5,5,1,1]

buckets = {[10,10,10],[10,5,5],[5,5,1],[1]} → within these buckets, sorting is applied according to the second sort param

Lukas Matejka

02/21/2025, 1:07 PM

-> meaning that an item with text_match_score 5 cannot jump into the first bucket
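
The expectation drawn above can be written down as a minimal sketch. It models only the behaviour Lukas expects (chunk the ranked hits, then sort inside each chunk by the secondary field); it is not Typesense's implementation. The function name and the popularity values are made up for illustration.

```python
from typing import List, Tuple

Hit = Tuple[int, int]  # (text_match_score, popularity); popularity values are hypothetical


def expected_bucketed_order(hits: List[Hit], bucket_size: int) -> List[Hit]:
    """Chunk hits ranked by score into buckets, then sort each bucket by popularity."""
    # Rank by text match score, highest first (stable sort keeps ties in input order).
    ranked = sorted(hits, key=lambda hit: hit[0], reverse=True)
    ordered: List[Hit] = []
    for start in range(0, len(ranked), bucket_size):
        bucket = ranked[start:start + bucket_size]
        # Inside a bucket only the secondary criterion decides the order.
        ordered.extend(sorted(bucket, key=lambda hit: hit[1], reverse=True))
    return ordered


scores = [10, 10, 10, 10, 5, 5, 5, 5, 1, 1]
popularity = [1, 2, 3, 4, 50, 60, 70, 80, 900, 1000]  # made-up values
result = expected_bucketed_order(list(zip(scores, popularity)), bucket_size=3)
print(result[:3])   # first bucket: only score-10 hits, ordered by popularity
print(result[3:6])  # second bucket: popular score-5 hits lead, but only inside this bucket
```

Under this model a hit can move only within its own bucket, so a score-5 hit can overtake the fourth score-10 hit but never reach the first three positions.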

Kishore Nallan

02/21/2025, 1:09 PM

You are correct.

Kishore Nallan

02/21/2025, 1:09 PM

We have to then investigate why this happens.

Lukas Matejka

02/21/2025, 1:10 PM

That's exactly what is happening in the screenshot: there are more sticks with the same, higher text_match score than the shoe, but the shoe still jumps to the top only because it has the highest secondary sorting criterion (popularity)

Lukas Matejka

02/21/2025, 1:12 PM

The important thing is that I understood the functionality correctly; that's what I primarily want to confirm.

Lukas Matejka

02/21/2025, 1:13 PM

We're running on a private TS cluster and upgraded to v28 two days ago, so I'm wondering whether this could also be related somehow

Kishore Nallan

02/21/2025, 1:14 PM

`bucket_size` was introduced only in v28

Kishore Nallan

02/21/2025, 1:15 PM

So there might be a bug. If you can produce a small sample dataset where the issue is present, we will be happy to debug and provide a patched build.

Lukas Matejka

02/24/2025, 10:42 AM

Hello @Kishore Nallan, I've created a sample dataset, enclosed here:

bucketing_debug.jsonl

Lukas Matejka

02/24/2025, 10:42 AM

The schema is pretty straightforward:

schema = {
    "name": collection,  # Replace with your desired collection name
    "fields": [
        {"name": "short_description", "type": "string", "facet": False},
        {"name": "categories", "type": "string", "facet": False},
        {"name": "recent_popularity", "type": "int32", "facet": False},
    ],
    "default_sorting_field": "recent_popularity",
}

Lukas Matejka

02/24/2025, 10:43 AM

For this search query I receive different results than I would expect:

search_parameters = {
    "q": query,
    "query_by": "short_description,categories",
    "query_by_weights": "5,3",
    "prefix": "true",
    "sort_by": "_text_match(buckets:2):desc,recent_popularity:desc",
    "limit": "20",
}

Lukas Matejka

02/24/2025, 10:45 AM

Items with popularity 1000 and 100 are at the top of the results, even though their text_match scores are lower, so they should be in the 2nd bucket -> they should not be at the top

Lukas Matejka

02/24/2025, 10:46 AM

The same happens for this query:

search_parameters = {
    "q": query,
    "query_by": "short_description,categories",
    "query_by_weights": "5,3",
    "prefix": "true",
    "sort_by": "_text_match(bucket_size:3):desc,recent_popularity:desc",
    "limit": "20",
}

I would expect that items with popularity 1000 and 100 cannot jump to the top, as there are at least 3 items with a higher text_match_score
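
To compare the two orderings side by side, a small helper along these lines may be handy. It is only a sketch: `ids_for_sort_orders` is a hypothetical name, and `client` is assumed to be an already configured client from the official `typesense` Python package, which exposes `client.collections[name].documents.search(params)`. The other parameters are copied from the queries above.

```python
def ids_for_sort_orders(client, collection, query, sort_bys):
    """Run the same search once per sort_by and collect (id, text_match) per hit."""
    results = {}
    for sort_by in sort_bys:
        params = {
            "q": query,
            "query_by": "short_description,categories",
            "query_by_weights": "5,3",
            "prefix": "true",
            "sort_by": sort_by,
            "limit": "20",
        }
        # Assumes the search response carries a "hits" list, as in the JSON shown in this thread.
        response = client.collections[collection].documents.search(params)
        results[sort_by] = [
            (hit["document"]["id"], hit["text_match"]) for hit in response["hits"]
        ]
    return results


# Example call (needs a running Typesense server and a configured client):
# ids_for_sort_orders(client, collection, "shoe", [
#     "_text_match:desc,recent_popularity:desc",
#     "_text_match(bucket_size:3):desc,recent_popularity:desc",
# ])
```

Printing the two lists next to each other makes it easy to see whether a hit with a lower `text_match` has moved ahead of higher-scored ones.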

Fanis Tharropoulos

02/24/2025, 10:48 AM

What's the `query` object's value here?

Lukas Matejka

02/24/2025, 10:50 AM

query=shoe

Fanis Tharropoulos

02/24/2025, 10:56 AM

Can confirm that it works when using `bucket_size` but not `buckets`

Lukas Matejka

02/24/2025, 10:58 AM

Can you please explain what "works" means when using `bucket_size`?

Fanis Tharropoulos

02/24/2025, 10:59 AM

If you use `bucket_size: 2`, for example, you'll get the first bucket with the two documents matching the score, and then sorted based on popularity

Lukas Matejka

02/24/2025, 11:00 AM

For `bucket_size=3` I can see the following output, with the items with popularity 1000 and 100 at the top

Fanis Tharropoulos

02/24/2025, 11:02 AM

Hits on 28.0:

[
  {
    "document": { "categories": "", "id": "3", "recent_popularity": 13, "short_description": "shoe" },
    "highlight": { "short_description": { "matched_tokens": ["shoe"], "snippet": "<mark>shoe</mark>" } },
    "highlights": [{ "field": "short_description", "matched_tokens": ["shoe"], "snippet": "<mark>shoe</mark>" }],
    "text_match": 578730123365711913,
    "text_match_info": {
      "best_field_score": "1108091339008",
      "best_field_weight": 5,
      "fields_matched": 1,
      "num_tokens_dropped": 0,
      "score": "578730123365711913",
      "tokens_matched": 1,
      "typo_prefix_score": 0
    }
  },
  {
    "document": { "categories": "", "id": "2", "recent_popularity": 12, "short_description": "shoe" },
    "highlight": { "short_description": { "matched_tokens": ["shoe"], "snippet": "<mark>shoe</mark>" } },
    "highlights": [{ "field": "short_description", "matched_tokens": ["shoe"], "snippet": "<mark>shoe</mark>" }],
    "text_match": 578730123365711913,
    "text_match_info": {
      "best_field_score": "1108091339008",
      "best_field_weight": 5,
      "fields_matched": 1,
      "num_tokens_dropped": 0,
      "score": "578730123365711913",
      "tokens_matched": 1,
      "typo_prefix_score": 0
    }
  },
  {
    "document": { "categories": "", "id": "1", "recent_popularity": 11, "short_description": "shoe" },
    "highlight": { "short_description": { "matched_tokens": ["shoe"], "snippet": "<mark>shoe</mark>" } },
    "highlights": [{ "field": "short_description", "matched_tokens": ["shoe"], "snippet": "<mark>shoe</mark>" }],
    "text_match": 578730123365711913,
    "text_match_info": {
      "best_field_score": "1108091339008",
      "best_field_weight": 5,
      "fields_matched": 1,
      "num_tokens_dropped": 0,
      "score": "578730123365711913",
      "tokens_matched": 1,
      "typo_prefix_score": 0
    }
  },
  {
    "document": { "categories": "", "id": "0", "recent_popularity": 10, "short_description": "shoe" },
    "highlight": { "short_description": { "matched_tokens": ["shoe"], "snippet": "<mark>shoe</mark>" } },
    "highlights": [{ "field": "short_description", "matched_tokens": ["shoe"], "snippet": "<mark>shoe</mark>" }],
    "text_match": 578730123365711913,
    "text_match_info": {
      "best_field_score": "1108091339008",
      "best_field_weight": 5,
      "fields_matched": 1,
      "num_tokens_dropped": 0,
      "score": "578730123365711913",
      "tokens_matched": 1,
      "typo_prefix_score": 0
    }
  },
  {
    "document": { "categories": "shoe", "id": "5", "recent_popularity": 1000, "short_description": "" },
    "highlight": { "categories": { "matched_tokens": ["shoe"], "snippet": "<mark>shoe</mark>" } },
    "highlights": [{ "field": "categories", "matched_tokens": ["shoe"], "snippet": "<mark>shoe</mark>" }],
    "text_match": 578730123365711897,
    "text_match_info": {
      "best_field_score": "1108091339008",
      "best_field_weight": 3,
      "fields_matched": 1,
      "num_tokens_dropped": 0,
      "score": "578730123365711897",
      "tokens_matched": 1,
      "typo_prefix_score": 0
    }
  },
  {
    "document": { "categories": "shoe", "id": "4", "recent_popularity": 100, "short_description": "" },
    "highlight": { "categories": { "matched_tokens": ["shoe"], "snippet": "<mark>shoe</mark>" } },
    "highlights": [{ "field": "categories", "matched_tokens": ["shoe"], "snippet": "<mark>shoe</mark>" }],
    "text_match": 578730123365711897,
    "text_match_info": {
      "best_field_score": "1108091339008",
      "best_field_weight": 3,
      "fields_matched": 1,
      "num_tokens_dropped": 0,
      "score": "578730123365711897",
      "tokens_matched": 1,
      "typo_prefix_score": 0
    }
  }
]

Lukas Matejka

02/24/2025, 11:05 AM

Is this response ^^ for bucket_size:3?

Fanis Tharropoulos

02/24/2025, 11:05 AM

This is specifically for `bucket_size: 2`

Lukas Matejka

02/24/2025, 11:06 AM

I see, I get the same result. Please try exactly what I reported in the examples above, i.e. `bucket_size:3`

Fanis Tharropoulos

02/24/2025, 11:09 AM

Huh, yeah. It seems as though it's ignoring the weights for the query_by parameters. Could be something else as well while sorting. Using `_text_match(buckets: 3):desc` will net you the same results as I said about `bucket_size: 2`, since it's 6 / 3 = 2

Lukas Matejka

02/24/2025, 11:14 AM

Well, I'm not sure I understood. So you confirm it's a bug, though? Regarding "ignoring the weights for the query_by parameters": so the potential issue could be that text_match_score respects the query_by weights, but the sort does not?

Fanis Tharropoulos

02/24/2025, 11:42 AM

It may be something else, I'll have to debug. From what I see here, this seems to be a bug, as less-scored documents appear to be further up the top.

Lukas Matejka

02/24/2025, 1:30 PM

OK, thank you very much. This is quite critical for our sorting now, so I would appreciate it if you let me know how the debugging goes. Thanks

Kishore Nallan

02/24/2025, 3:31 PM

Without any bucketing, text match scores look like this:

0 -> "score": "578730123365711913",
1 -> "score": "578730123365711913",
2 -> "score": "578730123365711913",
3 -> "score": "578730123365711913",
4 -> "score": "578730123365711897",
5 -> "score": "578730123365711897",

With `bucket_size: 3`, scores are grouped into the `[0, 1, 2]` and `[3, 4, 5]` indices. Within each group, we pick the first record's text match score as the score for the entire group. Since index 0 and 3 have the same score of `578730123365711913`, both the buckets end up with the same text match score. After assigning the anchor score, we sort all the documents on this anchor text match score, so the documents get sorted by popularity. This is the intended way of the bucketing logic: the goal is to fuzz the text match scores such that there is a gradual transition in text match ranking. However, with small result sets like this one with a sharp change, this can lead to behavior like this.
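
The anchor-score logic described here can be replayed on the six sample hits with a short sketch. This is a plain-Python model of the explanation, not the actual Typesense code; the function name `anchor_bucketed_ids` is made up, while the ids, scores and popularity values come from the hits posted earlier in the thread.

```python
# Hits in their unbucketed rank order: (id, text_match, recent_popularity).
RANKED = [
    ("3", 578730123365711913, 13),
    ("2", 578730123365711913, 12),
    ("1", 578730123365711913, 11),
    ("0", 578730123365711913, 10),
    ("5", 578730123365711897, 1000),
    ("4", 578730123365711897, 100),
]


def anchor_bucketed_ids(ranked, bucket_size):
    """Model of the 28.0 behaviour: every bucket takes its first hit's score."""
    rescored = []
    for start in range(0, len(ranked), bucket_size):
        bucket = ranked[start:start + bucket_size]
        anchor = bucket[0][1]  # the first record's score stands in for the whole bucket
        rescored.extend((anchor, popularity, doc_id) for doc_id, _, popularity in bucket)
    # Sort all hits on (anchor score, popularity), both descending.
    rescored.sort(key=lambda item: (item[0], item[1]), reverse=True)
    return [doc_id for _, _, doc_id in rescored]


print(anchor_bucketed_ids(RANKED, 3))  # ['5', '4', '3', '2', '1', '0']
print(anchor_bucketed_ids(RANKED, 2))  # ['3', '2', '1', '0', '5', '4']
```

With a bucket size of 3 both buckets share one anchor, so popularity alone decides the order; with a bucket size of 2 the third bucket gets the lower anchor, and the order matches the hits posted for `bucket_size: 2`.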

Kishore Nallan

02/24/2025, 3:35 PM

But I can see why this can be a problem. Thinking of a way we can address it.

Lukas Matejka

02/25/2025, 8:56 AM

@Kishore Nallan Thank you for the great explanation (it might be worth adding it to the docs as well). I've checked a real-world case and can confirm this behaviour. However, I think this is also an issue for real-world datasets. Let me explain: we have a large product catalogue where many items match `query=shoe` with the same `_text_match_score`, as the catalogue has many shoes (let's say X hundred). After hundreds of results there are other products with a lower score, and they can effectively jump (if a secondary criterion like popularity is high enough) from 150th position to 4th position when setting `bucket_size:3`, as the anchor score of the 50th bucket (where the decline in score happened) will be the same as the anchor score of the 2nd bucket. I think this is not an edge case but a typical situation for us. Maybe a quick solution would be to check whether all scores in a bucket are the same and, if not, to not allow items from this bucket to jump higher in the ranking.
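
A rough way to see which hits are affected in a catalogue like this: under the first-record anchor logic described earlier in the thread, a hit can only be promoted if its own score is lower than the first score in its bucket. The sketch below lists those positions. The function name and the counts (148 equal-score shoes followed by 20 lower-scored items) are hypothetical.

```python
def leapfrog_candidates(scores, bucket_size):
    """Return 0-based positions of hits scoring below the first hit in their bucket.

    `scores` must already be sorted by text match score, highest first.
    """
    candidates = []
    for start in range(0, len(scores), bucket_size):
        anchor = scores[start]
        end = min(start + bucket_size, len(scores))
        for position in range(start, end):
            if scores[position] < anchor:
                candidates.append(position)
    return candidates


# Hypothetical catalogue: 148 shoes with the same score, then 20 weaker matches.
scores = [10] * 148 + [5] * 20
print(leapfrog_candidates(scores, bucket_size=3))  # [148, 149]
```

Only the hits that share the boundary bucket with a higher-scored hit inherit its anchor. If the score drop happens to fall exactly on a bucket boundary (for example 150 equal-score shoes with a bucket size of 3), the list is empty. The same per-bucket comparison is what the "are all scores in the bucket the same" check suggested above would need.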

Fanis Tharropoulos

02/25/2025, 9:00 AM

The bucketing has just been tweaked to address this. It will soon be available in an RC build

🙌 2

Lukas Matejka

02/25/2025, 9:04 AM

@Fanis Tharropoulos wow, great. Can you please share how exactly you tweaked it?

Fanis Tharropoulos

02/25/2025, 9:05 AM

Here's the commit in question: https://github.com/typesense/typesense/commit/fadc96b8bbf58f388544df08b31ea2ebceb09a95

👍 1

Kishore Nallan

02/25/2025, 9:13 AM

Instead of using the first record's text match score as the anchor score within a bucket, we are just picking a number that's sequentially increasing for each bucket. This way all documents in a given bucket will never jump ahead of an earlier bucket.
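
A minimal sketch of this idea, assuming the bucket's position is used as its sort key (the exact numbers used in the commit may differ). It is a plain-Python model, not the patched Typesense code; the function name is made up and the sample hits are the six documents from earlier in the thread.

```python
# Hits in their unbucketed rank order: (id, text_match, recent_popularity).
RANKED = [
    ("3", 578730123365711913, 13),
    ("2", 578730123365711913, 12),
    ("1", 578730123365711913, 11),
    ("0", 578730123365711913, 10),
    ("5", 578730123365711897, 1000),
    ("4", 578730123365711897, 100),
]


def sequential_bucketed_ids(ranked, bucket_size):
    """Give every bucket its own running number so buckets can never interleave."""
    keyed = [
        (position // bucket_size, -popularity, doc_id)
        for position, (doc_id, _, popularity) in enumerate(ranked)
    ]
    # Earlier bucket first; inside a bucket, higher popularity first.
    keyed.sort(key=lambda item: (item[0], item[1]))
    return [doc_id for _, _, doc_id in keyed]


print(sequential_bucketed_ids(RANKED, 3))  # ['3', '2', '1', '5', '4', '0']
```

The two popular documents now lead only their own bucket: they can overtake document 0, which shares that bucket, but not the three documents in the first bucket.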

Lukas Matejka

02/25/2025, 2:39 PM

I see, this is a clear, expected and completely fine solution. (I was just wondering whether, for search and ranking purposes, it would be better to check whether all scores in a bucket are the same and, if so, treat it as in your original implementation, i.e. allow items to jump over multiple buckets since the scores are the same, and if a bucket does not have all the same scores, not allow that -> the price of this solution is probably more complexity and more compute time...)

Kishore Nallan

02/26/2025, 5:57 AM

For now, wanted to keep this simpler to reason about. I've published `29.0.rc3` that contains the fix.

👍 1

Lukas Matejka

02/26/2025, 2:04 PM

Thanks, we're going to deploy into our develop env today and check it.

👍 1
