I am running a lot of integration tests using a small identi typesense #community-help

SamHendley

10/21/2022, 8:33 PM

I am running a lot of integration tests using a small, identical data set and identical queries. I noticed that I occasionally get slightly different counts (off by one) for some facet values in the search results. This got me curious, so I did a semi-exhaustive check where I compared each facet count returned against the “found” value of a query filtered to just that facet value. For the vast majority of the checked cases the counts are identical. For a few particular facets the counts for many, but not all, values are off. Almost always the “found” amount is less than the facet count; usually it is around 50-60% of the facet count value. Interestingly, the original facet and value that drew my attention is the biggest outlier, where the facet count is much smaller than the “found” count. Is this something expected or known? Otherwise I’m guessing it must be something about how I am preparing my documents for search, but I can’t find anything special about these few values that makes them different from the others. Any guesses about what could cause this? The facets with issues are all string[] arrays. I have a suspicion it might be related to upper/lower casing; does that seem possible? I am using Typesense 0.23.1 on a single node running locally in Docker.

SamHendley

10/21/2022, 8:35 PM

Other than that slight difference in facet count, the rest of the differences are reproducible when starting from an empty index and reloading.

SamHendley

10/21/2022, 8:57 PM

Of course, right after asking this I found the answer. Yes, it is due to having facet values that are similar but differ in case or even punctuation. It boils down to having this in the indexed data:

"bad_facet": [
    "More_words",
    "More_Words",
    "MoreWords"
],

Even with only a single document in the index, the facet count for More_words is 3, which is very surprising.

SamHendley

10/21/2022, 8:58 PM

I’m guessing that permuting the order of those entries would control which one is reported as the “real one”.

Kishore Nallan

10/22/2022, 2:46 AM

Typesense "normalizes" words before indexing them, so words that differ only in special characters or upper/lower case become the same.

SamHendley

10/24/2022, 8:20 PM

I guess my confusion was that I thought the facet count was a count of documents with at least one matching facet value, but it’s actually a count of the number of times the facet value is stored across all documents. If users make sure that each document can only contain a given facet value once, it works out to the same thing, but that is hard to guarantee if I don’t know the normalization rules being used.

Kishore Nallan

10/25/2022, 1:55 AM

This only happens on an array field, right? We treat each array element as an independent value. However, if you had those three words in a plain text field, they would be counted as a single value.
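The gap between the facet count and the “found” value can be illustrated with a toy model. This is only a sketch of the behavior described in this thread, not Typesense's implementation: `normalize`, `facet_counts`, and `found_count` are hypothetical helpers, and the normalizer simply assumes lowercasing plus dropping non-alphanumeric characters.

```python
from collections import Counter


def normalize(value):
    # Assumption: lowercase, split on spaces, drop non-alphanumeric characters.
    tokens = ("".join(ch for ch in t.lower() if ch.isalnum()) for t in value.split(" "))
    return " ".join(t for t in tokens if t)


def facet_counts(docs, field):
    # Count every array element independently, keyed by its normalized form.
    counts = Counter()
    display = {}  # normalized key -> first raw spelling encountered
    for doc in docs:
        for value in doc.get(field, []):
            key = normalize(value)
            counts[key] += 1
            display.setdefault(key, value)
    return {display[key]: n for key, n in counts.items()}


def found_count(docs, field, value):
    # Count documents that contain at least one matching element.
    key = normalize(value)
    return sum(1 for doc in docs if any(normalize(v) == key for v in doc.get(field, [])))


docs = [{"bad_facet": ["More_words", "More_Words", "MoreWords"]}]
counts = facet_counts(docs, "bad_facet")          # {"More_words": 3}
found = found_count(docs, "bad_facet", "More_words")  # 1
```

In this model a single document contributes three times to the facet count but only once to “found”, which matches the count of 3 reported above for an index holding one document.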

SamHendley

10/25/2022, 12:18 PM

Yes, this was an array. Is there documentation on the normalization rules you apply, so I can precondition my input to match what Typesense is going to do?

Kishore Nallan

10/25/2022, 12:24 PM

It's not documented, but broadly, for English, we split by space, lowercase all characters, and remove any special characters that are not part of the symbols_to_index collection parameter.
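A rough approximation of those rules might look like the following. This is a sketch only: the rules are stated to be undocumented, `normalize_facet_value` is a hypothetical helper, and treating "special character" as "anything that is not a letter or digit" is an assumption.

```python
def normalize_facet_value(value, symbols_to_index=()):
    # Assumption: a "special character" is anything that is not a letter or
    # digit, unless it is listed in symbols_to_index.
    keep = set(symbols_to_index)
    tokens = []
    for token in value.split(" "):  # split by space
        cleaned = "".join(
            ch for ch in token.lower()  # lowercase all characters
            if ch.isalnum() or ch in keep
        )
        if cleaned:
            tokens.append(cleaned)
    return " ".join(tokens)


# All three spellings from the example collapse to the same key.
keys = [normalize_facet_value(v) for v in ["More_words", "More_Words", "MoreWords"]]
# keys == ["morewords", "morewords", "morewords"]

# Keeping "_" separates the underscore variants from "MoreWords".
kept = [normalize_facet_value(v, ["_"]) for v in ["More_words", "More_Words", "MoreWords"]]
# kept == ["more_words", "more_words", "morewords"]
```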

SamHendley

10/25/2022, 12:26 PM

How do you pick which string to use as the ‘displayed’ (non-normalized) string? Is it just a lexicographic minimum/maximum or something like that? I had thought it was the first one encountered, but that might be unstable with multiple nodes.

Kishore Nallan

10/25/2022, 12:29 PM

We just pick the first document we encounter from the result set and use the value from that.
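One way to precondition input so that the count and the displayed string stay predictable is to drop array elements that would collapse to the same normalized key before indexing. The sketch below assumes the approximate rules described in this thread; `normalize` and `dedupe_facet_values` are hypothetical helpers, not part of any Typesense client.

```python
def normalize(value, symbols_to_index=()):
    # Assumption: lowercase, split on spaces, and drop characters that are
    # neither alphanumeric nor listed in symbols_to_index.
    keep = set(symbols_to_index)
    tokens = (
        "".join(ch for ch in t.lower() if ch.isalnum() or ch in keep)
        for t in value.split(" ")
    )
    return " ".join(t for t in tokens if t)


def dedupe_facet_values(values, symbols_to_index=()):
    # Keep only the first spelling seen for each normalized key.
    seen = set()
    kept = []
    for value in values:
        key = normalize(value, symbols_to_index)
        if key not in seen:
            seen.add(key)
            kept.append(value)
    return kept


cleaned = dedupe_facet_values(["More_words", "More_Words", "MoreWords"])
# cleaned == ["More_words"]
```

Deduplicating within a document only makes the facet count line up with the number of matching documents. Different documents can still use different spellings, and which spelling is displayed would then depend on which document is encountered first.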

SamHendley

10/25/2022, 12:36 PM

Ok. Thanks for the help (as always).
