#community-help

Facet Count Discrepancy in Integration Testing with Typesense

TLDR SamHendley faced an issue with inconsistent facet counts and found it was due to mixed-case input and punctuation. Kishore Nallan confirmed that Typesense normalizes values before indexing, which affects the facet counts. They then discussed how to precondition input data to match Typesense's normalization.

Solved
Oct 21, 2022 (11 months ago)
SamHendley
08:33 PM
I am running a lot of integration tests using a small identical data set and identical queries. I noticed that I occasionally get slightly different counts (off by one) for some facet values in the search results. This got me curious, so I did a semi-exhaustive check where I compared each returned facet count against the “found” value of a query filtered to just those parameters. For the vast majority of the checked cases the counts are identical. For a few particular facets the counts for many, but not all, values are off. Almost always the “found” amount is less than the facet count; usually “found” is around 50-60% of the facet count value. Interestingly, the original facet+value that drew my attention is the biggest outlier, where the facet count is much smaller than the “found” count. Is this something expected/known? Otherwise I’m guessing it must be something about how I am preparing my documents for search, but I can’t find anything special about these few values that makes them different from the others. Any guesses about what could cause this? The facets with issues are all string[] arrays. I have a suspicion it might be related to upper/lower casing. Does that seem possible?
I am using Typesense 0.23.1 and a single node running locally in Docker.
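As a rough illustration of the check described above, here is a minimal sketch using the Typesense Python client. The collection name, the query_by field, and the connection details are placeholders rather than the actual test setup; only the comparison logic is the point.

import typesense

# Placeholder connection details for a local single-node setup.
client = typesense.Client({
    "api_key": "xyz",
    "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
    "connection_timeout_seconds": 2,
})

COLLECTION = "test_docs"   # hypothetical collection name
FACET_FIELD = "bad_facet"  # one of the string[] facet fields

# 1. Fetch facet counts over the whole data set.
faceted = client.collections[COLLECTION].documents.search({
    "q": "*",
    "query_by": "title",   # hypothetical text field
    "facet_by": FACET_FIELD,
    "max_facet_values": 250,
})

# 2. For each reported facet value, run the same search filtered to that
#    exact value and compare "found" with the facet count.
for facet in faceted["facet_counts"]:
    if facet["field_name"] != FACET_FIELD:
        continue
    for entry in facet["counts"]:
        value, facet_count = entry["value"], entry["count"]
        # Values containing commas or other special characters may need
        # extra escaping in filter_by.
        filtered = client.collections[COLLECTION].documents.search({
            "q": "*",
            "query_by": "title",
            "filter_by": f"{FACET_FIELD}:={value}",
        })
        if filtered["found"] != facet_count:
            print(f"{value!r}: facet count {facet_count}, found {filtered['found']}")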
SamHendley
08:35 PM
Other than that slight difference in facet count, the rest of the differences are reproducible with an empty index and reloading the data.
SamHendley
08:57 PM
Of course, right after asking this I find the answer. Yes, it is due to having facet values that are similar but differ in case and even punctuation. It boils down to having this in the indexed data:
"bad_facet": [
    "More_words",
    "More_Words",
    "MoreWords",
  ], 

Even with only a single document in the index, the facet count for More_words is 3, which is very surprising.
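For reference, the surprise is easy to reproduce. A minimal sketch with the Typesense Python client (the collection name and title field are made up for the example):

import typesense

client = typesense.Client({
    "api_key": "xyz",
    "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
    "connection_timeout_seconds": 2,
})

# Hypothetical minimal collection: a text field to query plus the
# facetable string[] field from the snippet above.
client.collections.create({
    "name": "facet_repro",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "bad_facet", "type": "string[]", "facet": True},
    ],
})

# Index a single document whose array elements differ only in case
# and punctuation.
client.collections["facet_repro"].documents.create({
    "title": "only document",
    "bad_facet": ["More_words", "More_Words", "MoreWords"],
})

res = client.collections["facet_repro"].documents.search({
    "q": "*",
    "query_by": "title",
    "facet_by": "bad_facet",
})

# All three elements normalize to the same value, so the single facet
# entry reports a count of 3 even though there is only one document.
print(res["facet_counts"][0]["counts"])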
SamHendley
08:58 PM
I’m guessing that if I permute the order of those entries, it would control which record is reported as the “real one”.
Oct 22, 2022 (11 months ago)
Kishore Nallan
02:46 AM
Typesense "normalizes" words before indexing them, so words that differ only in special characters or upper/lower case become the same.
Oct 24, 2022 (11 months ago)
SamHendley
08:20 PM
I guess my confusion was that I was thinking the facet count was a count of documents with at least one matching facet value, but it’s actually a count of the number of times the value is stored across all documents. If users make sure there can only be a single facet value per document it works out to the same thing, but that is hard to guarantee if I don’t know the normalization rules that are being used.
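The distinction can be made concrete with a small sketch. The normalizer here is only a rough approximation of Typesense's undocumented rules, and the documents are toy data:

import re

def normalize(value: str) -> str:
    # Rough stand-in for Typesense's normalization: lower-case and drop
    # characters that are not letters, digits, or spaces.
    return re.sub(r"[^a-z0-9 ]", "", value.lower())

docs = [
    {"bad_facet": ["More_words", "More_Words", "MoreWords"]},
    {"bad_facet": ["Other_value"]},
]

target = normalize("More_words")

# Interpretation 1: number of documents with at least one matching value.
doc_count = sum(
    any(normalize(v) == target for v in d["bad_facet"]) for d in docs
)

# Interpretation 2: number of matching array elements across all
# documents, which is what the facet count reflects in this thread.
occurrence_count = sum(
    sum(normalize(v) == target for v in d["bad_facet"]) for d in docs
)

print(doc_count, occurrence_count)  # 1 vs 3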
Oct 25, 2022 (11 months ago)
Kishore Nallan
01:55 AM
This only happens on an array field, right? We treat each array element as an independent value. However, if you had those three words in a plain text field, they would be counted as a single value.
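A small illustration of that distinction, using the same (made-up) field name for both document shapes:

# Array field: each element is an independent facet value, so after
# normalization the three variants collapse into one value with a
# count of 3 for this single document.
doc_with_array_field = {
    "bad_facet": ["More_words", "More_Words", "MoreWords"],
}

# Plain string field: the whole field is one facet value, so the same
# text contributes a count of 1.
doc_with_string_field = {
    "bad_facet": "More_words More_Words MoreWords",
}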
SamHendley
12:18 PM
Yes, this was an array. Is there documentation on the normalization rules you apply, so I can precondition my input to match what Typesense is going to do?
Kishore Nallan
12:24 PM
It's not documented, but broadly, for English we split by space, lower-case all characters, and remove any special characters that aren't part of the symbols_to_index collection parameter.
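Based on that description, a best-effort way to precondition facet values before indexing might look like the sketch below. The exact rules are not documented, so this is an approximation rather than Typesense's actual implementation:

import re

def precondition_facet_value(value: str, symbols_to_index: str = "") -> str:
    """Approximate Typesense's normalization for a single facet value:
    lower-case everything and drop special characters that are not in
    the collection's symbols_to_index parameter. Best-effort sketch,
    not the exact (undocumented) rules."""
    keep = re.escape(symbols_to_index)
    # Keep letters, digits, whitespace, and any explicitly indexed symbols.
    return re.sub(rf"[^a-z0-9\s{keep}]", "", value.lower())

# Applying this before indexing makes the three variants identical, so
# they no longer inflate the facet count for a single document.
values = ["More_words", "More_Words", "MoreWords"]
print({precondition_facet_value(v) for v in values})  # {'morewords'}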
SamHendley
12:26 PM
How do you pick which string to choose as the ‘displayed’ string (non-normalized)? Is it just a lexicographic minimum/maximum or something like that? I had thought first-encountered, but that might be unstable with multiple nodes.
Kishore Nallan
12:29 PM
We just pick the first document we encounter from the result set and use the value from that.
SamHendley
12:36 PM
Ok. Thanks for the help (as always).