I am running a lot of integration tests using a sm...
# community-help
s
I am running a lot of integration tests using a small, identical data set and identical queries. I noticed that I occasionally get slightly different counts (off by one) for some facet values in the search results. This got me curious, so I did a semi-exhaustive check in which I compared each returned facet count with the “found” value of a query filtered to just that facet and value. For the vast majority of the checked cases the counts are identical. For a few particular facets, the counts for many but not all values are off. Almost always the “found” amount is less than the facet count, usually around 50-60% of it. Interestingly, the original facet and value that drew my attention is the biggest outlier, and there the facet count is much smaller than “found”. Is this something expected or known? Otherwise I’m guessing it must be something about how I am preparing my documents for search, but I can’t find anything special about these few values that makes them different from the others. Any guesses about what could cause this? The facets with issues are all string[] arrays. I have a suspicion it might be related to upper/lower casing; does that seem possible? I am using Typesense 0.23.1 on a single node running locally in Docker.
Other than that slight difference in facet count, the rest of the differences are reproducible with an empty index and reloading.
Of course, right after asking this I find the answer. Yes, it is due to having facet values that are similar but differ in case and even punctuation. It boils down to having this in the indexed data:
"bad_facet": [
    "More_words",
    "More_Words",
    "MoreWords",
  ],
Even with only a single document in the index, the facet count for `More_words` is 3, which is very surprising.
I’m guessing if I permute the order of those entries it would control which record is reported as the “real one”
k
Typesense "normalizes" words before indexing them, so words that differ only in special characters or upper/lower case become the same.
s
I guess my confusion was that I thought the facet count was a count of documents with at least one matching facet value, but I guess it’s actually a count of the number of times the facet value is stored across all documents. If users make sure there can only be a single matching facet value per document, it works out to the same thing, but that is hard to guarantee if I don’t know the normalization rules being used.
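To make the two interpretations concrete, here is a minimal Python sketch that counts the same data both ways. It does not call Typesense; the `normalize` helper is a hypothetical stand-in that assumes lowercasing plus dropping non-alphanumeric characters, which may not match the engine exactly.

```python
from collections import Counter


def normalize(value: str) -> str:
    # Assumption: lowercase and drop every non-alphanumeric character.
    # This is only a rough stand-in for whatever the engine really does.
    return "".join(ch for ch in value.lower() if ch.isalnum())


# The single document from the example above.
docs = [{"id": "1", "bad_facet": ["More_words", "More_Words", "MoreWords"]}]

per_value = Counter()     # counts every stored array element
per_document = Counter()  # counts each document at most once per value

for doc in docs:
    normalized = [normalize(v) for v in doc["bad_facet"]]
    per_value.update(normalized)
    per_document.update(set(normalized))

print(per_value["morewords"])     # 3
print(per_document["morewords"])  # 1
```

Under these assumptions, the per-element count reproduces the surprising 3, while a per-document count gives 1, which is what a filtered query's “found” would report.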
k
This only happens on an array field, right? We treat each array element as an independent value. However, if you had those three words in a plain text field, they would be counted as a single value.
s
Yes, this was an array. Is there documentation on the normalization rules you apply, so I can precondition my input to match what Typesense is going to do?
k
It's not documented, but broadly, for English, we split by space, lowercase all characters, and remove any special characters that are not part of the `symbols_to_index` collection parameter.
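As a rough illustration of those rules, the following Python sketch approximates them for preconditioning input. It is not Typesense's actual tokenizer: the helper names are hypothetical, and treating "special character" as "anything non-alphanumeric" is an assumption.

```python
def normalize_token(token: str, symbols_to_index: str = "") -> str:
    # Lowercase, then keep only alphanumerics and explicitly indexed symbols.
    # Assumption: "special character" means anything that is not alphanumeric.
    return "".join(
        ch for ch in token.lower() if ch.isalnum() or ch in symbols_to_index
    )


def normalize_value(value: str, symbols_to_index: str = "") -> list[str]:
    # Split by space first, then normalize each token; drop empty tokens.
    tokens = (normalize_token(t, symbols_to_index) for t in value.split(" "))
    return [t for t in tokens if t]


values = ["More_words", "More_Words", "MoreWords"]

# With no extra symbols indexed, all three values collapse to one form.
collapsed = {normalize_token(v) for v in values}

# If "_" were listed in symbols_to_index, the underscore would survive.
with_underscore = {normalize_token(v, "_") for v in values}

print(collapsed)        # {'morewords'}
print(with_underscore)  # two distinct forms: 'more_words' and 'morewords'
```

Running input through a helper like this before indexing would at least flag values that are likely to collide, though the real behavior should be checked against the server.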
s
How do you pick which string to use as the ‘displayed’ (non-normalized) string? Is it just a lexicographic minimum/maximum or something like that? I had thought first encountered, but that might be unstable with multiple nodes.
k
We just pick the first document we encounter from the result set and use the value from that.
s
Ok. Thanks for the help (as always).