# community-help
p
Hello Team, First of all, thank you for the great search engine! I am using Typesense for documentation search, and I have some benchmarks: records of search terms with the positions of the results that users clicked. I saw a GitHub message about a stemming improvement somewhere, so I tried stemming on version 28.0.rc32. I want to share my results with you in this thread.
search term | comments
----------------------------
• `universal search`. It's the name of my docs search, so I actually don't want it stemmed. It was stemmed to `uni` and matched words like `unit`, `united`, etc.
• `rhy`. We have an internal service named `rhythm`, so before stemming it was found by prefix search. Now it is stemmed to `rhi`, so it finds other results (`rhino`) but not `rhythm`.
• `ingres grpc`. Without stemming, Typesense corrects the typo to `ingress` and finds many related results for the correct name of the k8s concept and the gRPC framework. With stemming it gets the stem `ingr`, which is too many typos away from `ingress`, so the results are worse.
• `pulse`. We have a service with exactly this name, so before stemming the service was in the top position thanks to the `exact match` mechanism. Now it is stemmed to `puls`, and typo tolerance finds unrelated results because the word `pull` is one typo away.
• `add new limit`. It's something straightforward. Before, it had good results because we have a doc on the topic. Now it finds words like adding, added, and so on, and the results are a bit worse.
• `adaptive layouts`. Here we have great improvements because of `layouts` -> `layout` and `adaptive` -> `adapt`.
• `deployment times`. Here we have an improvement because of `times` -> `time`.
• `coffee`. It's our internal term, so before it was found by `exact match` or similar. Now it is stemmed to `coffe` and has poor results.
• `sops`. We use this tool, so before it was found by exact match. Now it is stemmed to `sop` and has poor results.
• `apps responsible`. I guess the search intent is clear: the user wants to find who is `responsible` for each `app`. So the only stemming needed is maybe `apps` -> `app`, but it also applies `responsible` -> `respons`, so overall the results are worse.
• `localization`. This is a pretty straightforward search term, but it's stemmed to `local`, so the results are much worse.
• `analy`. It's an incomplete form of `analytics`. Before stemming it was handled by prefix search, but now it's stemmed to `anali` and has poor results.
• `multisession`. Better, because `multisession` -> `multisessions`.
• `cookie`. Better, because `cookie` -> `cookies`.
I don't want to decrease search quality, so I am postponing the idea of enabling stemming for now. When I evaluate the results, I see that the positive scenarios are all about pluralization. I would definitely use pluralization instead of stemming if that were possible. On the other hand, I can see that for queries with a few keywords stemming isn't useful in our case, because we have typo tolerance, prefix search, and other parameters that already improve our search quality. For multi-word search terms it may be better to just use semantic search; we'll try to evaluate that later.
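The `pulse` and `ingres grpc` cases above come down to edit distance: stemming the query moves it closer to unrelated words and further from the intended one. The sketch below uses plain Levenshtein distance to illustrate this. It is only an approximation for illustration; Typesense's actual typo-tolerance logic is more involved, and the `levenshtein` helper is a hypothetical name, not part of any Typesense API.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


# Query/stem pairs taken from the report above.
print(levenshtein("pulse", "pull"))      # 2: the unstemmed query is two typos from "pull"
print(levenshtein("puls", "pull"))       # 1: the stem is one typo from "pull"
print(levenshtein("ingres", "ingress"))  # 1: the unstemmed query is one typo from "ingress"
print(levenshtein("ingr", "ingress"))    # 3: the stem is three typos from "ingress"
```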
k
We recently added support for dictionary-based stemming / pluralization here: https://github.com/typesense/typesense/pull/2062 You can essentially upload your own custom file, and we will use it to normalize words.
🥳 1
😍 2
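As a rough illustration of what such a custom file could contain for the pluralization-only cases reported above, here is a minimal sketch that builds a word-to-root dictionary. The JSON Lines layout with `word` and `root` keys is an assumption, as are the helper names and the file name; check the linked PR for the exact format and for how the file is uploaded.

```python
import json

# Plural -> singular pairs drawn from the search terms in this thread.
PLURALS = {
    "layouts": "layout",
    "times": "time",
    "apps": "app",
    "cookies": "cookie",
    "multisessions": "multisession",
}


def build_dictionary_lines(pairs):
    """Return one JSON object per line, mapping each word to its root.

    The "word"/"root" key names are an assumed format, not confirmed here.
    """
    return [
        json.dumps({"word": word, "root": root})
        for word, root in sorted(pairs.items())
    ]


def write_dictionary(path, pairs):
    """Write the dictionary as a JSON Lines file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(build_dictionary_lines(pairs)) + "\n")


if __name__ == "__main__":
    write_dictionary("plurals.jsonl", PLURALS)
```

Because only the listed words are normalized, terms such as `universal`, `coffee`, or `sops` would be left untouched, which is the behavior asked for above.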
p
Wow, thank you! That looks like exactly what I need. Sorry for the disturbance; I tried to google it but couldn't find it. Anyway, I'll try it. Have a great weekend!
k
No worries, it's a new feature and not yet available in a GA release. You have to use an RC build like `28.0.rc35`.
🙌 1
It's probably available in `28.0.rc32` as well.
👍 1
d
Kishore, just to double-check, does Typesense use Snowball stemming? I'm not sure why `universal` stems to `uni` instead of `univers` 🤔 The Snowball demo shows different results.
k
Yes, we do use Snowball. Not sure why it's different in the C++ library.
IMO, now that dictionary-based stemming is available, it's far better to just use that.
👌 1