Hello Team First of all thank you for the great search engine typesense #community-help

Pavel Koroteev

01/17/2025, 3:53 PM

Hello Team, first of all, thank you for the great search engine, Typesense! I am using it for documentation search, and I have some benchmarks: records of search terms with the positions of the results that users clicked. I saw a GitHub message somewhere about stemming improvements, so I tried stemming on version 28.0.rc32. I want to share my results with you in this thread.
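A benchmark of this kind (search term plus the position of the clicked result) could be compared before and after enabling stemming roughly as follows. This is only a sketch: the thread does not say which metric was used, so mean reciprocal rank is an assumption, and the function names and the sample numbers in the usage note are hypothetical, not the actual benchmark data.

```python
from statistics import mean


def mean_reciprocal_rank(clicked_positions):
    """Average of 1/position over all queries.

    Positions are 1-based; None means the clicked document
    was not returned at all and contributes 0.
    """
    return mean(0.0 if pos is None else 1.0 / pos for pos in clicked_positions)


def regressions(before, after):
    """Return the search terms whose clicked result moved down (or disappeared)."""
    worse = []
    for term, old in before.items():
        new = after.get(term)
        old_rank = float("inf") if old is None else old
        new_rank = float("inf") if new is None else new
        if new_rank > old_rank:
            worse.append(term)
    return sorted(worse)
```

For example, with made-up positions `before = {"pulse": 1, "adaptive layouts": 4}` and `after = {"pulse": 5, "adaptive layouts": 1}`, `regressions(before, after)` lists only `pulse`, which makes it easy to review the affected terms one by one.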

Pavel Koroteev

01/17/2025, 3:55 PM

search term | comments
----------------------------
• `universal search`. It's the name of my docs search, so I don't actually want it stemmed. It was stemmed to `uni` and matched words like `unit`, `united`, etc.
• `rhy`. We have an internal service named `rhythm`, so before stemming it was found by prefix search. Now it is stemmed to `rhi`, so it finds other results (`rhino`) but not `rhythm`.
• `ingres grpc`. Without stemming, Typesense corrects the typo to `ingress` and finds many related results for the k8s concept and the gRPC framework. With stemming, the stem is `ingr`, which is too many typos away from `ingress`, so the results are worse.
• `pulse`. We have a service with exactly this name, so before stemming the service was in the top position thanks to the `exact match` mechanism. Now it is stemmed to `puls`, and typo tolerance finds unrelated results because the word `pull` is one typo away.
• `add new limit`. It's a straightforward query. Before, it had good results because we have a doc on the topic. Now it matches words like adding, added, and so on, and the results are a bit worse.
• `adaptive layouts`. Here we have a great improvement, because of `layouts` → `layout` and `adaptive` → `adapt`.
• `deployment times`. Here we have an improvement, because of `times` → `time`.
• `coffee`. It's our internal term, so before it was found by `exact match` or similar. Now it is stemmed to `coffe` and has poor results.
• `sops`. We use this tool, so before it was found by exact match. Now it is stemmed to `sop` and has poor results.
• `apps responsible`. I guess the search intent is clear: the user wants to find who is `responsible` for each `app`. So the only stemming needed is perhaps `apps` → `app`, but we also get `responsible` → `respons`, so overall the results are worse.
• `localization`. This is a pretty straightforward search term, but it is stemmed to `local`, so the results are much worse.
• `analy`. It's an incomplete form of `analytics`. Before stemming it was handled by prefix search, but now it is stemmed to `anali` and has poor results.
• `multisession`. Better, because `multisession` now also matches `multisessions`.
• `cookie`. Better, because `cookie` now also matches `cookies`.

Pavel Koroteev

01/17/2025, 3:57 PM

I don’t want to decrease search quality, so I am postponing enabling stemming for now. Evaluating the results, I see that the positive scenarios are all about pluralization. I would definitely use pluralization instead of stemming if that were possible. On the other hand, I can see that for queries with only a few keywords stemming isn’t useful in our case, because we already have typo tolerance, prefix search, and other parameters that improve our search quality. For multi-word search terms it may be better to just use semantic search; we’ll try to evaluate that later.

Kishore Nallan

01/17/2025, 4:26 PM

We recently added support for dictionary based stemming / pluralization here: https://github.com/typesense/typesense/pull/2062 You can essentially upload your own custom file and we can use that to do the normalization of words.

🥳 1

😍 2
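For the pluralization-only cases from the results above, a custom dictionary file could be generated with a small script like this one. It is a minimal sketch under assumptions: the JSON Lines layout with `word` and `root` keys is my recollection of the dictionary format and should be checked against the linked PR, the helper name `to_jsonl` is hypothetical, and the upload step itself is left out.

```python
import json

# Plural -> singular pairs taken from the search terms discussed in this thread.
PLURALS = {
    "layouts": "layout",
    "times": "time",
    "apps": "app",
    "multisessions": "multisession",
    "cookies": "cookie",
}


def to_jsonl(mapping):
    """Serialize a word -> root mapping as one JSON object per line.

    The "word"/"root" key names are an assumption about the dictionary
    format; verify them before uploading the file.
    """
    return "\n".join(
        json.dumps({"word": word, "root": root})
        for word, root in sorted(mapping.items())
    )


if __name__ == "__main__":
    print(to_jsonl(PLURALS))
```

Because only the listed words are normalized, terms such as `pulse`, `coffee`, or `universal` would stay untouched and keep their exact-match and prefix behavior.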

Pavel Koroteev

01/17/2025, 4:29 PM

Wow, thank you! That looks like exactly what I need. Sorry for the disturbance; I tried to google it but failed. Anyway, I’ll try it. Have a great weekend!

Kishore Nallan

01/17/2025, 4:47 PM

No worries, it's a new feature and not yet available in a GA release. You have to use an RC build like `28.0.rc35`.

🙌 1

Kishore Nallan

01/17/2025, 4:48 PM

It's probably available in `28.0.rc32` as well.

👍 1

Dima

01/20/2025, 10:43 AM

Kishore, just to double check, does Typesense use Snowball stemming? I’m not sure why `universal` stems to `uni` instead of `univers` 🤔 The Snowball demo shows different results.

Kishore Nallan

01/20/2025, 10:46 AM

Yes we do use snowball. Not sure why it's different on the CPP library.

Kishore Nallan

01/20/2025, 10:47 AM

IMO with the dictionary based stemming, it's far better to just use that.

👌 1
