steffen
04/30/2025, 2:14 PM
instantsearch, connected through the typesense-instantsearch-adapter.
Scenario: Users run semantic searches that differ only in capitalization and compare the results.
User feedback: "scrum master" and "Scrum master" mean essentially the same thing. Capitalization in a search query shouldn't matter in most situations (especially in English; other languages might differ, but that is probably a different story).
User expectation: Both search queries should return the same semantic results.
Now to the controversy: Apparently the embeddings for those two search terms are different, and therefore the vector distances and the resulting ranking differ, too. To fulfill the client's request, the question arose of where to manipulate the query so that the same results are returned.
1. A simple, naive solution would be to manipulate the query in the UI already. Drawback: we are messing with uiState and also affecting the routing (the user types in ?q=Scrum+master, the URL will say ?q=scrum+master, and as soon as the user reloads the page with that URL, the query shown will be lower-cased as well). Additionally, this moves the decision about how the search engine should behave into the frontend rather than the engine itself.
2. A better solution might be to modify the query on the "server side". Is there any option to configure or manage the tokenizer (or, better, the normalizer) used by the auto-embedding so that queries are lower-cased first? (This would come in handy, because trailing whitespace in the query also causes different vector search results, which throws users off.)
3. Overall, the engineering team understands semantic search and why the results differ. However, the user has a very specific expectation, which we'd like to meet as closely as possible, if that is possible at all. Maybe this is such an edge case that user education is all we need.
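
For illustration, here is a rough sketch of a middle ground between options 1 and 2: normalizing the query at the search-client boundary, so that uiState and the URL keep whatever the user typed. It assumes the search client receives InstantSearch-style requests of the form { indexName, params: { query } }; the names SearchRequest, SearchClientLike, normalizeQuery and withNormalizedQueries are made up for this example and are not part of instantsearch or the adapter.

```typescript
// Hypothetical minimal shape of an InstantSearch-style request (assumption:
// the query string travels in params.query).
interface SearchRequest {
  indexName: string;
  params?: { query?: string; [key: string]: unknown };
}

// Hypothetical minimal shape of a search client: only search() is needed here.
interface SearchClientLike<R> {
  search(requests: SearchRequest[]): Promise<R>;
}

// Trim surrounding whitespace and lower-case the query. toLowerCase() is not
// locale-aware, which may be wrong for some non-English content.
function normalizeQuery(query: string): string {
  return query.trim().toLowerCase();
}

// Wrap a client so every outgoing request carries a normalized query.
// The original request objects are left untouched.
function withNormalizedQueries<R>(client: SearchClientLike<R>): SearchClientLike<R> {
  return {
    search(requests) {
      const normalized = requests.map((request) => {
        const query = request.params?.query;
        if (typeof query !== "string") return request;
        return {
          ...request,
          params: { ...request.params, query: normalizeQuery(query) },
        };
      });
      return client.search(normalized);
    },
  };
}
```

In an app, the wrapped object would be passed wherever the adapter's searchClient is passed today. This still keeps the logic in the frontend bundle, so it only addresses the uiState and routing drawback of option 1, not the wish for an engine-side setting.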