# community-help
h
hi everyone! just realized that Typesense collections only support ISO 639-1 language codes as "locale" codes. This is a MAJOR limitation for enterprise use and is not best practice for locales. For example, I want pt-PT (European Portuguese, as used in Portugal), whose content is totally different from pt-BR (Brazilian Portuguese, as used in Brazil). The proper definition of a locale follows the format <language>-<region>, where:
• Language: A two-letter ISO 639-1 code representing the primary language (e.g., pt for Portuguese, es for Spanish).
• Region: A two-letter ISO 3166-1 alpha-2 country code specifying the regional variant (e.g., PT for Portugal, BR for Brazil).
*my question:* is there any simple known way to force proper locales on a Typesense server so I can store embeddings per locale and enable semantic search on different LOCALES (not just languages)?
for now the most direct workaround I see is separating the collections per locale... 😕 not ideal and a mess to maintain...
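To make the per-locale workaround a bit less messy, the collection name could be derived from the locale instead of being written by hand. This is only a rough sketch: `split_locale`, `collection_name`, and the `articles` base name are hypothetical helpers made up for illustration, not anything Typesense provides, and the lowercase-with-underscores naming is just an assumed convention.

```python
import re

# Assumed locale shape from the message above: <language>-<region>,
# e.g. "pt-PT" (ISO 639-1 language + ISO 3166-1 alpha-2 region).
LOCALE_RE = re.compile(r"^([a-z]{2})-([A-Z]{2})$")


def split_locale(locale: str) -> tuple[str, str]:
    """Split 'pt-PT' into ('pt', 'PT'); reject anything else."""
    match = LOCALE_RE.match(locale)
    if match is None:
        raise ValueError(f"not a <language>-<region> locale: {locale!r}")
    return match.group(1), match.group(2)


def collection_name(base: str, locale: str) -> str:
    """Hypothetical naming convention: one collection per locale."""
    language, region = split_locale(locale)
    return f"{base}_{language}_{region.lower()}"


if __name__ == "__main__":
    for loc in ("pt-PT", "pt-BR"):
        print(loc, "->", collection_name("articles", loc))
```

The language half (`pt`) is the part Typesense's field-level `locale` would understand; the region half only lives in the collection name here.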
LLMs say it might be related to some stemming dictionary or server configuration
f
Typesense’s stemming is language-level. Snowball (the default stemming algorithm used by Typesense) does not offer region-specific stemmers, so pt-PT and pt-BR both stem as Portuguese. Stemming is also not performed by the model that generates the embeddings:
- Stemming dictionaries only affect keyword search when a string field has stem set to true
- They do not touch embeddings
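As a rough illustration of that split, here is how a schema might look as a plain Python dict: `stem` sits on the keyword field, while the embedding field is defined separately and never sees it. Treat everything here as an assumption: `build_schema` is a hypothetical helper, the field names are made up, and `ts/multilingual-e5-small` is just an example model name that may not suit your content.

```python
def build_schema(name: str, language: str) -> dict:
    """Build a collection schema dict (nothing is sent to a server here)."""
    return {
        "name": name,
        "fields": [
            # Keyword side: stemming applies here and is language-level,
            # so only the language code ("pt") is passed as the locale.
            {"name": "title", "type": "string", "locale": language, "stem": True},
            # Semantic side: an auto-embedding field built from `title`.
            # It carries no stemming setting at all.
            {
                "name": "embedding",
                "type": "float[]",
                "embed": {
                    "from": ["title"],
                    # Assumed example model; swap in whichever you use.
                    "model_config": {"model_name": "ts/multilingual-e5-small"},
                },
            },
        ],
    }


if __name__ == "__main__":
    # Both regional variants end up with the same keyword-field settings.
    pt_pt = build_schema("articles_pt_pt", "pt")
    pt_br = build_schema("articles_pt_br", "pt")
    print(pt_pt["fields"][0] == pt_br["fields"][0])
```

With a setup like this, the two collections differ only in name and in the documents they hold, which is why regional differences would have to come from the content and the embedding model rather than from stemming.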