Hello, I'm having an issue with sorting fields tha...
# community-help
e
Hello, I'm having an issue with sorting fields that start with umlauts (ÅÄÖ). They get normalized to sort in order with A and O. Am I missing something? I have attached a test schema and jsonl-file. The query is as follows:
GET http://localhost:8108/collections/umlauts/documents/search?q=&query_by=title&sort_by=title:asc
k
Sorting by string does not work on non-English locales because it's tricky to handle sort orders for other locales.
e
I see, that is very unfortunate. I would recommend looking into supporting them as they are part of many languages and have clear rules for how they should sort.
e
"have clear rules for how they should sort"
It's interesting, I have had a lot of conversations about this over the last few months. I don't think the rules are clear at all 😅
The consensus, as well as what I found looking into Unicode, IEEE, and UTF-8, generally puts the non-accented characters first. Not sure what language you're using, but in PHP, characters get sorted by their Unicode value, which does tend to break things unless you're using mb_* functions.
That's not to say that you or Typesense are wrong, but just mentioning, there's no clear consensus on how sorting should behave
e
I think what you are seeing is that different languages have different rules, and that in some languages letters with umlauts are completely separate letters, not just "accented letters", which can be confusing.
PHP can sort by locale: https://www.php.net/manual/en/array.constants.php#constant.sort-locale-string
MySQL sorts by the collation: https://dev.mysql.com/doc/refman/8.0/en/charset-collation-effect.html
Unicode has a collation chart: https://www.unicode.org/charts/collation/
and a whole project for how to handle languages around the world: https://cldr.unicode.org/
Maybe the ICU project could be of value for Typesense, as it is C++: https://icu.unicode.org/home
OK, now I'm even more confused, collation is built into C++ already? https://en.cppreference.com/w/cpp/locale/collate I don't know anything about C++, but if I choose a locale for my collection string fields, and then choose to sort on it, should it not just follow the locale during sort as in the example above?
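To make the point about per-language rules concrete, here is a minimal Python sketch. Both key functions are hypothetical simplifications written for this example (they are not ICU, CLDR, or Typesense code): `swedish_key` treats Å, Ä, Ö as separate letters after Z, as in the Swedish alphabet, while `folded_key` strips the diacritics so that Ä and Ö sort with A and O, roughly what the original post describes.

```python
import unicodedata

# Hypothetical, simplified table for illustration: Swedish places å, ä, ö after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
SWEDISH_RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def swedish_key(word: str) -> list:
    """Rank each letter by its position in the Swedish alphabet."""
    composed = unicodedata.normalize("NFC", word.lower())
    # Characters outside the table sort after all known letters.
    return [SWEDISH_RANK.get(ch, len(SWEDISH_ALPHABET)) for ch in composed]


def folded_key(word: str) -> str:
    """Strip combining marks so that ä sorts with a and ö with o."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


titles = ["Zebra", "Äpple", "Apa", "Öl", "Ost", "Ål"]

swedish_order = sorted(titles, key=swedish_key)
folded_order = sorted(titles, key=folded_key)

print(swedish_order)  # Å, Ä, Ö come after Z
print(folded_order)   # Å, Ä, Ö are interleaved with A and O
```

The same six titles end up in two different orders, and neither is wrong in general: which one is correct depends on the locale chosen for the field.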
k
I've not had a chance to look into this in more detail. Sorting of string fields in Typesense is based on byte order, so it only works reliably with ASCII: other languages are represented as multi-byte UTF-8 sequences, so relying on byte order won't work for them. We don't use the built-in C++ std::sort() because we have to store the strings already in sorted order for efficiency reasons.
I suspect C++ uses the locale collation facet (a mapping/ordering of characters) instead of relying on the actual byte values. I need to see how easy it is to implement this in Typesense, since we have to store the records in pre-sorted form.
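A small Python sketch of why raw byte order only works for ASCII, and of one possible way to stay compatible with pre-sorted storage: derive a sort key once, at indexing time, and compare the keys byte-wise. This is an illustration under stated assumptions only. The `collation_key` helper and its Swedish rank table are hypothetical and are not how Typesense stores or sorts strings.

```python
import unicodedata

# Hypothetical rank table for this sketch: Swedish alphabet, with å, ä, ö after z.
ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i + 1 for i, ch in enumerate(ALPHABET)}  # rank 0 is left unused


def collation_key(text: str) -> bytes:
    """Map each character to a one-byte rank so byte order equals alphabet order."""
    composed = unicodedata.normalize("NFC", text.lower())
    # Unknown characters get 255 and therefore sort last.
    return bytes(RANK.get(ch, 255) for ch in composed)


titles = ["Zebra", "Äpple", "Apa", "Öl", "Ål"]

# Plain UTF-8 byte order: the multi-byte letters land after every ASCII letter,
# and among themselves follow their code points (Ä, Å, Ö), not the alphabet.
byte_order = sorted(titles, key=lambda t: t.encode("utf-8"))

# Keys computed once can be stored next to the value and compared as raw bytes.
stored = sorted((collation_key(t), t) for t in titles)
key_order = [title for _, title in stored]

print(byte_order)
print(key_order)
```

The trade-off in this design is that the key depends on the locale, so changing a field's locale would mean recomputing the stored keys.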
e
I think it is acceptable to take a slight performance hit when setting a specific locale, compared to plain ASCII.