#community-help

Issue with Arabic Text Search

TLDR Muhammet had trouble searching Arabic text that contains diacritics (ḥarakāt). Kishore Nallan suggested using the "ar" locale and removing the diacritics, but the issue wasn't resolved. Further assistance is needed.

Jul 13, 2023
Muhammet
12:41 PM
Hello

I have a problem. I want to search Arabic texts, but I can't get the desired results because of the diacritics (ḥarakāt) and other special characters in Arabic.
search text: مَثَلُهُ

text 1 (found):
فَأَحْيَيْناهُ وَجَعَلْنا لَهُ نُوراً وفي صفة الكافر لم ينسبها إلى نفسه بل قال : كَمَنْ مَثَلُهُ فِي الظُّلُماتِ ولما كانت أنواع الكفر متعددة قال فِي الظُّلُماتِ ولما ذكر جعل النور للميت قال : يَمْشِي بِهِ فِي النَّاسِ أي يصحبه كيف تقلب ، وقال : فِي النَّاسِ إشارة إلى تنويره على نفسه وعلى غيره من الناس فذكر أن منفعة المؤمن ليست مقتصرة على نفسه وقابل تصرفه بالنور وملازمة النور له باستقرار الكافر فِي الظُّلُماتِ وكونه لا يفارقها

text 2 (not found):
قلت : فنسبة الوهم إلى مثله أولى من نسبته إلى ذاك الجبل حفظاً ؛ كما لا يخفى . وقد خالفه هشام بن سعيد ، فقال : انا معاوية ـ يعني : ابن سلام ـ ... بإسناده المذكور ، فقال : عبد الرحمن بن شيبة ـ مكان عبد الله بن نسيب ـ الذي لا وجود له في كتب الرجال ! رواه أحمد عنه (6/129



Can you help me?
Kishore Nallan
02:43 PM
👋 are you able to reproduce this issue when you just index those 2 records?
Muhammet
09:26 PM
Yes. I have this problem with many records. Records whose ḥarakāt differ from the query are not listed in the results. If I search for text with ḥarakāt, only texts with ḥarakāt are listed. If I search without ḥarakāt, only texts without vowel points are listed, even though it's the same text.
Jul 14, 2023
Muhammet
08:06 PM
can you help me?
Jul 15, 2023
Muhammet
11:18 AM
token_separators is a nice solution, but it doesn't support Arabic ḥarakāt.
[Image attached]
Kishore Nallan
11:22 AM
We will have to look into this. I will get back to you in a few days.
Kishore Nallan
11:23 AM
Btw, did you use the "ar" locale for the fields?
Muhammet
11:26 AM
No, I didn't use it. I haven't seen an Arabic locale mentioned in the documentation. Let me try it and report back.
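For reference, a minimal sketch of where a per-field locale would go, assuming the search engine here is Typesense (the thread does not name it, but token_separators and a per-field locale both match its collection schema). The collection name "texts" and the field names "body" and "title" are hypothetical, and nothing in this thread confirms that "ar" changes how diacritics are handled.

```python
# Hypothetical collection schema; "texts", "body" and "title" are made-up names.
schema = {
    "name": "texts",
    "fields": [
        # "locale" is set per field, so only this field is treated as Arabic.
        {"name": "body", "type": "string", "locale": "ar"},
        # A field without "locale" keeps the default behaviour.
        {"name": "title", "type": "string"},
    ],
}

# List the fields that carry the Arabic locale.
arabic_fields = [f["name"] for f in schema["fields"] if f.get("locale") == "ar"]
print(arabic_fields)
```

With the official Python client, a dict of this shape is what would typically be passed to client.collections.create(...).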
Kishore Nallan
11:33 AM
In any case, I suspect that it doesn't match because of some of the additional characters. See the screenshot attached. First line is the query, and the next two lines contain the word in the two documents. They are different because of the diacritics.
[Screenshot attached]
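The difference described above can be reproduced without the screenshot. The sketch below lists the code points of the query word and of the same word as it appears, unvocalized, in text 2. The two strings are spelled out as escapes on the assumption that they match the characters pasted in this thread.

```python
import unicodedata

# The query word with ḥarakāt, and the bare word as it appears in text 2
# (assumed transcription of the strings pasted in the thread).
QUERY = "\u0645\u064e\u062b\u064e\u0644\u064f\u0647\u064f"
PLAIN = "\u0645\u062b\u0644\u0647"

def describe(word):
    """Return (code point, Unicode name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in word]

def combining_marks(word):
    """Return the nonspacing marks (category Mn) in word; ḥarakāt fall in this category."""
    return [ch for ch in word if unicodedata.category(ch) == "Mn"]

for label, word in (("query", QUERY), ("text 2", PLAIN)):
    print(f"{label}: {len(word)} code points, {len(combining_marks(word))} marks")
    for code_point, name in describe(word):
        print("   ", code_point, name)
```

The base letters are identical in both words. The query simply carries an extra mark after each letter, which is enough for an exact token comparison to fail.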
Muhammet
11:39 AM
I tried it, but the problem persists. The same text with and without ḥarakāt is still treated as different text.
Kishore Nallan
11:40 AM
Yes, we will have to remove those diacritics for the words to match.
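Until that happens on the server side, one possible client-side workaround is to strip the marks yourself, both from documents before indexing and from queries before searching. This is only a sketch: the set of marks removed (U+064B to U+0652 plus the superscript alef U+0670) is an assumption about what counts as ḥarakāt here, and the "text" and "text_normalized" field names are hypothetical.

```python
import re

# Assumed set of marks to drop: tanween, fatha, damma, kasra, shadda, sukun
# (U+064B..U+0652) and superscript alef (U+0670). Only Arabic marks are listed,
# so accents in other languages are left alone.
HARAKAT = re.compile("[\u064b-\u0652\u0670]")

def strip_harakat(text: str) -> str:
    """Remove Arabic vowel marks from text, leaving everything else as-is."""
    return HARAKAT.sub("", text)

def prepare_document(doc: dict) -> dict:
    """Keep the original text and add a stripped copy in a hypothetical field."""
    return {**doc, "text_normalized": strip_harakat(doc["text"])}

query = "\u0645\u064e\u062b\u064e\u0644\u064f\u0647\u064f"  # the query word from the thread, with ḥarakāt
plain = "\u0645\u062b\u0644\u0647"  # the same word without ḥarakāt

print(strip_harakat(query) == plain)
print(strip_harakat("café Ölçü"))
print(prepare_document({"text": query}))
```

Searching the stripped copy with a stripped query makes the vocalized and unvocalized forms compare equal, while the original text stays available for display.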
Muhammet
11:55 AM
The field I use for Arabic is also used for all other languages, so I don't know how much setting the Arabic locale on it would affect me. It seems the best solution for me is token_separators.