Diacritics restoration for Arabic dialect texts Salilma Harrat ENS Bouzareah, Algiers, Algeria 3. Algiers language 1. Introduction Vocalization, diactritization or diacritics restoration is one of the major challenges for the Arabic natural language processing. For Arabic language (even for other languages), it remains an unresolved problem. Algiers dialect as an Arabic language is concerned by this problem which produces several challenges for many automatic tasks on this language. we present an automatic diacritization system for Arabic texts based on statistical approach. We attempt to use available tools for statistical machine translation for building such a system which basically does not require any linguistic knowledge. We began by working on MSA texts for many reasons: Algiers dialect is an Arabic language which obeys to almost the same writing rules of MSA. Availability of diacritized texts in MSA allows to test our system on a large amount of data, which is not the case for Algiers dialect. Finally, we worked first on MSA texts because of the available results for many works in this field. 2. Arabic language A Semitic Language with consonantal alphabet (which denotes only consonants). It Consists of 28 letters denoting consonants and three long vowels ( اa, وw and يy). It includes short vowels ( ـَــa, ـُــu, ـِــi ) and other phonetic symbols which are represented by diacritics (strokes placed above or below consonants and long vowels). Short vowels can appear anywhere in the word, the Tanween represents doubled case endings. Arabic diacritics include syllabification marks: the Shadda or gemination mark denotes a double consonant (it could be combined with short vowels and doubled case endings), and the Sukun which denotes the absence of vowel. Diacritized consonant /b/ Name Pronunciation ـبَـ ــبُـ Fatha /ba/ Damma /bu/ ــ ِبـ Kasra /bi/ ــبًــا ـب ٌ Tanween Fatha /ban/ Tanween Damma /bun/ Tanween Kasra /bin/ Shadda /bb/ ـب ٍ ــبّــ Sukun ــبْــ /b/ Table: Arabic diacritics and their pronunciationnn Arabic diacritics are used for disambiguation. A word without diacritics could have many valid diactritizations (depending on its grammatical category) which generates several interpretations. Diacritic form Meaning Grammatical category ف ََّس َرfassara he explained Verb (active voice) ف ِ ُّس َرfussira َف ِس ْر fasir was explained Verb (passive voice) so walk Conjunction+ verb َف ِس ّ ٌرfasirrun and a secret Conjunction+noun ف َْس ٌرfasrun statement Noun Table: Some possible diacritizations of Arabic root سـر فـــ/fsr/ CRSTDLA, Each Algiers, corpusAlgeria was split into training (80%), developing(10%) and testing(10%) Mourad Abbas sets by randomly allocating a set of documents to each task. Represents the dialectal Arabic spoken in Algiers and its periphery. It is Badji Mokhtar University, Annaba, Algeria different from the dialects spoken in the other areas of the country. 5. Expriments Simplifies the morphological and syntactic rules of the written Arabic. Campus scientifique LORIA , Nancy, France WER and DER Meftouh Uses the Arabic alphabet; however the use of some letters like ظ/ẓ/, ذ/ḍ/ and Karima WER (Word Error Rate): percentage of incorrectly diacritized words (delimited ث/ṯ/ is rare. by white-space). Uses some non-Arabic letters like پ/p/ as in the word اپا ( پـــdad). DER (Diacritization Error Rate): percentage of incorrectly diacritized Besides phonological alteration of words, Algiers dialect drops the case characters. Kamel Samili endings of the written language which are replaced by the Sukun (absence of Precision and Recall vowel), this simplifies diacritization process but generates ambiguity at syntactic level. It uses all Arabic diacritics except the Tanween doubled case endings. Like written Arabic, diacritics in Algiers dialect are used for disambiguation, a word without diacritics could accept many valid diacritizations and then if they are missing, it would be difficult to understand its meaning. Recall rates at word and character level could be deduced from WER and DER respectively, they correspond to: Dialectal Word Valid diacritization Meaning and جوز/ǧūɀ/ قريت/qrīt/ َج َّو ْز/ǧawwaɀ/ ُجو ْز/ǧūwɀ/ You or he spent ُجو ْز/ǧūwɀ/ يت ْ ق ِْر/qrīt/ Nut I or you read/studied يت ْ َق ِ ّر/qarrīt/ I or you teached You pass Results for MSA (WER/DER) Corpus Table :Example of multiple diacritizations of dialectal word 4. Baseline system WER DER WER DER Substutution 16.2 1.8 23.1 0.5 Deletion 0.0 1.9 0.0 5.1 Insertion 0.0 0.3 0.0 0.1 Total(WER/DER) 16.2 4.1 23.1 5.7 The data Using SMT approach assumes the availability of the parallel corpora for source and target languages. To build such a resource in our case, it is enough to get a vocalized corpus when available and proceed to remove diacritics from it or in the other case, diacritize an unvolcalized corpus. The two ways were exploited in our work, the first one for MSA and the second one for Algiers dialect. Corpus Tashkeela Description Collection of classical Arabic books downloaded from an on-line library LDC Arabic Treebank (Part3,V1.0) Set of 600 documents collected from Annahar News Texts. Algiers dialect corpus Set of transcripted (by hand) movies in algiers dialect language. Table: Data description Size More than 6M of words 340K words. 23K words Most errors are concentrated on case endings of words relative to short vowels and doubled case endings. When ignoring them, the WER and DER decrease with more than 50% for both corpora. Results for MSA (Precision/Recall) Level WER DER Substitution 25.8 0.1 Deletion 0.0 12.6 Insertion 0.0 0.1 Total (WER/DER) 25.8 12.8 Table: Word and Diacritization Error Rate Summary (Algiers dialect) Results for Algiers dialect (Precision/Recall) Level Word Word Character Corpus Recall Precision Recall Precision Algiers Dialect 74.2% 96.3% 87.2% 98% Table: Precision and recall Summary (Algiers Dialect) High precision rates compared to recorded rates for MSA corpora, due mainly to absence of words case endings in Algiers dialect and the multiple diacritization of a word in this language is less important than it is in MSA. 6. Conclusion Table: Word and Diacritization Error Rate Summary (MSA) An SMT system for diacritization based on parallel corpora of diacritized and undiacritized texts is built. The system is phrase based with default settings: bidirectional phrase and lexical translation probabilities, distortion model (with seven features), a word and a phrase penalty and a language model. The system consists of a word alignment between source and target. A phrase table with undiacritized and diacritized entries, a trigram language model trained on diacritized texts. Distribution LDC ATB Distribution Building SMT system for diacritization Tashkeela For comparison, many tests were run for small MSA corpora extracted from Tashkeela and ATB with the same size order as Algiers dialect corpus, recorded WERs and DERs were respectively more than 56.5 and 20.5 at each time and for both corpora. Character Corpus Recall Precision Recall Precision Tashkeela 83.8% 85.2% 95.9% 93.1% ATB 76.9% 89.2% 94.3% 96% Table: Precision and recall Summary(MSA) In terms of precision, slight decrease of Tashkeela rates is noticed compared with ATB rates. Results for Algiers dialect (WER/DER) Higher WER and DER than for MSA experiments due to the smallness of the Algiers dialect corpus. DER at character level mostly concentrated on the deletion rate due to absence of words in the training data. Several tests were performed by increasing the amount of data at every test, when the corpus size increases by a little percentage, the WER and DER decrease also. A statistical approach for diacritization of Algiers dialect texts based on a machine translation system . The solution was experimented on MSA corpora, for positioning the results against other works. No use of any other NLP tools (even if they are available for Arabic language) in order to be in the same conditions for Algiers dialect which is an non-resourced language. Regards to the small amount of training data for the dialect, acceptable WER and DER were got when compared to the results with small MSA corpora. For Arabic language side, the results for Tashkeela corpus were slightly better compared to those for Arabic Treebank corpus. This is mainly due to the size of the two corpora and the nature of data of each one (the ATB consists of several kinds of texts with entirely different topics, data disparity in its case is considerable whereas Tashkeela corpus contains classical books related to theology). In terms of precision, in spite of the small amount of available data for the dialect, a higher percentage were obtained compared to MSA corpora (96.3% vs.85.2% and 89.2%) at word level and (98% vs. 93.1% and 96%) at character level. Considering this interesting precision rate, this system will be used for the vocalization of the rest of the dialectal corpus, this will save human effort and time since a little amount of the data needs to be corrected (by hand). This iterative process will be used for enriching our dialect corpus for getting better results with few efforts. Interspeech 2013