Arabic Diacritics Restoration for Algiers Dialect Texts

Telechargé par lgsmus

Téléchargement

Diacrics restoraon for Arabic dialect texts

1. Introduction

Salilma Harrat

Mourad Abbas

Karima Meouh

Kamel Samili

ENS Bouzareah, Algiers, Algeria

CRSTDLA, Algiers, Algeria

Badji Mokhtar University, Annaba, Algeria

Campus scien()que LORIA , Nancy, France

Vocalization, diactritization or diacritics restoration is one of the major challenges

for the Arabic natural language processing. For Arabic language (even for other

languages), it remains an unresolved problem. Algiers dialect as an Arabic

language is concerned by this problem which produces several challenges for

many automatic tasks on this language. we present an automatic diacritization

system for Arabic texts based on statistical approach. We attempt to use available

tools for statistical machine translation for building such a system which basically

does not require any linguistic knowledge. We began by working on MSA texts for

many reasons: Algiers dialect is an Arabic language which obeys to almost the

same writing rules of MSA. Availability of diacritized texts in MSA allows to test

our system on a large amount of data, which is not the case for Algiers dialect.

Finally, we worked first on MSA texts because of the available results for many

works in this field.

2. Arabic language

 A Semitic Language with consonantal alphabet (which denotes only

consonants). It Consists of 28 letters denoting consonants and three long

vowels ( a,  w and  y).

 It includes short vowels (  a,  u,  i ) and other phonetic symbols which are

represented by diacritics (strokes placed above or below consonants and long

vowels).

 Short vowels can appear anywhere in the word, the Tanween represents

doubled case endings.

 Arabic diacritics include syllabification marks: the Shadda or gemination mark

denotes a double consonant (it could be combined with short vowels and

doubled case endings), and the Sukun which denotes the absence of vowel.

Diacritized consonant /b/ Name Pronunciation

 Fatha /ba/

 Damma /bu/

 Kasra /bi/

 Tanween Fatha /ban/



 Tanween Damma /bun/



 Tanween Kasra /bin/

 Shadda /bb/

 Sukun /b/

Table: Arabic diacritics and their pronunciationnn

Arabic diacritics are used for disambiguation. A word without diacritics could

have many valid diactritizations (depending on its grammatical category) which

generates several interpretations.

Diacritic form Meaning Grammatical category



 fassara he explained Verb (active voice)



 fussira was explained Verb (passive voice)



 fasir so walk Conjunction+ verb



 fasirrun and a secret Conjunction+noun



 fasrun statement Noun

Table: Some possible diacritizations of Arabic root  /fsr/

 Represents the dialectal Arabic spoken in Algiers and its periphery. It is

different from the dialects spoken in the other areas of the country.

 Simplifies the morphological and syntactic rules of the written Arabic.

 Uses the Arabic alphabet; however the use of some letters like /ẓ/,  /ḍ/ and

 /ṯ/ is rare.

 Uses some non-Arabic letters like  /p/ as in the word  (dad).

 Besides phonological alteration of words, Algiers dialect drops the case

endings of the written language which are replaced by the Sukun (absence of

vowel), this simplifies diacritization process but generates ambiguity at syntactic

level.

 It uses all Arabic diacritics except the Tanween doubled case endings.

 Like written Arabic, diacritics in Algiers dialect are used for disambiguation, a

word without diacritics could accept many valid diacritizations and then if they

are missing, it would be difficult to understand its meaning.

3. Algiers language

Dialectal Word Valid diacritization Meaning

 /ǧūɀ/

 /ǧawwaɀ/ You or he spent

 /ǧūwɀ/ You pass

 /ǧūwɀ/ Nut

 /qrīt/ 

 /qrīt/ I or you read/studied



 /qarrīt/ I or you teached

Table :Example of multiple diacritizations of dialectal word

4. Baseline system

An SMT system for diacritization based on parallel corpora of diacritized and

undiacritized texts is built. The system is phrase based with default settings:

bidirectional phrase and lexical translation probabilities, distortion model (with

seven features), a word and a phrase penalty and a language model. The system

consists of a word alignment between source and target. A phrase table with

undiacritized and diacritized entries, a trigram language model trained on

diacritized texts.

Using SMT approach assumes the availability of the parallel corpora for source

and target languages. To build such a resource in our case, it is enough to get a

vocalized corpus when available and proceed to remove diacritics from it or in the

other case, diacritize an unvolcalized corpus. The two ways were exploited in our

work, the first one for MSA and the second one for Algiers dialect.

 Building SMT system for diacri(za(on

 The data

Corpus Description Size

Tashkeela Collection of classical Arabic books

downloaded from an on-line library

More than

6M of words

LDC Arabic Treebank

(Part3,V1.0)

Set of 600 documents collected from

Annahar News Texts.

340K

words.

Algiers dialect corpus Set of transcripted (by hand) movies in

algiers dialect language. 23K words

Table: Data description

Each corpus was split into training (80%), developing(10%) and testing(10%)

sets by randomly allocating a set of documents to each task.

5. Expriments

 WER and DER

 WER (Word Error Rate): percentage of incorrectly diacritized words (delimited

by white-space).

 DER (Diacritization Error Rate): percentage of incorrectly diacritized

characters.

Corpus Tashkeela LDC ATB

Distribution WER DER WER DER

Substutution 16.2 1.8 23.1 0.5

Deletion 0.0 1.9 0.0 5.1

Insertion 0.0 0.3 0.0 0.1

Total(WER/DER) 16.2 4.1 23.1 5.7

Table: Word and Diacritization Error Rate Summary (MSA)

 Most errors are concentrated on case endings of words relative to short vowels

and doubled case endings. When ignoring them, the WER and DER decrease

with more than 50% for both corpora.

 Precision and Recall

 Recall rates at word and character level could be deduced from WER and

DER respectively, they correspond to:

and



Results for MSA (Precision/Recall)

Level Word Character

Corpus Recall Precision Recall Precision

Tashkeela 83.8% 85.2% 95.9% 93.1%

ATB 76.9% 89.2% 94.3% 96%

Table: Precision and recall Summary(MSA)

 In terms of precision, slight decrease of Tashkeela rates is noticed compared

with ATB rates.

 Results for Algiers dialect (WER/DER)

Distribution WER DER

Substitution 25.8 0.1

Deletion 0.0 12.6

Insertion 0.0 0.1

Total (WER/DER) 25.8 12.8

Table: Word and Diacritization Error Rate Summary (Algiers dialect)

 For comparison, many tests were run for small MSA corpora extracted from

Tashkeela and ATB with the same size order as Algiers dialect corpus,

recorded WERs and DERs were respectively more than 56.5 and 20.5 at each

time and for both corpora.

 Results for Algiers dialect (Precision/Recall)

Level Word Character

Corpus Recall Precision Recall Precision

Algiers Dialect 74.2% 96.3% 87.2% 98%

Table: Precision and recall Summary (Algiers Dialect)

 High precision rates compared to recorded rates for MSA corpora, due

mainly to absence of words case endings in Algiers dialect and the multiple

diacritization of a word in this language is less important than it is in MSA.

 A statistical approach for diacritization of Algiers dialect texts based on a

machine translation system .

 The solution was experimented on MSA corpora, for positioning the

results against other works.

 No use of any other NLP tools (even if they are available for Arabic

language) in order to be in the same conditions for Algiers dialect which is

an non-resourced language.

 Regards to the small amount of training data for the dialect, acceptable

WER and DER were got when compared to the results with small MSA

corpora.

 For Arabic language side, the results for Tashkeela corpus were slightly

better compared to those for Arabic Treebank corpus. This is mainly due to

the size of the two corpora and the nature of data of each one (the ATB

consists of several kinds of texts with entirely different topics, data disparity

in its case is considerable whereas Tashkeela corpus contains classical

books related to theology).

 In terms of precision, in spite of the small amount of available data for the

dialect, a higher percentage were obtained compared to MSA corpora

(96.3% vs.85.2% and 89.2%) at word level and (98% vs. 93.1% and 96%) at

character level.

 Considering this interesting precision rate, this system will be used for the

vocalization of the rest of the dialectal corpus, this will save human effort and

time since a little amount of the data needs to be corrected (by hand).

 This iterative process will be used for enriching our dialect corpus for

getting better results with few efforts.

6. Conclusion

 Higher WER and DER than for MSA experiments due to the smallness of the

Algiers dialect corpus.

 DER at character level mostly concentrated on the deletion rate due to

absence of words in the training data.

 Several tests were performed by increasing the amount of data at every test,

when the corpus size increases by a little percentage, the WER and DER

decrease also.

Interspeech 2013



Results for MSA (WER/DER)

1 / 1 100%

Documents connexes

Chatnovo Life Parter

Arabic Error Correction: Hybrid System for QALB-Shared Task 2015

Arabic Dialect Morphological Segmentation for Machine Translation

15787559n7a4

Paper 68-Convolutional Neural Networks in Predicting Missing Text

TermFrame: Modeling Knowledge in Karstology

Language and Schooling in Morocco: A Linguistic Analysis

gambarogno - Camping Tamaro

Text Mining with NLP: Feature Engineering & Modeling

Exploring the Past: Unit Plan for 3rd Year LT

bismillah-hir-rahman-nir-raheem compress

Merci pour votre participation!

Faire une suggestion

Avez-vous trouvé des erreurs dans l'interface ou les textes ? Ou savez-vous comment améliorer l'interface utilisateur de StudyLib ? N'hésitez pas à envoyer vos suggestions. C'est très important pour nous!

GDPR Confidentialité Conditions d'utilisation

Arabic Diacritics Restoration for Algiers Dialect Texts

Documents connexes

Faire une suggestion

Produits

Assistance

Produits

Assistance

Arabic Diacritics Restoration for Algiers Dialect Texts

Documents connexes

Faire une suggestion

Produits

Assistance

Ajouter ce document à la (aux) collections

Ajouter ce document à enregistré

Suggérez-nous comment améliorer StudyLib