poster interspeech2013

Telechargé par lgsmus
Diacrics restoraon for Arabic dialect texts
1. Introduction
Salilma Harrat
Mourad Abbas
Karima Meouh
Kamel Samili
ENS Bouzareah, Algiers, Algeria
CRSTDLA, Algiers, Algeria
Badji Mokhtar University, Annaba, Algeria
Campus scien()que LORIA , Nancy, France
Vocalization, diactritization or diacritics restoration is one of the major challenges
for the Arabic natural language processing. For Arabic language (even for other
languages), it remains an unresolved problem. Algiers dialect as an Arabic
language is concerned by this problem which produces several challenges for
many automatic tasks on this language. we present an automatic diacritization
system for Arabic texts based on statistical approach. We attempt to use available
tools for statistical machine translation for building such a system which basically
does not require any linguistic knowledge. We began by working on MSA texts for
many reasons: Algiers dialect is an Arabic language which obeys to almost the
same writing rules of MSA. Availability of diacritized texts in MSA allows to test
our system on a large amount of data, which is not the case for Algiers dialect.
Finally, we worked first on MSA texts because of the available results for many
works in this field.
2. Arabic language
A Semitic Language with consonantal alphabet (which denotes only
consonants). It Consists of 28 letters denoting consonants and three long
vowels ( a, w and y).
It includes short vowels (  a,  u,  i ) and other phonetic symbols which are
represented by diacritics (strokes placed above or below consonants and long
vowels).
Short vowels can appear anywhere in the word, the Tanween represents
doubled case endings.
Arabic diacritics include syllabification marks: the Shadda or gemination mark
denotes a double consonant (it could be combined with short vowels and
doubled case endings), and the Sukun which denotes the absence of vowel.
Diacritized consonant /b/ Name Pronunciation
 Fatha /ba/
 Damma /bu/
 Kasra /bi/
 Tanween Fatha /ban/
 Tanween Damma /bun/
 Tanween Kasra /bin/
 Shadda /bb/
 Sukun /b/
Table: Arabic diacritics and their pronunciationnn
Arabic diacritics are used for disambiguation. A word without diacritics could
have many valid diactritizations (depending on its grammatical category) which
generates several interpretations.
Diacritic form Meaning Grammatical category
fassara he explained Verb (active voice)
fussira was explained Verb (passive voice)
fasir so walk Conjunction+ verb
fasirrun and a secret Conjunction+noun
fasrun statement Noun
Table: Some possible diacritizations of Arabic root  /fsr/
Represents the dialectal Arabic spoken in Algiers and its periphery. It is
different from the dialects spoken in the other areas of the country.
Simplifies the morphological and syntactic rules of the written Arabic.
Uses the Arabic alphabet; however the use of some letters like /ẓ/, /ḍ/ and
/ṯ/ is rare.
Uses some non-Arabic letters like /p/ as in the word  (dad).
Besides phonological alteration of words, Algiers dialect drops the case
endings of the written language which are replaced by the Sukun (absence of
vowel), this simplifies diacritization process but generates ambiguity at syntactic
level.
It uses all Arabic diacritics except the Tanween doubled case endings.
Like written Arabic, diacritics in Algiers dialect are used for disambiguation, a
word without diacritics could accept many valid diacritizations and then if they
are missing, it would be difficult to understand its meaning.
3. Algiers language
Dialectal Word Valid diacritization Meaning
 /ǧūɀ/
/ǧawwaɀ/ You or he spent
 /ǧūwɀ/ You pass
 /ǧūwɀ/ Nut
 /qrīt/
 /qrīt/ I or you read/studied
 /qarrīt/ I or you teached
Table :Example of multiple diacritizations of dialectal word
4. Baseline system
An SMT system for diacritization based on parallel corpora of diacritized and
undiacritized texts is built. The system is phrase based with default settings:
bidirectional phrase and lexical translation probabilities, distortion model (with
seven features), a word and a phrase penalty and a language model. The system
consists of a word alignment between source and target. A phrase table with
undiacritized and diacritized entries, a trigram language model trained on
diacritized texts.
Using SMT approach assumes the availability of the parallel corpora for source
and target languages. To build such a resource in our case, it is enough to get a
vocalized corpus when available and proceed to remove diacritics from it or in the
other case, diacritize an unvolcalized corpus. The two ways were exploited in our
work, the first one for MSA and the second one for Algiers dialect.
Building SMT system for diacri(za(on
The data
Corpus Description Size
Tashkeela Collection of classical Arabic books
downloaded from an on-line library
More than
6M of words
LDC Arabic Treebank
(Part3,V1.0)
Set of 600 documents collected from
Annahar News Texts.
340K
words.
Algiers dialect corpus Set of transcripted (by hand) movies in
algiers dialect language. 23K words
Table: Data description
Each corpus was split into training (80%), developing(10%) and testing(10%)
sets by randomly allocating a set of documents to each task.
5. Expriments
WER and DER
WER (Word Error Rate): percentage of incorrectly diacritized words (delimited
by white-space).
DER (Diacritization Error Rate): percentage of incorrectly diacritized
characters.
Corpus Tashkeela LDC ATB
Distribution WER DER WER DER
Substutution 16.2 1.8 23.1 0.5
Deletion 0.0 1.9 0.0 5.1
Insertion 0.0 0.3 0.0 0.1
Total(WER/DER) 16.2 4.1 23.1 5.7
Table: Word and Diacritization Error Rate Summary (MSA)
Most errors are concentrated on case endings of words relative to short vowels
and doubled case endings. When ignoring them, the WER and DER decrease
with more than 50% for both corpora.
Precision and Recall
Recall rates at word and character level could be deduced from WER and
DER respectively, they correspond to:
and
Results for MSA (Precision/Recall)
Level Word Character
Corpus Recall Precision Recall Precision
Tashkeela 83.8% 85.2% 95.9% 93.1%
ATB 76.9% 89.2% 94.3% 96%
Table: Precision and recall Summary(MSA)
In terms of precision, slight decrease of Tashkeela rates is noticed compared
with ATB rates.
Results for Algiers dialect (WER/DER)
Distribution WER DER
Substitution 25.8 0.1
Deletion 0.0 12.6
Insertion 0.0 0.1
Total (WER/DER) 25.8 12.8
Table: Word and Diacritization Error Rate Summary (Algiers dialect)
For comparison, many tests were run for small MSA corpora extracted from
Tashkeela and ATB with the same size order as Algiers dialect corpus,
recorded WERs and DERs were respectively more than 56.5 and 20.5 at each
time and for both corpora.
Results for Algiers dialect (Precision/Recall)
Level Word Character
Corpus Recall Precision Recall Precision
Algiers Dialect 74.2% 96.3% 87.2% 98%
Table: Precision and recall Summary (Algiers Dialect)
High precision rates compared to recorded rates for MSA corpora, due
mainly to absence of words case endings in Algiers dialect and the multiple
diacritization of a word in this language is less important than it is in MSA.
A statistical approach for diacritization of Algiers dialect texts based on a
machine translation system .
The solution was experimented on MSA corpora, for positioning the
results against other works.
No use of any other NLP tools (even if they are available for Arabic
language) in order to be in the same conditions for Algiers dialect which is
an non-resourced language.
Regards to the small amount of training data for the dialect, acceptable
WER and DER were got when compared to the results with small MSA
corpora.
For Arabic language side, the results for Tashkeela corpus were slightly
better compared to those for Arabic Treebank corpus. This is mainly due to
the size of the two corpora and the nature of data of each one (the ATB
consists of several kinds of texts with entirely different topics, data disparity
in its case is considerable whereas Tashkeela corpus contains classical
books related to theology).
In terms of precision, in spite of the small amount of available data for the
dialect, a higher percentage were obtained compared to MSA corpora
(96.3% vs.85.2% and 89.2%) at word level and (98% vs. 93.1% and 96%) at
character level.
Considering this interesting precision rate, this system will be used for the
vocalization of the rest of the dialectal corpus, this will save human effort and
time since a little amount of the data needs to be corrected (by hand).
This iterative process will be used for enriching our dialect corpus for
getting better results with few efforts.
6. Conclusion
Higher WER and DER than for MSA experiments due to the smallness of the
Algiers dialect corpus.
DER at character level mostly concentrated on the deletion rate due to
absence of words in the training data.
Several tests were performed by increasing the amount of data at every test,
when the corpus size increases by a little percentage, the WER and DER
decrease also.
Interspeech 2013
Results for MSA (WER/DER)
1 / 1 100%

poster interspeech2013

Telechargé par lgsmus
La catégorie de ce document est-elle correcte?
Merci pour votre participation!

Faire une suggestion

Avez-vous trouvé des erreurs dans linterface ou les textes ? Ou savez-vous comment améliorer linterface utilisateur de StudyLib ? Nhésitez pas à envoyer vos suggestions. Cest très important pour nous !