Telechargé par lgsmus

poster interspeech2013

publicité
Diacritics restoration for Arabic dialect texts
Salilma Harrat
ENS Bouzareah, Algiers, Algeria
3. Algiers language
1. Introduction
Vocalization, diactritization or diacritics restoration is one of the major challenges
for the Arabic natural language processing. For Arabic language (even for other
languages), it remains an unresolved problem. Algiers dialect as an Arabic
language is concerned by this problem which produces several challenges for
many automatic tasks on this language. we present an automatic diacritization
system for Arabic texts based on statistical approach. We attempt to use available
tools for statistical machine translation for building such a system which basically
does not require any linguistic knowledge. We began by working on MSA texts for
many reasons: Algiers dialect is an Arabic language which obeys to almost the
same writing rules of MSA. Availability of diacritized texts in MSA allows to test
our system on a large amount of data, which is not the case for Algiers dialect.
Finally, we worked first on MSA texts because of the available results for many
works in this field.
2. Arabic language
 A Semitic Language with consonantal alphabet (which denotes only
consonants). It Consists of 28 letters denoting consonants and three long
vowels (‫ ا‬a, ‫ و‬w and ‫ ي‬y).
 It includes short vowels ( ‫ ـَــ‬a, ‫ ـُــ‬u, ‫ ـِــ‬i ) and other phonetic symbols which are
represented by diacritics (strokes placed above or below consonants and long
vowels).
 Short vowels can appear anywhere in the word, the Tanween represents
doubled case endings.
 Arabic diacritics include syllabification marks: the Shadda or gemination mark
denotes a double consonant (it could be combined with short vowels and
doubled case endings), and the Sukun which denotes the absence of vowel.
Diacritized consonant /b/
Name
Pronunciation
‫ـبَـ‬
‫ــبُـ‬
Fatha
/ba/
Damma
/bu/
‫ــ ِبـ‬
Kasra
/bi/
‫ــبًــا‬
‫ـب‬
ٌ
Tanween Fatha
/ban/
Tanween Damma
/bun/
Tanween Kasra
/bin/
Shadda
/bb/
‫ـب‬
ٍ
‫ــبّــ‬
Sukun
‫ــبْــ‬
/b/
Table: Arabic diacritics and their pronunciationnn
Arabic diacritics are used for disambiguation. A word without diacritics could
have many valid diactritizations (depending on its grammatical category) which
generates several interpretations.
Diacritic form
Meaning
Grammatical category
‫ ف ََّس َر‬fassara
he explained
Verb (active voice)
‫ ف ِ ُّس َر‬fussira
‫َف ِس ْر‬
fasir
was explained
Verb (passive voice)
so walk
Conjunction+ verb
‫ َف ِس ّ ٌر‬fasirrun
and a secret
Conjunction+noun
‫ ف َْس ٌر‬fasrun
statement
Noun
Table: Some possible diacritizations of Arabic root
‫سـر‬
‫ فـــ‬/fsr/
CRSTDLA,
Each Algiers,
corpusAlgeria
was split into training (80%), developing(10%) and testing(10%)
Mourad
Abbas
sets by randomly allocating a set of documents to each task.
 Represents the dialectal Arabic spoken in Algiers and its periphery. It is
Badji Mokhtar University, Annaba, Algeria
different from the dialects spoken in the other areas of the country.
5. Expriments
 Simplifies the morphological and syntactic rules of the written Arabic.
Campus scientifique
LORIA , Nancy, France
 WER
and DER
Meftouh
 Uses the Arabic alphabet; however the use of some letters like ‫ظ‬/ẓ/, ‫ ذ‬/ḍ/ and Karima
 WER (Word Error Rate): percentage of incorrectly diacritized words (delimited
‫ ث‬/ṯ/ is rare.
by white-space).
 Uses some non-Arabic letters like ‫ پ‬/p/ as in the word ‫اپا‬

‫( پـــ‬dad).
DER (Diacritization Error Rate): percentage of incorrectly diacritized
 Besides phonological alteration of words, Algiers dialect drops the case
characters.
Kamel
Samili
endings of the written language which are replaced by the Sukun (absence of
 Precision and Recall
vowel), this simplifies diacritization process but generates ambiguity at syntactic
level.
 It uses all Arabic diacritics except the Tanween doubled case endings.
 Like written Arabic, diacritics in Algiers dialect are used for disambiguation, a
word without diacritics could accept many valid diacritizations and then if they
are missing, it would be difficult to understand its meaning.
 Recall rates at word and character level could be deduced from WER and
DER respectively, they correspond to:
Dialectal Word
Valid diacritization
Meaning
and
‫ جوز‬/ǧūɀ/
‫ قريت‬/qrīt/
‫ َج َّو ْز‬/ǧawwaɀ/
‫ ُجو ْز‬/ǧūwɀ/
You or he spent
‫ ُجو ْز‬/ǧūwɀ/
‫يت‬
ْ ‫ ق ِْر‬/qrīt/
Nut
I or you read/studied
‫يت‬
ْ ‫ َق ِ ّر‬/qarrīt/
I or you teached
You pass
Results for MSA (WER/DER)
Corpus
Table :Example of multiple diacritizations of dialectal word
4. Baseline system
WER
DER
WER
DER
Substutution
16.2
1.8
23.1
0.5
Deletion
0.0
1.9
0.0
5.1
Insertion
0.0
0.3
0.0
0.1
Total(WER/DER)
16.2
4.1
23.1
5.7
The data
Using SMT approach assumes the availability of the parallel corpora for source
and target languages. To build such a resource in our case, it is enough to get a
vocalized corpus when available and proceed to remove diacritics from it or in the
other case, diacritize an unvolcalized corpus. The two ways were exploited in our
work, the first one for MSA and the second one for Algiers dialect.
Corpus
Tashkeela
Description
Collection of classical Arabic books
downloaded from an on-line library
LDC Arabic Treebank
(Part3,V1.0)
Set of 600 documents collected from
Annahar News Texts.
Algiers dialect corpus
Set of transcripted (by hand) movies in
algiers dialect language.
Table: Data description
Size
More than
6M of words
340K
words.
23K words
 Most errors are concentrated on case endings of words relative to short vowels
and doubled case endings. When ignoring them, the WER and DER decrease
with more than 50% for both corpora.
Results for MSA (Precision/Recall)
Level
WER
DER
Substitution
25.8
0.1
Deletion
0.0
12.6
Insertion
0.0
0.1
Total (WER/DER)
25.8
12.8
Table: Word and Diacritization Error Rate Summary (Algiers dialect)
 Results for Algiers dialect (Precision/Recall)
Level
Word
Word
Character
Corpus
Recall
Precision
Recall
Precision
Algiers Dialect
74.2%
96.3%
87.2%
98%
Table: Precision and recall Summary (Algiers Dialect)
 High precision rates compared to recorded rates for MSA corpora, due
mainly to absence of words case endings in Algiers dialect and the multiple
diacritization of a word in this language is less important than it is in MSA.
6. Conclusion
Table: Word and Diacritization Error Rate Summary (MSA)
An SMT system for diacritization based on parallel corpora of diacritized and
undiacritized texts is built. The system is phrase based with default settings:
bidirectional phrase and lexical translation probabilities, distortion model (with
seven features), a word and a phrase penalty and a language model. The system
consists of a word alignment between source and target. A phrase table with
undiacritized and diacritized entries, a trigram language model trained on
diacritized texts.
Distribution
LDC ATB
Distribution
 Building SMT system for diacritization

Tashkeela
 For comparison, many tests were run for small MSA corpora extracted from
Tashkeela and ATB with the same size order as Algiers dialect corpus,
recorded WERs and DERs were respectively more than 56.5 and 20.5 at each
time and for both corpora.
Character
Corpus
Recall
Precision
Recall
Precision
Tashkeela
83.8%
85.2%
95.9%
93.1%
ATB
76.9%
89.2%
94.3%
96%
Table: Precision and recall Summary(MSA)
 In terms of precision, slight decrease of Tashkeela rates is noticed compared
with ATB rates.
 Results for Algiers dialect (WER/DER)
 Higher WER and DER than for MSA experiments due to the smallness of the
Algiers dialect corpus.
 DER at character level mostly concentrated on the deletion rate due to
absence of words in the training data.
 Several tests were performed by increasing the amount of data at every test,
when the corpus size increases by a little percentage, the WER and DER
decrease also.
 A statistical approach for diacritization of Algiers dialect texts based on a
machine translation system .
 The solution was experimented on MSA corpora, for positioning the
results against other works.
 No use of any other NLP tools (even if they are available for Arabic
language) in order to be in the same conditions for Algiers dialect which is
an non-resourced language.
 Regards to the small amount of training data for the dialect, acceptable
WER and DER were got when compared to the results with small MSA
corpora.
 For Arabic language side, the results for Tashkeela corpus were slightly
better compared to those for Arabic Treebank corpus. This is mainly due to
the size of the two corpora and the nature of data of each one (the ATB
consists of several kinds of texts with entirely different topics, data disparity
in its case is considerable whereas Tashkeela corpus contains classical
books related to theology).
 In terms of precision, in spite of the small amount of available data for the
dialect, a higher percentage were obtained compared to MSA corpora
(96.3% vs.85.2% and 89.2%) at word level and (98% vs. 93.1% and 96%) at
character level.
 Considering this interesting precision rate, this system will be used for the
vocalization of the rest of the dialectal corpus, this will save human effort and
time since a little amount of the data needs to be corrected (by hand).
 This iterative process will be used for enriching our dialect corpus for
getting better results with few efforts.
Interspeech 2013
Téléchargement