Diacrics restoraon for Arabic dialect texts
1. Introduction
Salilma Harrat
Mourad Abbas
Karima Meouh
Kamel Samili
ENS Bouzareah, Algiers, Algeria
CRSTDLA, Algiers, Algeria
Badji Mokhtar University, Annaba, Algeria
Campus scien()que LORIA , Nancy, France
Vocalization, diactritization or diacritics restoration is one of the major challenges 
for the Arabic  natural language processing.  For Arabic language (even for other 
languages),  it  remains  an  unresolved  problem.  Algiers  dialect  as  an  Arabic 
language  is  concerned  by  this  problem  which  produces  several  challenges  for 
many  automatic  tasks  on  this  language.  we  present  an  automatic  diacritization 
system for Arabic texts based on statistical approach. We attempt to use available 
tools for statistical machine translation for building such a system which basically 
does not require any linguistic knowledge. We began by working on MSA texts for 
many  reasons: Algiers  dialect  is  an Arabic  language  which  obeys  to  almost  the 
same  writing  rules  of  MSA. Availability  of  diacritized  texts  in  MSA  allows  to  test 
our system on a large amount of data, which is not the case for Algiers dialect. 
Finally,  we  worked  first  on  MSA  texts  because  of  the  available results for  many 
works in this field.
2. Arabic language
  A  Semitic  Language  with  consonantal  alphabet  (which  denotes  only 
consonants).  It  Consists  of  28  letters  denoting  consonants  and  three  long 
vowels ( a,  w and  y). 
 It includes short vowels (  a,  u,  i ) and other phonetic symbols which are 
represented by diacritics (strokes placed above or below consonants and long 
vowels).
  Short  vowels  can  appear  anywhere  in  the  word,  the  Tanween  represents 
doubled case endings.
 Arabic diacritics include syllabification marks: the Shadda or gemination mark 
denotes  a  double  consonant  (it  could  be  combined  with  short  vowels  and 
doubled case endings), and the Sukun which denotes the absence of vowel. 
Diacritized consonant /b/ Name Pronunciation
 Fatha /ba/
 Damma /bu/
 Kasra /bi/
 Tanween Fatha /ban/
 Tanween Damma /bun/
 Tanween Kasra /bin/
 Shadda /bb/
 Sukun /b/
Table: Arabic diacritics and their pronunciationnn
Arabic  diacritics  are  used  for  disambiguation.  A  word  without  diacritics  could 
have many valid diactritizations (depending on its grammatical category) which 
generates several interpretations.
Diacritic form Meaning  Grammatical category
 
  fassara he explained  Verb (active voice)
  
    fussira was explained  Verb (passive voice)
    fasir so walk  Conjunction+ verb
   
   fasirrun and a secret  Conjunction+noun
   
   fasrun statement  Noun
Table: Some possible diacritizations of Arabic root     /fsr/
  Represents  the  dialectal  Arabic  spoken  in  Algiers  and  its  periphery.  It  is 
different from the dialects spoken  in the other areas of the country.
 Simplifies the morphological and syntactic rules of the written Arabic.
 Uses the Arabic alphabet; however the use of some letters like   /ẓ/,  /ḍ/ and 
   /ṯ/ is rare.
 Uses some non-Arabic letters like  /p/ as in the word    (dad). 
  Besides  phonological  alteration  of  words,  Algiers  dialect  drops  the  case 
endings of  the written  language which are  replaced by the Sukun (absence of 
vowel), this simplifies diacritization process but generates ambiguity at syntactic 
level. 
 It uses all Arabic diacritics except the Tanween doubled case endings.
 Like written Arabic, diacritics in Algiers dialect are used for disambiguation, a 
word without  diacritics  could accept  many  valid  diacritizations and  then  if  they 
are missing, it would be difficult to understand its meaning.
3. Algiers language
Dialectal Word Valid diacritization  Meaning
 /ǧūɀ/
 /ǧawwaɀ/ You or he spent
 /ǧūwɀ/ You pass
 /ǧūwɀ/ Nut
 /qrīt/ 
 /qrīt/ I or you read/studied
 /qarrīt/ I or you teached
Table :Example of multiple diacritizations of dialectal word
4. Baseline system
An  SMT  system  for  diacritization  based  on  parallel  corpora    of  diacritized  and 
undiacritized  texts  is  built.  The  system  is  phrase  based  with  default  settings: 
bidirectional  phrase  and  lexical  translation  probabilities,  distortion  model  (with 
seven features), a word and a phrase penalty and a language model. The system 
consists  of  a  word  alignment  between  source  and  target.  A  phrase  table  with 
undiacritized  and  diacritized  entries,  a  trigram  language  model  trained  on 
diacritized texts.
Using  SMT  approach  assumes  the  availability  of  the  parallel  corpora  for  source 
and target languages. To build such a resource in our case, it is enough to get a 
vocalized corpus when available and proceed to remove diacritics from it or in the 
other case, diacritize an unvolcalized corpus. The two ways were exploited in our 
work, the first one for MSA and the second one for Algiers dialect.
 Building SMT system for diacri(za(on
  The data
Corpus Description Size
Tashkeela Collection of classical Arabic books  
downloaded from an on-line library 
More than 
6M of words
LDC Arabic  Treebank 
(Part3,V1.0)
Set of 600 documents collected from 
Annahar News Texts. 
340K 
words.
Algiers dialect corpus Set of transcripted (by hand) movies in 
algiers dialect language.  23K words
Table: Data description
Each  corpus  was  split  into  training  (80%),  developing(10%)  and  testing(10%)   
sets by randomly allocating a set of documents to each task.
5. Expriments
  WER and DER
  WER (Word Error Rate): percentage of incorrectly diacritized words (delimited 
by white-space). 
      DER  (Diacritization  Error    Rate):  percentage  of  incorrectly  diacritized 
characters.
Corpus Tashkeela LDC ATB
Distribution WER DER WER DER
Substutution 16.2 1.8 23.1 0.5
Deletion 0.0 1.9 0.0 5.1
Insertion 0.0 0.3 0.0 0.1
Total(WER/DER) 16.2 4.1 23.1 5.7
Table: Word and Diacritization Error Rate Summary (MSA)
  Most errors are concentrated on case endings of words relative to short vowels 
and  doubled  case  endings.  When  ignoring  them,  the  WER  and  DER  decrease 
with more than 50% for both corpora.
  Precision and Recall
   Recall rates at  word and character level could be deduced from WER and 
DER respectively, they correspond to: 
                                                             and
  Results for MSA (Precision/Recall)
Level Word Character
Corpus Recall Precision Recall Precision
Tashkeela 83.8% 85.2% 95.9% 93.1%
ATB 76.9% 89.2% 94.3% 96%
Table: Precision and recall Summary(MSA)
  In terms of precision, slight decrease of Tashkeela rates is noticed compared 
with ATB rates.
  Results for Algiers dialect (WER/DER)
Distribution WER DER
Substitution 25.8 0.1
Deletion 0.0 12.6
Insertion 0.0 0.1
Total (WER/DER) 25.8 12.8
Table: Word and Diacritization Error Rate Summary  (Algiers dialect)
  For comparison, many tests were run for small MSA corpora extracted from 
Tashkeela  and  ATB  with  the  same  size  order  as  Algiers  dialect  corpus, 
recorded WERs and DERs were respectively more than 56.5 and 20.5 at each 
time and for both corpora.
   Results for Algiers dialect (Precision/Recall)
Level Word Character
Corpus Recall Precision Recall Precision
Algiers Dialect 74.2% 96.3% 87.2% 98%
Table: Precision and recall Summary (Algiers Dialect)
    High  precision    rates  compared  to  recorded  rates  for  MSA  corpora,  due 
mainly  to  absence  of  words  case  endings  in Algiers  dialect  and  the  multiple 
diacritization of a word in this language is less important than it is in MSA.
  A statistical approach for diacritization of Algiers dialect texts based on a 
machine translation system .
    The  solution  was  experimented  on  MSA  corpora,  for  positioning  the 
results against other works. 
    No  use  of  any  other  NLP  tools  (even  if  they  are  available  for  Arabic 
language) in order to be in the  same conditions for Algiers dialect which is 
an non-resourced language. 
    Regards  to  the  small amount  of  training data  for  the  dialect,  acceptable 
WER  and  DER  were  got  when  compared  to  the  results  with  small  MSA 
corpora. 
   For Arabic language side, the results for Tashkeela corpus were slightly 
better compared to those for Arabic Treebank corpus. This is mainly due to 
the  size  of  the  two  corpora  and  the  nature  of  data  of  each  one  (the ATB 
consists of several kinds of texts with entirely different topics, data disparity 
in  its  case  is  considerable  whereas  Tashkeela  corpus  contains  classical 
books related to theology).
  In terms of precision, in spite of the small amount of available data for the 
dialect,  a  higher  percentage  were  obtained  compared  to  MSA  corpora 
(96.3% vs.85.2% and 89.2%) at word level and (98% vs. 93.1% and 96%) at 
character level.
  Considering this interesting precision rate, this system will be used for the 
vocalization of the rest of the dialectal corpus, this will save human effort and 
time since a little amount of the data needs to be corrected (by hand).
    This  iterative  process  will  be  used  for  enriching  our  dialect  corpus  for 
getting better results with few efforts.
 6. Conclusion
  Higher WER and DER than for MSA experiments due to the smallness of the 
Algiers dialect corpus.
    DER  at  character  level  mostly  concentrated  on  the  deletion  rate  due  to 
absence of words in the training data.
  Several tests were performed by increasing the amount of data at every test,  
when  the  corpus  size  increases  by  a  little  percentage,  the  WER  and  DER 
decrease also. 
Interspeech 2013
  Results for MSA (WER/DER)