ARAN: Automatic Romanizer of Arabic Names
The process of automatically converting unvocalized Arabic to a Roman script representation, called romanization, and such related operations as adding vowels to unvocalized Arabic, called vocalization, are challenging tasks to which there is no definitive solution. This document describes a system for automatically romanizing Arabic names, called ARAN for Automatic Romanizer of Arabic Names, and some of the relevant linguistic issues. For example, ARAN can romanize a name like قابوس into a large variety of systems, such as /qaabuus/ (phonemic), Qabous (popular), \qAbws\ (Buckwalter), and [qɑːbuːs] (IPA).
Developed by our team of experts on Arabic orthography and phonology, ARAN is a versatile system that performs a full range of computational linguistic tasks required for processing Arabic names. Though the focus is on processing Arabic names, it can for the most part be applied to processing Arabic texts in general. ARAN consists of multiple modules that perform such tasks as phonetic and phonemic transcription, transliteration, name variant generation, vocalization, code conversion and language identification.
Our institute has spared no effort to tackle every aspect of the many tough linguistic challenges by doing meticulous research and analysis, by writing sophisticated algorithms, and by building comprehensive mapping tables. We are confident that ARAN, along with its sister resource NANA (Non-Arabic Name Arabizer), represents the best of romanization and arabization technology today.
Ultimately, fully disambiguating an Arabic string requires a software tool to "understand" the text based on a semantic/syntactic analysis of the context. Though ARAN does not do that yet, it is nonetheless a highly practical tool that meets the needs of identifying, processing and normalizing names and their numerous variants in a variety of real-world applications, such as:
- Information Retrieval, such as query processing by search engines.
- Named Entity Recognition and information extraction.
- Machine Translation, such as transcribing unknown proper nouns.
- Anti-money laundering and fraud detection by financial institutions.
- Security applications such as anti-terrorism watch lists and retail fraud.
- Cyber security applications such as preventing identity theft.
- Law enforcement applications including most-wanted lists and deportation lists.
Romanization of Arabic has such uses as:
- Storage and manipulation of Arabic on platforms that don't support Arabic.
- Entering Arabic with ordinary keyboards on systems that only support ASCII.
- Enabling non-Arabic speakers to read Arabic in romanized transcription.
- Aiding language learners unfamiliar with the Arabic alphabet.
- Cross-Language Information Retrieval (CLIR) of names by entering romanized strings.
Why is Arabic ambiguous?
The Arabic script is a member of a class of Semitic scripts known as abjads. A distinguishing feature of abjads in general, and of Arabic in particular, is that words are written as a string of consonants with little or no indication of vowels. This is referred to as unvocalized Arabic (or unvoweled Arabic). Though diacritics, and some consonants, are used to indicate vowels, these are sparsely used. On the whole, unvocalized Arabic is ambiguous, in some cases highly ambiguous, posing significant challenges to Arabic information processing.
For example, the two letters مو \mw\ can theoretically represent 25 legitimate consonant-vowel permutations, such as mawa, mawwa, mawi, mawwi, mawu, mawwu, maw, maww, miwa, miwwa, etc. Humans can normally disambiguate this by context, but for a computer program the task is formidable. An example of an ambiguous unvocalized word is كاتب \kAtb\, which can represent any of the seven vocalized wordforms below:
- كَاتِب /kaatib/
- كَاتَبَ /kaataba/
- كَاتِبٍ /kaatibin/
- كَاتِبٌ /kaatibun/
- كَاتِبَ /kaatiba/
- كَاتِبِ /kaatibi/
- كَاتِبُ /kaatibu/
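The combinatorial growth behind the two-letter example مو \mw\ above can be illustrated with a minimal sketch. The counting model here is an assumption, not the authors' method: the first consonant takes one of three short vowels, the second consonant is either single or geminated, and it is followed by a short vowel or by nothing. The function name `candidate_readings` is hypothetical. This simple model yields 24 readings, close to the figure of 25 cited above; the text does not spell out how its own count is obtained.

```python
from itertools import product

SHORT_VOWELS = ["a", "i", "u"]

def candidate_readings(first: str, second: str) -> list[str]:
    """Enumerate simple vowel patterns for a two-consonant skeleton.

    Assumed model (not the authors' counting): the first consonant takes a
    short vowel, the second consonant may be single or geminated, and may be
    followed by a short vowel or by nothing.
    """
    readings = []
    endings = SHORT_VOWELS + [""]
    for v1, geminate, v2 in product(SHORT_VOWELS, (False, True), endings):
        middle = second * 2 if geminate else second
        readings.append(first + v1 + middle + v2)
    return readings

readings = candidate_readings("m", "w")
print(len(readings))   # 3 * 2 * 4 = 24 under this model
print(readings[:8])    # the eight readings that start with "ma"
```

Even this toy model shows why longer unvocalized strings quickly become unmanageable without lexical or statistical knowledge: every added consonant multiplies the number of readings.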
The main reason for this ambiguity is that Arabic is a highly inflected language. Inflection is indicated by changing the vowel patterns as well as by adding various suffixes, prefixes, and clitics. A full paradigm for كَاتِب /kaatib/ 'writer' that we created (for a comprehensive Arabic-English dictionary project) reaches a staggering total of 3,487 (out of a theoretical 10,541) vocalized forms, including identical forms of distinct function (called inflectional syncretism) and sense.
There is much confusion surrounding such terms as transliteration, transcription, and romanization. It is important to understand these concepts correctly. In the definitions below, the common name Muhammad, written محمد in Arabic script, is used for illustration. More information is available at Transliteration and Transcription Technology.
Romanization is the representation of a language written in a non-Roman script, such as Chinese or Arabic, in the Roman or Latin alphabet. This includes transliteration and the various types of transcription described below.
|Arabic Letter||Contextual Form||Transliteration||Letter Name|
Transliteration is a representation of the script of a source language by using the characters of another script. It aims to represent the letters (graphemes), rather than the sounds (phonemes), of the source language, by one (sometimes multiple) characters in an unambiguous way. For example, محمد is transliterated as \mHmd\, with each Arabic letter represented unambiguously by one Roman character, as shown at right:
In good transliteration systems there is a one-to-one correspondence that enables round-trip conversion. A widely used system for transliterating Arabic on a letter-by-letter basis is the excellent Buckwalter transliteration.
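The one-to-one property is what makes round-trip conversion possible. The following is a minimal sketch of the standard Buckwalter table implemented with Python's `str.translate`; the function names `to_buckwalter` and `from_buckwalter` are hypothetical and are not ARAN's API. The reverse function assumes its input is pure Buckwalter text, since it would also convert ordinary Latin letters.

```python
# Buckwalter transliteration table (Arabic code point -> ASCII symbol).
BUCKWALTER = {
    "\u0621": "'", "\u0622": "|", "\u0623": ">", "\u0624": "&",
    "\u0625": "<", "\u0626": "}", "\u0627": "A", "\u0628": "b",
    "\u0629": "p", "\u062A": "t", "\u062B": "v", "\u062C": "j",
    "\u062D": "H", "\u062E": "x", "\u062F": "d", "\u0630": "*",
    "\u0631": "r", "\u0632": "z", "\u0633": "s", "\u0634": "$",
    "\u0635": "S", "\u0636": "D", "\u0637": "T", "\u0638": "Z",
    "\u0639": "E", "\u063A": "g", "\u0640": "_", "\u0641": "f",
    "\u0642": "q", "\u0643": "k", "\u0644": "l", "\u0645": "m",
    "\u0646": "n", "\u0647": "h", "\u0648": "w", "\u0649": "Y",
    "\u064A": "y", "\u064B": "F", "\u064C": "N", "\u064D": "K",
    "\u064E": "a", "\u064F": "u", "\u0650": "i", "\u0651": "~",
    "\u0652": "o", "\u0670": "`", "\u0671": "{",
}

# Because every symbol is unique, the table can simply be inverted.
_TO_BW = str.maketrans(BUCKWALTER)
_FROM_BW = str.maketrans({latin: arabic for arabic, latin in BUCKWALTER.items()})

def to_buckwalter(text: str) -> str:
    """Replace each Arabic letter or diacritic with its Buckwalter symbol."""
    return text.translate(_TO_BW)

def from_buckwalter(text: str) -> str:
    """Inverse mapping; assumes the input contains only Buckwalter symbols."""
    return text.translate(_FROM_BW)

name = "\u0645\u062D\u0645\u062F"          # محمد
print(to_buckwalter(name))                  # mHmd
print(from_buckwalter("mHmd") == name)      # True
```

Characters outside the table, such as spaces and digits, pass through unchanged in both directions.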
Note that the term transliteration is often misleadingly used in the sense of transcription, which is very confusing and should be avoided.
Transcription is a representation of the source script of a language in the target script in a manner that reflects the pronunciation of the original, often ignoring graphemic (character-to-character) correspondence. There are three kinds of transcription:
1. Phonetic Transcription
A set of symbols is used to represent the actual speech sounds (phones) of the source language, including allophones (predictable variants of a phoneme). The most precise and well known of these is the International Phonetic Alphabet (IPA). For example, محمد is phonetically transcribed as [muħɛ̈mmɛ̈d], a fairly accurate representation of how that name is actually pronounced.
2. Phonemic Transcription
Also called phonological transcription, this is a notation used to represent the phonemes of the source language (ignoring allophones), ideally on a one-to-one basis. For example, محمد is phonemically transcribed as /muHammad/. The a represents the phoneme /a/, an abstract unit, rather than the actual sound (phone) [ɛ̈].
3. Popular Transcription
A conventionalized orthography, often inconsistent and devised by non-natives (or even by Arabists) with a shallow knowledge of Arabic phonology, that attempts to roughly represent the pronunciation of the original. For example, محمد is transcribed in some 200 different ways, such as Mohammed, Muhammad, Moohammad, Moohamad, Mohammad, Mohamad, etc.
Vocalization is the process of automatically adding vowels to unvocalized Arabic. For example, the unvocalized محمد \mHmd\ is vocalized as مُحَمَّد \muHam~ad\. Note the four diacritics that were added in the vocalized version. This is difficult to do even for native speakers unless trained in Arabic phonology. For a computer program, the high level of ambiguity makes it extremely challenging.
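While adding diacritics is hard, the reverse direction (removing them) is deterministic, and it makes the relationship between the two forms concrete. The sketch below counts and strips the diacritics in the vocalized form of محمد. The helper names are hypothetical; the code point range used is the Unicode block of Arabic tanwin, short vowel, shadda and sukun marks, plus the superscript alif.

```python
# Arabic diacritics: tanwin, short vowels, shadda, sukun (U+064B..U+0652),
# plus the superscript (dagger) alif U+0670.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}

def count_diacritics(text: str) -> int:
    """Number of diacritic marks in the text."""
    return sum(ch in DIACRITICS for ch in text)

def strip_diacritics(text: str) -> str:
    """Return the unvocalized form by dropping every diacritic mark."""
    return "".join(ch for ch in text if ch not in DIACRITICS)

# Vocalized Muhammad: damma, fatha, shadda, fatha on the consonants m-H-m-d.
vocalized = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"
unvocalized = "\u0645\u062D\u0645\u062F"

print(count_diacritics(vocalized))                  # 4
print(strip_diacritics(vocalized) == unvocalized)   # True
```

Many distinct vocalized strings collapse to the same unvocalized string under `strip_diacritics`, which is exactly the information a vocalizer has to recover.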
As used here, arabization refers to the process of automatically converting an Arabic or non-Arabic name written in the Latin or CJK native script into Arabic script. For example, Muhammad → محمد, Jack → جاك, and 埼玉 (Saitama) → سايتاما.
Arabic is written mostly in unvocalized script, which is why it is so difficult to transcribe and is the raison d'être for the ARAN system. Vocalized Arabic is found in the Koran, children's books, and didactic materials such as dictionaries. The Koran is fully vocalized (explicit short vowels, gemination, nunation etc.), but in other cases one often encounters partially vocalized or semivocalized texts.
ARAN supports three modes of vocalization: unvocalized, semivocalized, and fully vocalized, as shown at right:
Transcribing vocalized and semivocalized Arabic is considerably easier than transcribing unvocalized Arabic. However, it requires a different set of rules. Similarly, vocalizing unvocalized Arabic is just as difficult as transcribing it, but again requires a different set of rules. Each ARAN module has a knowledge base that captures the precise rules for the different vocalization modes.
ARAN: Automatic Romanizer of Arabic Names
Basic Goals and Methodology
ARAN aims to provide a robust solution to the difficult task of romanizing Arabic names, including all the transcription subtypes described above. CJKI is engaged in ongoing research and development efforts to enhance the functionality of the various ARAN modules, especially ATAN, ARAN's core module for generating phonemic and popular transcriptions. The main emphasis is on automatically transcribing unvocalized Arabic names into as many popular romanized variants as possible.
The most difficult challenge, the core problem to which ARAN provides a solution, is to make an intelligent guess at determining the vowels of unvocalized Arabic names and generating a list of likely candidates on the basis of statistical models and in-depth analysis of Arabic orthography. If a name is not found in our comprehensive Database of Arab Names (DAN), variants are generated in various romanization systems by linguistically advanced algorithms using a sophisticated knowledge base that captures the rules of Arabic orthography. DAN now has approximately six and a half million entries.
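The lookup-then-generate flow described above can be sketched as follows. Everything here is a hypothetical stand-in: `romanize`, the toy database, and the placeholder generator are illustrations only and do not reflect the actual structure of DAN or of ARAN's rule-based algorithms.

```python
from typing import Callable

def romanize(name: str,
             database: dict[str, list[str]],
             generate: Callable[[str], list[str]]) -> list[str]:
    """Return attested romanizations if the name is in the database;
    otherwise fall back to rule-based candidate generation."""
    known = database.get(name)
    if known:
        return list(known)
    return generate(name)

# Hypothetical stand-ins for the name database and the rule-based generator.
toy_database = {"\u0642\u0627\u0628\u0648\u0633": ["Qaboos", "Qabus", "Qabous"]}

def toy_generator(name: str) -> list[str]:
    # Placeholder: a real generator would apply orthographic rules and
    # rank the candidates with a statistical model.
    return [f"<generated candidates for {name}>"]

print(romanize("\u0642\u0627\u0628\u0648\u0633", toy_database, toy_generator))
print(romanize("\u0645\u062D\u0645\u062F", toy_database, toy_generator))
```

Keeping the generator as a pluggable argument separates the attested data from the guessing logic, so each can be improved independently.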
ARAN consists of the following components, described in more detail in the sections below:
- ATAN: Automatic Transcriber of Arabic Names
- AXAN: Automatic Transliterator of Arabic Names
- APAN: Automatic Phoneticizer of Arabic Names
- ADAN: Automatic Diacriticizer of Arabic Names
- AVAN: Automatic Variant Generator for Arabic Names
- AEAN: Automatic Encoder of Arabic Names
- AIAN: Automatic Identifier of ASBL Names
- ACAN: Automatic Converter of ASBL Names
The table at right illustrates the conversion processes performed by the principal ARAN modules using the Arabic name Qaboos (قابوس) as an example. It shows the data input to each module and the resulting output after processing. Each module is further described in more detail in the sections below. To get an overview of ARAN's features and capabilities, please study this table carefully.
|Conversion process||ARAN module||Input||Output||Remarks|
|Phonemic Transcription||ATAN||قابوس||/qaabuus/||linguistic representation of phonemes|
|English Transcription||ATAN||قابوس||Qaboos||"Standard" English spelling|
|Popular Transcriptions||ATAN||قابوس||Qabuus, Qabus, Qabous, Qabooss, Qaaboos, Kaboos, Kabuus, Gabous...||some of the many popular variants|
|Phonetic Transcription||APAN||قابوس||[qɑːbuːs]||scientific transcription in IPA|
|Unvocalized Transliteration||AXAN||قابوس||\qAbws\||Buckwalter transliteration of unvocalized Arabic|
|Vocalized Transliteration||AXAN||قَابُوس||\qaAbuws\||Buckwalter transliteration of vocalized Arabic|
|Diacriticization||ADAN||قابوس||قَابُوس||adding vowels (vocalization) and diacritics to unvocalized Arabic|
|قابوس||converting non-Arabic to Arabic script|
ATAN: Automatic Transcriber of Arabic Names
The Automatic Transcriber of Arabic Names, or ATAN for short, is ARAN's core module for generating phonemic and popular transcriptions of Arabic personal names.
Because of the inconsistent nature of the various popular Arabic romanization systems, there are often many, sometimes dozens or even hundreds, of romanizations for the same name. ATAN supports most of the commonly used systems, and has a flexible architecture that enables the user to configure the system to support user-defined systems.
The table below shows some of the major romanization systems. Though transcription is handled by the ATAN module and transliteration by the AXAN module, examples of both are given below for convenience.
|ALA-LC||shwlwkh||Romanization standard of the American Library Association - Library of Congress.|
|IC||Shulukh||Intelligence Community Standard.|
|DIN||šūlūḫ||DIN 31635, the DIN standard for Arabic transliteration.|
|BGN/PCGN||Shūlūkh||The official system adopted by the U.S. Board on Geographic Names (BGN) and the Permanent Committee on Geographical Names (PCGN).|
|IPA||ʃuːluːx||International Phonetic Alphabet, a scientific system of representing speech sounds.|
|English||Shoulokh||One of many possible popular transcriptions.|
|Buckwalter||$wlwx||A strict transliteration system widely used in information processing.|
In addition to the systems shown above, there are others not shown here, such as Deutsche Morgenländische Gesellschaft, ISO/R 233, SATTS, and many more, that will be supported by the ATAN and AXAN modules.
AXAN: Automatic Transliterator of Arabic Names
The Automatic Transliterator of Arabic Names, or AXAN for short, generates transliterations of Arabic names or any other Arabic text. There are few strict transliteration systems; that is, systems that use unique symbols for each letter and allow for round-trip conversion. The excellent and widely used Buckwalter transliteration system is not only supported by AXAN, but is also used for internal processing in all ARAN databases and algorithms. AXAN can be configured to support other transliteration systems, including Cyrillization, by adding custom mapping tables. Examples are shown in the table in the ATAN section above.
A table comparing romanization systems can be found at this Wikipedia article.
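Whether a custom mapping table qualifies as a strict transliteration can be checked mechanically. The sketch below is an assumption about how such a check might look, not AXAN's implementation: `build_transliterator` is a hypothetical helper that accepts a table only if every source letter maps to a single, unique target character, which is a simple sufficient condition for round-trip conversion. The four-letter table is a hypothetical excerpt using DIN-style symbols.

```python
def build_transliterator(table: dict[str, str]):
    """Return (encode, decode) functions for a strict mapping table.

    Raises ValueError if the table cannot guarantee round-trip conversion
    under the simple rule used here: one character in, one unique
    character out.
    """
    for src, dst in table.items():
        if len(src) != 1 or len(dst) != 1:
            raise ValueError(f"not a one-to-one character mapping: {src!r} -> {dst!r}")
    if len(set(table.values())) != len(table):
        raise ValueError("two source letters share the same target symbol")

    forward = str.maketrans(table)
    backward = str.maketrans({dst: src for src, dst in table.items()})

    def encode(text: str) -> str:
        return text.translate(forward)

    def decode(text: str) -> str:
        return text.translate(backward)

    return encode, decode

# Hypothetical excerpt of a custom table: sheen, khaa', laam, waaw.
custom = {"\u0634": "\u0161", "\u062E": "\u1E2B", "\u0644": "l", "\u0648": "w"}
encode, decode = build_transliterator(custom)

word = "\u0634\u0648\u0644\u0648\u062E"
print(encode(word))                    # šwlwḫ
print(decode(encode(word)) == word)    # True
```

A digraph such as "sh" for sheen is rejected by this check, because the sequence could equally stand for siin followed by haa', which is why digraph-based systems are transcriptions rather than strict transliterations.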
APAN: Automatic Phoneticizer of Arabic Names
The Automatic Phoneticizer of Arabic Names, or APAN for short, generates phonetic transcriptions of Arabic names in IPA. This represents the actual pronunciation in Modern Standard Arabic (MSA), including distinctions between the major allophones. APAN can be configured to generate transcriptions in various flavors of MSA pronunciation, e.g. the Saudi, Egyptian and Levantine flavors. The term flavors refers to variations in the pronunciation of MSA in various regions of the Arab world, and is not to be confused with Arabic dialects.
For example, the name قابوس Qaboos is transcribed phonetically as [qɑːbuːs]. Note that the phonemic transcription /qaabuus/ generated by ATAN indicates the long vowel a by /aa/ and does not indicate the phonetic details of that vowel other than that it is long, a phonemic distinction. In contrast, the IPA phonetic transcription generated by APAN for this vowel is [ɑː], distinguishing it from its more common realization [æː], since [ɑː] is an allophonic variant of /aa/ that occurs after the uvular stop [q]. Thus the phonemic transcription /aa/ represents a single phoneme, which can be realized phonetically as [æː] or [ɑː].
This is further illustrated by the table below:
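The allophone rule for /aa/ described above can be expressed as a small sketch. It is deliberately limited: only the one context named in the text (after the uvular stop q) selects the back allophone, short vowels and consonants are passed through unchanged, and the function name `phonemic_to_ipa` is hypothetical. A real phoneticizer would cover many more contexts and symbols.

```python
# Assumption: only the context named in the text triggers the back
# allophone of /aa/; a real system would list further consonants.
BACKING_CONSONANTS = {"q"}

BACK_LONG_A = "\u0251\u02D0"    # ɑː
FRONT_LONG_A = "\u00E6\u02D0"   # æː
LENGTH_MARK = "\u02D0"          # ː

def phonemic_to_ipa(phonemic: str) -> str:
    """Convert long vowels in a phonemic string to IPA, choosing the
    allophone of /aa/ from the preceding consonant."""
    out = []
    i = 0
    while i < len(phonemic):
        if phonemic.startswith("aa", i):
            previous = phonemic[i - 1] if i > 0 else ""
            out.append(BACK_LONG_A if previous in BACKING_CONSONANTS else FRONT_LONG_A)
            i += 2
        elif phonemic.startswith("uu", i) or phonemic.startswith("ii", i):
            out.append(phonemic[i] + LENGTH_MARK)
            i += 2
        else:
            out.append(phonemic[i])   # consonants and short vowels unchanged
            i += 1
    return "[" + "".join(out) + "]"

print(phonemic_to_ipa("qaabuus"))   # [qɑːbuːs]
print(phonemic_to_ipa("kaatib"))    # [kæːtib]
```

The same phoneme /aa/ thus surfaces as two different phones depending on its left context, which is the distinction between phonemic and phonetic transcription.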
ADAN: Automatic Diacriticizer of Arabic Names
The Automatic Diacriticizer of Arabic Names, or ADAN for short, performs automatic diacriticization; that is, it automatically vocalizes unvocalized or semivocalized Arabic by adding the appropriate vowel signs and other diacritics. For example, the well-known name Muhammed, written محمد \mHmd\ in unvocalized Arabic, is converted into the vocalized version مُحَمَّد \muHam~ad\ (/muHammad/) by adding the diacritics damma, fatha and shadda. This is related to, but distinct from, the equally difficult task of automatically generating a romanized phonemic transcription, which is done by the ATAN module.
Below are some examples of the output from the ADAN module.
AVAN: Automatic Variant Generator for Arabic Names
The many popular transcriptions of Arabic names result in a very large number of variants. One of the main factors contributing to this is that several Arabic consonants do not exist in European languages. These sounds are difficult to pronounce and are rendered in different ways when romanized. Another factor is the vowels, which are transcribed in a bewildering variety of ways, partially due to dialectal variation. For example, the Arabic vowel /u/ in /usama/ is transcribed in such different ways as Usama, Ousama, Osama and Oosama.
For more details on romanized variants of Arabic names, see our Database of Arab Names.
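The way a handful of alternative spellings multiplies into many variants can be shown with a minimal sketch. The table of alternatives below is hypothetical and covers only three phonemes; it is not AVAN's rule set, and the function names are illustrative. Starting from the phonemic transcription /qaabuus/, it takes the Cartesian product of the spelling choices for each unit.

```python
from itertools import product

# Hypothetical spelling alternatives for a few phonemes; illustrative only.
ALTERNATIVES = {
    "q": ["q", "k", "g"],
    "aa": ["a", "aa"],
    "uu": ["u", "uu", "ou", "oo"],
}

def segment(phonemic: str) -> list[str]:
    """Split a phonemic string into units, preferring two-letter long vowels."""
    units = []
    i = 0
    while i < len(phonemic):
        pair = phonemic[i:i + 2]
        if pair in ALTERNATIVES:
            units.append(pair)
            i += 2
        else:
            units.append(phonemic[i])
            i += 1
    return units

def popular_variants(phonemic: str) -> list[str]:
    """All combinations of the alternative spellings, capitalized and sorted."""
    choices = [ALTERNATIVES.get(unit, [unit]) for unit in segment(phonemic)]
    return sorted({"".join(combo).capitalize() for combo in product(*choices)})

variants = popular_variants("qaabuus")
print(len(variants))            # 3 * 2 * 4 = 24
print("Qaboos" in variants)     # True
```

Three small rules already produce 24 spellings of one short name, which is consistent with the observation above that a single name may have dozens or even hundreds of romanizations.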
Arabic Orthographic Variants
The second kind of variant is variation in the Arabic name itself. These variants can be of three kinds:
- Synonyms are alternative expressions that represent the same name, like امريكا \AmrykA\ (America) vs. الولايات الأمريكية المتحدة \AlwlAyAt Al>mrykyp AlmtHdp\ (United States of America).
- Orthographic variants are alternative, non-standard ways to spell a specific variant of a name, like ابو ظبي \Abw Zby\ instead of أبو ظبي \>bw Zby\ for Abu Dhabi, in which the hamza is omitted.
- Orthographic errors are frequently occurring, systematic spelling mistakes, like yaa' in ابو ظبي \Abw Zby\ (Abu Dhabi) being replaced by alif maqsuura in ابو ظبى \Abw ZbY\.
Though the difference between variants and errors cannot be rigorously defined (there may be differences of opinion among native speakers as to what constitutes an error), they are both based on deep statistical and linguistic analysis of contemporary Arabic orthography, and provide fairly exhaustive coverage of Arabic orthographic variation. It should also be noted that the standard form, though linguistically correct, is not necessarily the most common one (we have statistics for the occurrence of each form).
|أبو ظبي||>bw Zby||Abu Dhabi||ابو ظبي||أبو ظبى||V: omit hamza; E: alif maqsura replaces yaa'|
|الإسكندرية||Al<skndryp||Alexandria||الاسكندرية||الإسكندريه||V: omit hamza; E: haa' replaces taa' marbuuTa|
|جدة||jdp||Jeddah||جدّة||جده||V: explicit shadda; E: haa' replaces taa' marbuuTa|
|الأردن||Al>rdn||Jordan||الاردن||V: omit hamza|
|بالو ألتو||bAlw >ltw||Palo Alto||بالو التو||V1: omit hamza; V2: madda replaces hamza|
|الرياض||AlryAD||Riyadh||الرّياض||V: explicit shadda|
|طوكيو||Twkyw||Tokyo||توكيو||E: taa' replaces Taa'|
For details see our Dictionary of Arabic Place Name Variants.
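Several of the variant and error types listed above (omitted hamza, alif maqsuura for yaa', haa' for taa' marbuuTa, explicit shadda) can be neutralized with a simple normalization key, a common technique for matching Arabic strings. The sketch below is a generic illustration, not the method used to build the resources described here, and `normalize` is a hypothetical name. The key is meant only for comparison: it deliberately discards distinctions that matter for correct spelling.

```python
# Map each variant-prone character to a canonical one; None deletes it.
_NORMALIZE = str.maketrans({
    "\u0623": "\u0627",   # alif with hamza above -> bare alif
    "\u0625": "\u0627",   # alif with hamza below -> bare alif
    "\u0622": "\u0627",   # alif with madda       -> bare alif
    "\u0649": "\u064A",   # alif maqsuura         -> yaa'
    "\u0629": "\u0647",   # taa' marbuuTa         -> haa'
    "\u0651": None,       # shadda is dropped
})

def normalize(name: str) -> str:
    """Return a matching key that is identical for common spelling variants."""
    return name.translate(_NORMALIZE)

standard = "\u0623\u0628\u0648 \u0638\u0628\u064A"   # Abu Dhabi, standard form
variant = "\u0627\u0628\u0648 \u0638\u0628\u064A"    # hamza omitted
error = "\u0623\u0628\u0648 \u0638\u0628\u0649"      # alif maqsuura for yaa'

print(normalize(standard) == normalize(variant) == normalize(error))   # True
```

Substitutions between unrelated letters, such as taa' for Taa' in the Tokyo example, are not covered by this kind of key and need a dictionary of attested variants.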
AEAN: Automatic Encoder of Arabic Names
AEAN is a code conversion module that supports various legacy encodings for Arabic, re-encoding the text into UTF-8 or UTF-16. It supports the following encodings:
- ISO 8859-6, the standard 8-bit encoding scheme for Arabic.
- The Arabic Mac Code Page, a superset of ISO 8859-6.
- Microsoft's Arabic DOS Code Page (ASMO 708), also based on ISO 8859-6.
- Microsoft's Arabic Windows code page, based on the ISO 8859-1 (Latin 1) standard.
- Arabic Windows 95 Code Page (CP-1256), which adds support for Persian characters.
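The basic operation of such a converter, decoding legacy bytes and re-encoding them as UTF-8, can be sketched with Python's built-in codecs. This is only an illustration of the principle and not AEAN itself; `to_utf8` is a hypothetical helper, and only two of the encodings above are shown, using Python's codec names `iso8859_6` and `cp1256`.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode legacy-encoded bytes and re-encode them as UTF-8.

    Raises UnicodeDecodeError if the bytes are not valid in the
    given source encoding.
    """
    return raw.decode(source_encoding).encode("utf-8")

text = "\u0645\u062D\u0645\u062F"   # محمد

for encoding in ("iso8859_6", "cp1256"):
    legacy_bytes = text.encode(encoding)        # simulate legacy input
    converted = to_utf8(legacy_bytes, encoding)
    print(encoding, len(legacy_bytes), len(converted), converted.decode("utf-8") == text)
```

Each Arabic letter occupies one byte in the legacy encodings and two bytes in UTF-8, so the converted output is longer even though the text is the same.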
AIAN: Automatic Identifier of ASBL Names
This module enables the automatic identification of a language written in the Arabic script. There are dozens of non-Arabic languages that are or have been written in the Arabic script, referred to as Arabic Script Based Languages (ASBL). The most important of these are:
- Farsi (official language of Iran)
- Pashto (western Pakistan and official language of Afghanistan)
- Dari (Afghan dialect of Farsi, official language of Afghanistan)
- Urdu (official language of Pakistan)
- Kurdish (Turkey, Iraq, Iran, Syria, Armenia, Lebanon)
Others include Shahmukhi (Pakistani version of Punjabi), Kashmiri (India and Pakistan), and Uyghur (northwest China).
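One simple signal for telling these languages apart is the presence of letters that are used in one language's alphabet but not in standard Arabic. The sketch below is a rough heuristic under stated assumptions, not AIAN's method: the letter sets are partial, some letters are shared with languages not listed (for example Uyghur), Farsi and Dari cannot be separated this way, and `guess_language` is a hypothetical name. Short names often contain none of these letters, so a real identifier would need statistical or lexical evidence as well.

```python
# Partial sets of letters that are characteristic of each language.
DISTINCTIVE = {
    "Urdu": set("\u0679\u0688\u0691\u06BA\u06D2\u06C1"),
    "Pashto": set("\u067C\u0689\u0693\u06BC\u069A\u0696\u0685\u0681"),
    "Kurdish": set("\u0695\u06B5\u06C6\u06CE"),
}
# pe, che, zhe, gaf: added for Persian but also used by the languages above.
PERSIAN_LETTERS = set("\u067E\u0686\u0698\u06AF")

def guess_language(text: str) -> str:
    """Guess the language of Arabic-script text from characteristic letters."""
    scores = {
        language: sum(ch in letters for ch in text)
        for language, letters in DISTINCTIVE.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] > 0:
        return best
    if any(ch in PERSIAN_LETTERS for ch in text):
        return "Persian (Farsi/Dari)"
    return "Arabic"

print(guess_language("\u0644\u0691\u06A9\u0627"))    # Urdu
print(guess_language("\u069A\u0681\u0647"))          # Pashto
print(guess_language("\u0645\u062D\u0645\u062F"))    # Arabic
```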
ACAN: Automatic Converter of ASBL Names
ARAN will eventually be expanded to romanize to/from the major Arabic Script Based Languages (ASBL), described in the AIAN section above.