
The CJK Dictionary Institute is engaged in the development and continuous expansion of comprehensive lexical databases for CJK languages and Arabic consisting of approximately eight million entries (see CJK Lexical Resources for details). This document describes our database of Arabic place names. We also maintain a large Database of Arab Names (DAN), with over 2.4 million romanized Arab names and variants.
Though Arabic has become a world language of critical importance, lexical resources, especially for proper nouns, are either scarce or exist only on a small scale. Because of the important role place names play in such natural language applications as named entity recognition (NER) and machine translation (MT), we are continuously expanding and revising our Database of Arabic Place Name Variants (DAP), which provides systematic coverage of Arabic orthographic variants and common orthographic errors.
Although a handful of machine translation packages and data providers offer Arabic place names, their coverage is poor, their data contains many machine-generated errors, and they do not cover variants. Our project may well be the first attempt to build a comprehensive database of Arabic place names that covers the entire world, is accurate and validated, and is based on state-of-the-art techniques in computational lexicography. Please have a look at the data samples shown below.
Identifying, processing and normalizing place names and their numerous variants is useful in a variety of applications, such as:
- Improving the accuracy of English-to-Arabic machine translation by providing the standard, correct Arabic form.
- Improving the accuracy of Arabic-to-English machine translation by identifying variants and errors in the original Arabic text.
- Place name dictionaries for human translators.
- Entity and information extraction.
- Segmentation and morphological analysis of Arabic texts.
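As a rough illustration of the first two applications, variant and error forms can be mapped back to the standard form with a simple lookup. The sketch below is not the institute's implementation: the dictionary names and the functions `normalize_place_name` and `to_english` are hypothetical, and the entries are a hand-copied excerpt of the Abu Dhabi and Alexandria rows from the sample table in this document.

```python
from typing import Optional

# Standard (correct) forms, copied from the sample table.
ABU_DHABI = "أبو ظبي"
ALEXANDRIA = "الإسكندرية"

# Hypothetical excerpt: variant or error form -> standard form.
VARIANT_TO_STANDARD = {
    "ابو ظبي": ABU_DHABI,      # variant: hamza omitted
    "أبو ظبى": ABU_DHABI,      # error: alif maqsuura in place of yaa'
    "ابو ظبى": ABU_DHABI,      # error: both of the above
    "الاسكندرية": ALEXANDRIA,  # variant: hamza omitted
    "الإسكندريه": ALEXANDRIA,  # error: taa' marbuuta replaced by haa'
}

STANDARD_TO_ENGLISH = {ABU_DHABI: "Abu Dhabi", ALEXANDRIA: "Alexandria"}


def normalize_place_name(surface: str) -> str:
    """Return the standard form if the surface form is a known variant."""
    return VARIANT_TO_STANDARD.get(surface, surface)


def to_english(surface: str) -> Optional[str]:
    """Look up the English name after normalizing; None if unknown."""
    return STANDARD_TO_ENGLISH.get(normalize_place_name(surface))


if __name__ == "__main__":
    for form in VARIANT_TO_STANDARD:
        print(form, "->", normalize_place_name(form), "->", to_english(form))
```

Normalizing before lookup means a translation system needs only one entry per place for the Arabic-to-English direction, while the standard form remains available for the English-to-Arabic direction.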
Our database covers both the Arab and non-Arab world, including variants. Only the most common variants are shown in the sample below -- see the next section for more.
| Arabic | Buckwalter Transliteration | English | Variant | Error | Country |
|---|---|---|---|---|---|
| أبو ظبي | >bw Zby | Abu Dhabi | ابو ظبي | أبو ظبى, ابو ظبى | UAE |
| الإسكندرية | Al<skndryp | Alexandria | الاسكندرية | الإسكندريه | Egypt |
| الجزائر | AljzA}r | Algiers | الجزاير | | Algeria |
| برازيليا | brAzylyA | Brasilia | برازيلية | برازيليه | Brazil |
| القاهرة | AlqAhrp | Cairo | | القاهره | Egypt |
| الشرق الاقصى | Al$rq AlAqSY | Far East | | الشرق الاقصي | N/A |
| ألمانيا | >lmAnyA | Germany | المانيا | | Germany |
| الجيزة | Aljyzp | Giza | | الجيزه | Egypt |
| حيفا | HyfA | Haifa | حيفة | | Israel |
| جدة | jdp | Jeddah | جدّة | جده | Saudi Arabia |
| القدس | Alqds | Jerusalem | | | Israel |
| المنامة | AlmnAmp | Manama | | المنامه | Bahrain |
| مكة | mkp | Mecca | | مكه | Saudi Arabia |
| نابلس | nAbls | Nablus | | | Palestinian Territory |
| نانجينغ | nAnjyng | Nanjing | | | China |
| بالو ألتو | bAlw >ltw | Palo Alto | بالو التو, بالو آلتو | | USA |
| الرياض | AlryAD | Riyadh | الرّياض | | Saudi Arabia |
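The Buckwalter column in the table above is a one-to-one, character-by-character mapping from Arabic script to ASCII, so it can be reproduced mechanically. The following is a minimal sketch of that mapping, assuming the standard Buckwalter scheme; the names `BUCKWALTER` and `to_buckwalter` are illustrative and not part of the database. Characters outside the mapping, such as spaces, are passed through unchanged.

```python
def _build_buckwalter_table() -> dict:
    """Build the Arabic -> Buckwalter character map from Unicode ranges."""
    table = {}
    # U+0621..U+063A: hamza forms, alif, baa' ... ghayn
    for offset, latin in enumerate("'|>&<}AbptvjHxd*rzs$SDTZEg"):
        table[chr(0x0621 + offset)] = latin
    table["\u0640"] = "_"  # tatweel
    # U+0641..U+064A: faa' ... yaa'; U+064B..U+0652: tanwin, short vowels,
    # shadda, sukun
    for offset, latin in enumerate("fqklmnhwYyFNKaui~o"):
        table[chr(0x0641 + offset)] = latin
    table["\u0670"] = "`"  # dagger alif
    table["\u0671"] = "{"  # alif wasla
    return table


BUCKWALTER = _build_buckwalter_table()


def to_buckwalter(text: str) -> str:
    """Transliterate Arabic text; unmapped characters are kept as-is."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)


if __name__ == "__main__":
    # Compare the printed values with the Buckwalter column of the table.
    for name in ["أبو ظبي", "الإسكندرية", "القاهرة"]:
        print(name, "->", to_buckwalter(name))
```

Because the mapping is lossless, the transliteration preserves exactly the distinctions (hamza seats, taa' marbuuta versus haa', shadda) that separate standard forms from variants and errors.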
Orthographic variants and errors of well-known place names are shown in the table below. This sample contains American, Egyptian, Emirati, Chinese and Japanese place names. Data is ordered first by English, and then by the web frequency of the Arabic.
Database of Arabic Place Names (sample)

| English | Arabic | Frequency |
|---|---|---|
| Abu Dhabi | أبوظبي | 14194320 |
| Abu Dhabi | ابوظبي | 09564310 |
| Abu Dhabi | أبو ظبي | 06035820 |
| Abu Dhabi | ابو ظبي | 02534770 |
| Abu Dhabi | أبوظبى | 00000436 |
| Abu Dhabi | ابوظبى | 00000435 |
| Abu Dhabi | أبو ظبى | 00000121 |
| Abu Dhabi | ابو ظبى | 00000079 |
| Alexandria | الإسكندرية | 04009670 |
| Alexandria | الاسكندرية | 02553390 |
| Alexandria | الأسكندرية | 00605150 |
| Alexandria | الاسكندريه | 00000439 |
| Alexandria | الأسكندريه | 00000073 |
| Alexandria | الإسكندريه | 00000048 |
| Alexandria | الاسكندريا | 00000020 |
| Alexandria | الاسكندريى | 00000008 |
| Alexandria | الإسكندريا | 00000003 |
| Alexandria | الأسكندريا | 00000002 |
| Fukuoka | فوكوكا | 00044800 |
| Fukuoka | فوكووكا | 00002500 |
| Fukuoka | فوكوأوكا | 00001500 |
| Fukuoka | فكوكا | 00000284 |
| Fukuoka | فوكواوكا | 00000277 |
| Fukuoka | فوكؤوكا | 00000227 |
| Kansas City | كانزاس سيتي | 00001060 |
| Kansas City | كانساس سيتي | 00000781 |
| Kansas City | مدينة كانساس | 00000658 |
| Kansas City | كنساس سيتي | 00000479 |
| Kansas City | مدينة كنساس | 00000332 |
| Kansas City | كانسس سيتي | 00000058 |
| Kansas City | مدينة كانزاس | 00000045 |
| Kansas City | كانزس سيتي | 00000033 |
| Kansas City | مدينة كانسس | 00000021 |
| Kansas City | مدينة كنزاس | 00000008 |
| Kansas City | كنزاس سيتي | 00000007 |
| Nanjing | نانجينغ | 00002550 |
| Nanjing | نانجينج | 00000822 |
| Nanjing | نانكينج | 00000122 |
| Nanjing | نانكينغ | 00000040 |
| Nanjing | نانغينغ | 00000005 |
| New Jersey | نيوجيرسي | 00008410 |
| New Jersey | نيوجرسي | 00008030 |
| New Jersey | نيو جيرسي | 00004470 |
| New Jersey | نيو جرسي | 00001190 |
| New Jersey | نيوجرسى | 00000689 |
| New Jersey | نيوجيرسى | 00000542 |
| New Jersey | نيو جيرسى | 00000440 |
| New Jersey | نيو جرسى | 00000100 |

The table below shows various orthographic variants and common errors for الإسكندرية, the Egyptian city of Alexandria, along with their Google occurrence counts (there are many other variants involving partial vocalization). Our databases are now being expanded to systematically include all orthographic variants and errors, based on statistical analysis of Arabic orthography as it currently occurs in corpora, and often include the fully vocalized versions as well (see Database of Arabic Proper Nouns for a sample).
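The ordering used in the sample frequency table above (first by English name, then by descending web frequency) can be sketched as follows. The tuple layout and the variable names are assumptions made for illustration; the four rows are copied from the Fukuoka and Nanjing entries of the sample, deliberately shuffled.

```python
from itertools import groupby

# (English, Arabic, web frequency): a shuffled excerpt of the sample.
rows = [
    ("Nanjing", "نانجينج", 822),
    ("Fukuoka", "فوكووكا", 2500),
    ("Nanjing", "نانجينغ", 2550),
    ("Fukuoka", "فوكوكا", 44800),
]

# Sort by English name ascending, then by frequency descending.
ordered = sorted(rows, key=lambda row: (row[0], -row[2]))

# After sorting, the first row of each English group is its most
# frequent Arabic spelling.
most_frequent = {
    english: next(group)[1]
    for english, group in groupby(ordered, key=lambda row: row[0])
}

if __name__ == "__main__":
    for english, arabic, frequency in ordered:
        print(f"{english}\t{arabic}\t{frequency:08d}")
```

Note that the most frequent spelling on the web is not necessarily the standard one, which is why frequency and correctness are tracked separately.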
Our Arabic place names are carefully proofread to ensure strict adherence to the complex rules of hamza orthography, something which is often ignored outside of publications of the highest editorial standards. The result of this strict editorial policy is that we can provide not only the linguistically correct standard MSA version, but also all common non-standard and incorrect versions as well, carefully flagged to distinguish between them, as shown in the table below.
| Rank | Type* | Arabic | Buckwalter Transliteration | Frequency | Remarks |
|---|---|---|---|---|---|
| 1 | N | الاسكندرية | AlAskndryp | 02930000 | Normalized, no hamza |
| 2 | S | الإسكندرية | Al<skndryp | 00690000 | Standard form, with hamza |
| 3 | E | الاسكندريه | AlAskndryh | 00089200 | No hamza, taa' marbuuta replaced by haa' |
| 4 | V | الإسكندريّة | Al<skndry~p | 00000954 | Explicit shadda |
| 5 | E | الإسكندريه | Al<skndryh | 00000897 | Taa' marbuuta replaced by haa' |
| 6 | V | الاسكندريّة | AlAskndry~p | 00000245 | No hamza, explicit shadda |
| 7 | E | الاسكندريا | AlAskndryA | 00000080 | No hamza, taa' marbuuta replaced by alif |
| 8 | V | الإسْكَنْدَريَّة | Al<sokanodary~ap | 00000024 | Fully vocalized |
| 9 | E | الاسكندريّه | AlAskndry~h | 00000012 | No hamza, explicit shadda, taa' marbuuta replaced by haa' |
| 10 | E | الإسكندريا | Al<skndryA | 00000007 | Taa' marbuuta replaced by alif tawiila |
| 11 | E | الإسكندريّه | Al<skndry~h | 00000005 | Taa' marbuuta replaced by haa', explicit shadda |

\* V = variant; E = error; S = standard; N = normalized
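The type flags in the table above follow a visible pattern: a replaced taa' marbuuta is flagged as an error, added diacritics make a variant, and a bare form with the hamza dropped is the normalized form. The sketch below encodes that pattern as a simple heuristic. It is an illustration inferred from the eleven rows shown, not the editorial procedure actually used for the database, and the function names are hypothetical.

```python
import re

TAA_MARBUUTA = "\u0629"
# Tanwin, short vowels, shadda and sukun (U+064B..U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")
# Alif with hamza above/below or madda -> bare alif.
HAMZA_FOLD = str.maketrans({"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"})


def strip_diacritics(text: str) -> str:
    """Remove vocalization marks, including shadda."""
    return DIACRITICS.sub("", text)


def fold_hamza(text: str) -> str:
    """Replace hamza-carrying alif forms with a bare alif."""
    return text.translate(HAMZA_FOLD)


def classify(standard: str, form: str) -> str:
    """Return S, E, V, N, or '?' for a form relative to its standard."""
    if form == standard:
        return "S"
    bare = strip_diacritics(form)
    # Heuristic: a final taa' marbuuta replaced by another letter is an error.
    if standard.endswith(TAA_MARBUUTA) and not bare.endswith(TAA_MARBUUTA):
        return "E"
    # Any remaining form carrying diacritics is treated as a variant.
    if bare != form:
        return "V"
    # A bare form that differs only in hamza is the normalized form.
    if fold_hamza(bare) == fold_hamza(standard):
        return "N"
    return "?"


if __name__ == "__main__":
    standard = "الإسكندرية"
    for form in ["الاسكندرية", "الإسكندرية", "الاسكندريه", "الإسكندريّة"]:
        print(form, classify(standard, form))
```

A rule-based pass like this could only pre-sort candidates; the flags in the database itself rest on proofreading against the rules of hamza orthography, as described above.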
In addition to the above, our database contains many other variants, such as those with partial and full vocalization, covering all actual and potential variants. The full set of Alexandria variants includes 35 entries.