WORLD'S LARGEST DATABASE OF ARAB NAMES

قاعدة بيانات الأسماء العربية

For Immediate Release

The CJK Dictionary Institute, which specializes in the compilation of large-scale CJK and Arabic lexical resources, is pleased to announce the release of a major expansion of our comprehensive Database of Arab Names, referred to as DAN, which now covers about 2.4 million entries based on over 20 million source variants.

DAN covers Arab personal names in both the roman and Arabic scripts and includes numerous orthographic variants and other attributes such as web frequency, name type codes and normalized forms. Based on authoritative linguistic resources, DAN is undergoing a major expansion and extensive proofreading by a team of Arabic native speakers, and is expected to grow substantially in the coming months to cover all the major countries in the Middle East.

Key features of DAN
  • 2.4 million validated Arabic name variants.
  • Ideal for security, anti-money laundering, and NLP applications.
  • Based on over 20,000,000 source names from authoritative resources.
  • Proofread by native editors trained in Arabic phonology.
  • Validated against the web and corpora.
  • Fully vocalized, with multiple variants in Arabic script.
  • Web-based frequency statistics for each name.
  • Various romanization systems, such as the official IC standard.
  • Fully supports OFAC names, their official aliases and unofficial variants.

DAN is playing an important role in helping software developers, especially of security applications and NLP tools, enhance their technology by enabling named entity recognition and extraction, machine translation (MT), variant normalization, and information retrieval (IR) of Arabic names.

Jack Halpern, CEO
Tyler Reid, Project Manager
Aaron Chmielowiec, CTO

What Makes DAN Unique? 

In the interview below, CJKI's CTO Aaron Chmielowiec (AC) interviews Project Manager Tyler Reid (TR) and CEO Jack Halpern (JH) to clarify the special features of DAN that make it unique in the history of Arabic lexicography.

AC Tyler, what is your role in this project?
TR I am truly excited about the project. I am responsible for coordinating the activities of our team of Arabic editors, for ensuring the integrity and consistency of the data, and for developing the tools for processing the data and merging it into our master database.
AC What makes this database special?
TR First, it is of unparalleled size. As of March 31, it contains approximately one and a half million unique validated entries, in addition to over ten million linguistically valid variants drawn from about 50 million potential variants. These are unique first and last names, like Mohamed and ElBaradei, not full names, like Mohamed ElBaradei. Of equal importance are the distinctive features unique to DAN. For example, not only does DAN provide almost exhaustive coverage of romanized variants, but it also provides orthographic variants of the Arabic spelling of each name. Take a look at the variants of Abdul Aziz in Table 1.
JH Let me add that DAN also has both vocalized and unvocalized versions of the Arabic names, and sometimes multiple vocalizations for the same name. Full and accurate diacritics are provided, even such relatively rare ones as alif-wasla and dagger alif. This is not only of academic interest, but is also a practical means to ensure that we can provide romanized versions of great accuracy and variety.
AC Can you briefly describe some of the techniques you use?
TR I would say that the key to the success of our project is our in-depth understanding of the linguistic issues related to the Arabic script and how these affect romanization. For the last four years we have been compiling DAN on the basis of various resources such as websites, corpora, books and dictionaries with the help of a team of native-speaker editors specially trained for this project. In addition, we have a set of tools fine-tuned over the years for processing Arab names, as well as the cooperation of specialists in Arabic information processing. Recently, our CEO Jack Halpern presented a paper at the CAASL2 conference at Stanford in which our tools and methodology are explained.
AC Can you give us an example of a name with many variants?
TR Yes. Take the common name Mahmoud, as shown in Table 2. Our database has 57 romanized spelling variants, including such rare ones as Mechmoud. Some names can have 200 variants or even several hundred in extreme cases. Our database also has numerous variants in the original Arabic script. Jack, can you explain this?
JH Sure. Arabic names are spelled with or without a hamza over the alif, sometimes a shadda appears and sometimes not, sometimes a madda is not written over the alif, and the like. Besides these variants, there are also common errors such as yaa' being replaced by alif maqsuura and taa' marbuuta being replaced by haa'. DAN lists all these variants to ensure maximum recall in information retrieval. In short, DAN provides comprehensive coverage of both Arabic and roman variants as well as common spelling errors, an example of which appears in Table 3.
AC Can you give some examples of how DAN is actually used?
TR DAN is an extremely useful resource in such practical applications as:
  • Information retrieval.
  • Named entity extraction.
  • Machine translation.
  • Automatic transcription.
  • Security applications such as anti-money laundering, watch lists, and identity theft prevention.
AC Any final thoughts?
JH We are truly excited about the future of DAN as it grows to become the world's largest repository of Arabic names with many useful attributes. Some major financial and security organizations as well as software developers are already using it to enhance their Arabic name processing technology. The demand for DAN is growing, and we are redoubling our efforts to make it the best and biggest database of its kind.

Table 1. Variants of Abdul Aziz

Variant | Web Frequency
عبدالعزيز | 11,800,000
عبد ألعزيز | 3,400,000
عبد العزيز | 3,400,000
عبد إلعزيز | 3,400,000
عبد لعزيز | 2,210
عبدلعزيز | 516
عبدألعزيز | 208
عبدإلعزيز | 5

Table 2. Variants of محمود (Mahmoud)

Variant | Web Frequency
Mahmoud | 20,400,000
Mahmud | 5,770,000
Mahmood | 4,050,000
Mahmut | 3,780,000
Mehmood | 685,000
Mahmod | 138,000
Mahamud | 121,000
Machmud | 108,000
Mehmud | 82,400
Mahmoed | 52,100
Mechmod | 48,000
Makhmud | 42,700
Machmut | 6,640
Mehmut | 2,400
Mahmout | 1,280
Machmoud | 1,110
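
As an illustration of how romanized variants like those in Table 2 might support variant normalization and high-recall lookup, here is a minimal Python sketch. It is not DAN's actual schema or API: the names `MAHMOUD_VARIANTS`, `build_variant_index` and `lookup` are hypothetical, and choosing the most frequent variant as the preferred spelling is an assumption made for the example. The frequencies are copied from Table 2.

```python
# Hypothetical sketch: index romanized variants so that any spelling
# resolves to the same Arabic form and a preferred romanization.

ARABIC_MAHMOUD = "\u0645\u062d\u0645\u0648\u062f"  # محمود

# Variant -> web frequency, copied from Table 2.
MAHMOUD_VARIANTS = {
    "Mahmoud": 20_400_000, "Mahmud": 5_770_000, "Mahmood": 4_050_000,
    "Mahmut": 3_780_000, "Mehmood": 685_000, "Mahmod": 138_000,
    "Mahamud": 121_000, "Machmud": 108_000, "Mehmud": 82_400,
    "Mahmoed": 52_100, "Mechmod": 48_000, "Makhmud": 42_700,
    "Machmut": 6_640, "Mehmut": 2_400, "Mahmout": 1_280,
    "Machmoud": 1_110,
}


def build_variant_index(groups):
    """Map each case-folded variant to (arabic_form, preferred_variant).

    `groups` maps an Arabic form to a dict of {variant: frequency}.
    The preferred variant is assumed to be the most frequent one.
    """
    index = {}
    for arabic, variants in groups.items():
        preferred = max(variants, key=variants.get)
        for variant in variants:
            index[variant.casefold()] = (arabic, preferred)
    return index


def lookup(index, name):
    """Return (arabic_form, preferred_variant), or None if unknown."""
    return index.get(name.strip().casefold())


index = build_variant_index({ARABIC_MAHMOUD: MAHMOUD_VARIANTS})
print(lookup(index, "Machmud"))
print(lookup(index, "makhmud"))
```

With an index of this kind, a query for a rare spelling such as Machmud can be matched against records stored under Mahmoud. A production resource would also need to handle variants that are shared by more than one underlying Arabic name, which this sketch does not attempt.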

Table 3. Variants of Alexandria

Variant | Web Frequency
الاسكندرية | 2,930,000
الإسكندرية | 690,000
الاسكندريه | 89,200
الإسكندريّة | 954
الإسكندريه | 897
الاسكندريّة | 245
الاسكندريا | 80
الإسْكَنْدَريَّة | 24
الاسكندريّه | 12
الإسكندريا | 7
الإسكندريّه | 5
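
The Arabic-script variation described in the interview (hamza on or off the alif, optional shadda and vowel marks, alif maqsuura for yaa', haa' for taa' marbuuta) can be illustrated with a small folding function that maps such spellings to a common matching key. This is only a sketch under stated assumptions, not the method used to build DAN: the function names `match_key` and `group_variants` and the particular folding rules are hypothetical, and a curated variant list such as DAN's can make distinctions that blanket folding erases.

```python
import unicodedata
from collections import defaultdict

# Hypothetical folding table for the letter-level variation named above.
_FOLD = str.maketrans({
    "\u0622": "\u0627",  # alif with madda       -> bare alif
    "\u0623": "\u0627",  # alif with hamza above -> bare alif
    "\u0625": "\u0627",  # alif with hamza below -> bare alif
    "\u0671": "\u0627",  # alif wasla            -> bare alif
    "\u0649": "\u064a",  # alif maqsuura         -> yaa'
    "\u0629": "\u0647",  # taa' marbuuta         -> haa'
    "\u0640": None,      # tatweel (elongation)  -> removed
})


def match_key(name, ignore_spaces=True):
    """Fold an Arabic-script name to a key for variant matching."""
    # Drop combining marks: short vowels, shadda, sukun, dagger alif.
    stripped = "".join(
        ch for ch in name if unicodedata.category(ch) != "Mn"
    )
    folded = stripped.translate(_FOLD)
    if ignore_spaces:
        # Treats spaced and unspaced spellings (as in Table 1) alike.
        folded = "".join(folded.split())
    return folded


def group_variants(names):
    """Group spellings that share the same matching key."""
    groups = defaultdict(list)
    for name in names:
        groups[match_key(name)].append(name)
    return dict(groups)


# A few spellings of Alexandria in the style of Table 3.
sample = ["الاسكندرية", "الإسكندرية", "الاسكندريه", "الإسكندريّة"]
for key, members in group_variants(sample).items():
    print(key, len(members))
```

Under these rules, spellings that differ only in hamza, diacritics, or the final taa' marbuuta versus haa' fall into one group, while forms ending in alif (such as الاسكندريا) remain separate. Folding of this kind trades precision for recall, since it can also merge spellings that belong to genuinely different names.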

Arabic Resources

Various other Arabic resources developed by our institute are described at: http://www.cjk.org/cjk/arabic/arabsam.htm

About CJKI

CJKI has become one of the world's prime sources of CJK lexical resources, and is contributing to CJK information processing technology by providing high-quality lexical resources and consulting services to some of the world's leading software developers and IT companies, including Fujitsu, Sony, Google, Microsoft, Yahoo and Amazon.

Press Contact

Jack Halpern
The CJK Dictionary Institute