Comprehensive Database of Chinese Name Variants
The Problem of Name Variants
The number of personal names and their variants is probably in the billions. The number of place names is also large, but they have fewer variants. Identifying names and their variants is a difficult computational linguistic task. Named Entity Recognition (NER) is a hot topic in computational linguistics and plays an important role in many IT applications.
To enhance this technology, CJKI maintains comprehensive databases of several million proper nouns, especially of Japanese names, Chinese names, and Arabic names. This document describes some issues of Chinese name variation and provides samples of our Chinese name variants resources. For reference, see also The Role of Lexical Resources in CJK NLP Applications and Named Entity Contextual Clues.
Currently our databases contain over 1,650,000 Chinese seed names (surnames and given names) and approximately eight million romanized variants for these names.
Identifying, processing and normalizing names and their numerous variants are useful in a variety of applications, including:
- Anti-money-laundering by financial institutions.
- Security applications such as identifying suspected name variants of terrorists and criminals.
- Query processing by search engines.
- Immigration control systems.
- Improving the accuracy of machine translation.
- Entity and information extraction.
- Segmentation and morphological analysis of CJK languages.
Large databases of name variants play a critical role in such applications. CJKI maintains databases of several million names and name variants in all major and most minor romanization systems for Chinese, Japanese and Korean, including the major Chinese dialects, as well as for Arabic and Spanish.
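As a rough illustration of how a variant database can support applications such as query processing, the sketch below builds a tiny in-memory index from romanized variants to seed names. The `VariantIndex` class and `lookup_key` function are hypothetical names invented for this example and do not describe CJKI's actual data format or software. The sample entry uses the romanizations of 驰骏 listed in this document.

```python
import unicodedata
from collections import defaultdict


def lookup_key(name: str) -> str:
    """Fold case and Unicode form so equivalent spellings compare equal."""
    return unicodedata.normalize("NFKC", name).casefold()


class VariantIndex:
    """Hypothetical index: romanized variant -> seed name(s) it may represent."""

    def __init__(self) -> None:
        self._by_variant = defaultdict(set)

    def add(self, seed: str, variants: list[str]) -> None:
        for variant in variants:
            self._by_variant[lookup_key(variant)].add(seed)

    def lookup(self, query: str) -> set[str]:
        # .get avoids creating empty entries for unknown queries.
        return set(self._by_variant.get(lookup_key(query), set()))


index = VariantIndex()
index.add(
    "驰骏",
    ["Chíjùn", "Chijun", "Ch'ihchün", "Chrjyun", "Chihjyun", "Chrjiun", "Chyrjiunn"],
)

print(index.lookup("CHIJUN"))
print(index.lookup("ch'ihchün"))
print(index.lookup("Smith"))
```

Each variant maps to a set of seed names rather than a single one, because one romanized spelling can correspond to several different names written in Chinese characters.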
Chinese Name Variants
Chinese names can be spelled in a bewildering variety of ways. Our databases of Chinese names and non-Chinese proper nouns in both Simplified and Traditional Chinese, including romanized variants, contain nearly two million entries. There are several well-established systems for romanizing/transcribing Chinese, as well as various popular ones and many older ones that have fallen out of use. The principal systems and some of lesser importance are described below:
|Hanzi||驰骏||Given name written in Simplified Chinese characters. (Wikipedia article)|
|Hanyu Pinyin||Chíjùn||Usually referred to as pinyin, this is the official and most widely used Mandarin romanization system. It was adopted by the PRC in 1958 and has become an ISO standard, ISO 7098:1991. (Wikipedia article)|
|English||Chijun||Standard English spelling follows Hanyu Pinyin but omits the tone marks. (Wikipedia article)|
|Wade-Giles||Ch'ihchün||Introduced by Thomas Wade in the 19th century, this was the most widely used system for most of the 20th century, until Hanyu Pinyin became widespread, and it is still important today. (Wikipedia article)|
|Yale System||Chrjyun||Developed by Yale University in the 1950s and 1960s to facilitate teaching Chinese to Americans, the Yale system is of limited use now but does appear in some dictionaries and textbooks. (Wikipedia article)|
|Tongyong Pinyin||Chihjyun||The official romanization system adopted by the government of Taiwan in 2000 to replace the MPS II system. (Wikipedia article)|
|MPS II||Chrjiun||Formerly officially used in Taiwan to replace Gwoyeu Romatzyh, this system never gained much popularity outside of government publications and was replaced in 2000 by Tongyong Pinyin. (Wikipedia article)|
|Gwoyeu Romatzyh||Chyrjiunn||Formerly officially used in Taiwan, this system uses complex rules to distinguish tones without diacritics. Developed by Y. R. Chao and proclaimed in 1926, this system was officially replaced by MPS II in 1986. (Wikipedia article)|
|Zhuyin Fuhao||ㄔˊㄐㄩㄣˋ||Also called Bopomofo, this is the standard phonemic transcription system used in Taiwan (and formerly in the PRC) for education and input methods. Though not a romanization system, it is given here as it is of major importance in transcribing Chinese. (Wikipedia article)|
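The relationship between the Hanyu Pinyin and English rows can be sketched in code: the English spelling is the pinyin form with its tone marks removed. The function below is a minimal sketch, not a description of how CJKI derives its data. The name `strip_tone_marks` is hypothetical, and keeping the diaeresis on ü (which is part of the syllable, not a tone mark) is an assumption of this example.

```python
import unicodedata

# Combining marks used for the four Mandarin tones in Hanyu Pinyin:
# macron (1st), acute (2nd), caron (3rd), grave (4th).
TONE_MARKS = {"\u0304", "\u0301", "\u030c", "\u0300"}


def strip_tone_marks(pinyin: str) -> str:
    """Remove tone marks only; other diacritics such as the diaeresis stay."""
    decomposed = unicodedata.normalize("NFD", pinyin)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


print(strip_tone_marks("Chíjùn"))  # Chijun
```

Decomposing to NFD first separates each base letter from its combining marks, so the tone marks can be dropped without a hand-written table of accented vowels.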
Our name variants databases provide comprehensive coverage for the major Chinese romanization systems and their variants. Two other systems also covered by CJKI's variant databases are EFEO, which was developed in the 19th century by the École française d'Extrême-Orient and is still in use in France, and Lessing-Othmer, which is used in Germany and is based on German orthography.
Other systems such as MPS II are not currently supported because of their relative rarity (they are no longer official in Taiwan). There are various other systems, such as the ALA-LC system used by the Library of Congress, which is essentially identical to Hanyu Pinyin except for the omission of tones. Dozens of other systems, such as Latinxua Sinwenz (拉丁化新文字; also known as "Sin Wenz"), developed by Qu Qiubai in the 1920s, have been used over the last few centuries to romanize or cyrillicize Chinese, but they are of little or no importance in the romanization of Chinese names today. Some of these are discussed here.
The table below shows examples of romanized Chinese names in the principal systems covered in our databases. Only the standard form is shown under the column for each system. Variants of each of these systems, such as forms without apostrophes and variant spellings, are given in the "Variants" column.
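To make the idea of variants "without apostrophes" concrete, here is a minimal sketch that derives simple orthographic variants from a standard form by dropping apostrophes, diacritics, or both. The function names and the set of apostrophe-like characters are assumptions made for this example. Real variant data also includes variant spellings that cannot be derived by such mechanical rules.

```python
import unicodedata

# Apostrophe-like characters that may mark aspiration, e.g. in Wade-Giles
# (an assumed, non-exhaustive set).
APOSTROPHES = ("'", "\u2019", "\u02bb")


def without_apostrophes(form: str) -> str:
    for mark in APOSTROPHES:
        form = form.replace(mark, "")
    return form


def without_diacritics(form: str) -> str:
    # Decompose, drop all nonspacing combining marks, then recompose.
    decomposed = unicodedata.normalize("NFD", form)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


def simple_variants(standard_form: str) -> set[str]:
    """Return the standard form plus its mechanically simplified variants."""
    stripped = without_apostrophes(standard_form)
    return {
        standard_form,
        stripped,
        without_diacritics(standard_form),
        without_diacritics(stripped),
    }


print(sorted(simple_variants("Ch'ihchün")))
```

For the Wade-Giles form Ch'ihchün this yields four spellings: the standard form, the form without the apostrophe, the form without the diaeresis, and the form with neither.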
Variants Based on Chinese Dialects
Chinese has seven major dialect groups and another four minor ones, with at least 95 subdialects. The romanization of Chinese names based on the various dialects is often radically different from Mandarin. Our databases of Chinese name variants cover some of the major Chinese dialects, including Cantonese, Hakka and Hokkien, as well as multilingual equivalents including Traditional Chinese, Japanese, Korean and Vietnamese.
|KOREAN||Yejing||MOE (Korean Ministry of Education Romanization)|
|KOREAN||Yejing||NRS (New Romanization System)|
|KOREAN||Yejing||KLS (Korean Language Society Romanization)|
|KOREAN||Yecing||ISO DPRK (Used in North Korea)|
|KOREAN||Yejing||ISO ROK (Used in South Korea)|
|CANTONESE||Yipging||LAU (Sidney Lau)|
|CANTONESE||Yipking||CPR (Cantonese Popular Reading)|