Chinese Morphological Database

CMD

汉语词缀数据库



The CJK Dictionary Institute maintains a comprehensive lexical database of about three million Simplified and Traditional Chinese entries covering a broad spectrum of fields, including proper nouns, technical terminology and general vocabulary. This is a sample of our Chinese Morphological Database, a database of Chinese derivational affixes with adjacency attributes.

A derivational affix (DA) is a bound morpheme (though some also function as free forms) prefixed or suffixed to a base to create new words. In traditional morphology, DAs do not have lexical meanings of their own, and only add grammatical meanings. Here, we also include "lexical affixes" -- compound-forming word elements that have a substantial lexical meaning of their own. Identifying DAs is very useful in NLP, IME and information retrieval applications, because it significantly improves the accuracy of algorithmically identifying the countless lexemes that are not registered in the lexicon.
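
As a rough illustration of why this helps with unregistered lexemes, the sketch below falls back on affix stripping when a word is not found in the lexicon. The lexicon, the affix sets and the function name `analyze` are hypothetical toy values chosen for this example; they are not part of the database or of any published API.

```python
from typing import Optional, Tuple

# Toy data for illustration only (assumed, not taken from the database).
LEXICON = {"独轮车", "文盲"}
SUFFIXES = {"迷"}
PREFIXES = {"半"}


def analyze(word: str) -> Optional[Tuple[str, str, str]]:
    """Return (kind, base, affix), or None if no analysis is found."""
    if word in LEXICON:
        return ("listed", word, "")
    # Try to strip a known suffix and look the remaining base up.
    for suffix in SUFFIXES:
        base = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and base in LEXICON:
            return ("suffixed", base, suffix)
    # Try to strip a known prefix and look the remaining base up.
    for prefix in PREFIXES:
        base = word[len(prefix):]
        if word.startswith(prefix) and base in LEXICON:
            return ("prefixed", base, prefix)
    return None
```

Under these assumptions, `analyze("独轮车迷")` recovers the base 独轮车 and the suffix 迷 even though 独轮车迷 itself is not listed.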

An important principle in our criteria for selecting an affix is its ability to combine with a base consisting of two or more characters, as when 迷 'fan' combines with 独轮车 'unicycle' to produce 独轮车迷 'unicycle fan'. If an affix combines only with single-character bases, it is excluded because of the danger of confusing it with two-character compounds in which it does not function as an affix, as in 入迷 'be fascinated', or with a coincidental juxtaposition of a free form.
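
The selection principle above can be sketched as a simple filter. This is only an approximation under a strong assumption: the caller supplies words in which the candidate really is attached to a base. The function name `qualifies_as_affix` is hypothetical, and the actual selection criteria involve more than this length check.

```python
from typing import Iterable


def qualifies_as_affix(affix: str, attested_words: Iterable[str]) -> bool:
    """True if the affix attaches to at least one base of two or more characters."""
    for word in attested_words:
        if word.endswith(affix):
            base = word[: len(word) - len(affix)]  # candidate used as a suffix
        elif word.startswith(affix):
            base = word[len(affix):]               # candidate used as a prefix
        else:
            continue
        if len(base) >= 2:
            return True
    return False
```

With only 入迷 as evidence, 迷 would be rejected; adding 独轮车迷 makes it qualify.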

An adjacency attribute is a part of speech (POS) code that indicates the morphological restrictions that apply to adjacent words or DAs when these are actually used in the formation of compound words or affixed lexemes. Adjacency attributes help programs identify DAs with greater reliability, especially in systems that fully support POS-tagging. For more details, see japaffix.htm.



Adjacency Attribute Fields
TYPE
[A1] Productive derivational suffix -- always bound
[A2] Productive derivational suffix -- almost always bound
[A3] Productive derivational suffix -- sometimes bound
[B1] Historically productive derivational prefix -- always bound
[B2] Historically productive derivational prefix -- sometimes bound
POS Part of speech code. For details, see chinpos.htm.
BEFORE POS of lexeme or base preceding a suffix, e.g. "NC" for the suffix 迷 'fan' means that 迷 can be preceded by a common noun, as 独轮车 'unicycle', to produce 独轮车迷 'unicycle fan'.
AFTER POS of lexeme or base following a prefix, e.g. "NC" for the prefix 半 'semi-' means that 半 can be followed by a common noun, as 文盲 'illiterate', to produce 半文盲 'semi-illiterate'.
RESULT The POS of the lexeme resulting from affixing a prefix or suffix. For example, "NC" for 独轮车迷 'unicycle fan' means that 独轮车迷 is a common noun.
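
To make the BEFORE, AFTER and RESULT fields concrete, here is a minimal sketch of how a program might consult them when building a derived form. The `AffixEntry` class and the `derive` function are hypothetical names for this example. The values for 迷 follow the descriptions above; the RESULT value for 半 is an assumption made for illustration, and a single RESULT code per entry is a simplification.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class AffixEntry:
    affix: str
    before: FrozenSet[str]  # POS codes allowed for a base preceding a suffix
    after: FrozenSet[str]   # POS codes allowed for a base following a prefix
    result: str             # POS of the derived lexeme (one code, for simplicity)


MI = AffixEntry("迷", before=frozenset({"NC"}), after=frozenset(), result="NC")
# RESULT for 半 is assumed here, not taken from the field descriptions.
BAN = AffixEntry("半", before=frozenset(), after=frozenset({"NC"}), result="NC")


def derive(entry: AffixEntry, base: str, base_pos: str) -> Optional[Tuple[str, str]]:
    """Return (derived form, POS) if the adjacency attributes permit the base."""
    if base_pos in entry.before:   # base precedes the affix: suffixation
        return (base + entry.affix, entry.result)
    if base_pos in entry.after:    # base follows the affix: prefixation
        return (entry.affix + base, entry.result)
    return None
```

A base whose POS is not listed in either field is rejected, which is how the attributes cut down on false affix matches.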


SC ID       SC Affix  TC Affix  POS Code  TYPE Code  Pinyin  Before After Result
S0007529A                       WS        A3         xian4   NP NP
S0009543A                       WS        A3         tuan2   NC V  NC
S0010532B                       WS        A3         chu4    NC V NC
S0010875Aa                      WS        A3         tou0    NC NC
S0015201Ad                      WP        A2         zong3    NC VNC V
S0034279A                       WS        A3         jie2    NC V NP NC
S0047893A                       WS        A3         zhen4   NP NP
S0061252Aa                      WS        A2         yan2    NC NC
S0064269A                       WS        A3         hua4    NC V A D NC V A
S0070103Ad                      WS        A1         ji1     V NC A NC
S0072424Ab                      WS        A3         gui3    A NC V NC
S0078485Aa                      WS        A1         xing2   NC V A NP NC
S0084233Ah                      WP        A3         hao3     NC ANC A
S0084666A                       WS        A3         gong1   V NC NC
S0098752Aa                      WS        A2         zhe3    NC V A NC
S0096010Ad                      WS        A3         shou3   NC V NC
S0101751Aa                      WS        A2         suo3    NC V NA NC
S0106449Ab                      WS        A3         xin1    NC V A NC
S0112789Ab                      WS        A3         xing4   A NC V NC
S0112870Ag                      WS        A3         sheng1  NC V A NC
S0123643A                       WS        A3         zu2     NP NC V A NC
S0121387Ah                      WP        A3         duo1     NCNC
S0119011A                       WP        A3         da4      NC VNC V A D
S0120518Ag                      WP        A1         di4      NNNC
S0128279Ad                      WP        A3         chao1    NC ANC A D
S0138060A                       WS        A3         pai4    NP NC A V A NC
S0142229Af                      WP        A2         ban4     NC V ANC V A
S0142513Ad                      WP        A3         fan3     NC A VNC V A
S0143043A                       WS        A2         fan4    V NC A NC
S0141475Ad                      WP        A1         wei1     NC NM VNC
S0144731Ae                      WS        A3         pin3    V NC A NC
S0148106A                       WS        A3         bu4     NC V NC
S0148384Ae                      WP        A2         fu4      NCNC
S0157840A                       WS        A3         mi2     NC V NC
S0164882Aa                      WS        A2         lv4     V NC A NC
S0165711Af                      WP        A3         lao3     NC A VNC
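
For readers who want to load the sample rows programmatically, the following sketch splits a row into its leading fields. The field shapes are assumptions inferred from the sample rows only: an ID of "S", seven digits, an uppercase letter and an optional lowercase letter; a POS code of WS or WP; a TYPE code of A or B plus a digit; and pinyin with a tone digit. Whitespace between fields is optional. The Before, After and Result codes are returned as one unsplit string, because their boundaries cannot be recovered reliably from this layout. The names `ROW_PATTERN` and `parse_row` are hypothetical.

```python
import re
from typing import Dict, Optional

# Field shapes below are inferred from the sample rows (an assumption).
ROW_PATTERN = re.compile(
    r"^(S\d{7}[A-Z][a-z]?)"  # SC ID
    r"\s*(W[SP])"            # POS code
    r"\s*([AB]\d)"           # TYPE code
    r"\s*([a-z]+\d)"         # pinyin with tone digit
    r"(.*)$"                 # Before / After / Result, left unsplit
)


def parse_row(line: str) -> Optional[Dict[str, str]]:
    """Split one table row into fields; return None for non-data lines."""
    match = ROW_PATTERN.match(line.rstrip("\n"))
    if match is None:
        return None
    sc_id, pos, type_code, pinyin, adjacency = match.groups()
    return {
        "id": sc_id,
        "pos": pos,
        "type": type_code,
        "pinyin": pinyin,
        "adjacency": adjacency,
    }
```

The header line and blank lines do not match the pattern, so they come back as `None` and can be skipped.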