Linguistic and Technical Documents
This page brings together some linguistic and technical documents written by Jack
Halpern, aimed at introducing the CJK languages, with emphasis on the
linguistic issues to be addressed in developing CJK linguistic tools.
News Flash
- Presented at LREC 2008 (May/June, 2008):
Exploiting Lexical Resources for Disambiguating CJK and Arabic Orthographic Variants (.pdf file, 433K) This paper analyzes the principal linguistic issues of Arabic and CJK orthographic variation and argues that linguistic knowledge supported by large-scale lexical databases is essential for accurate disambiguation.
- Presented at CAASL2 (July, 2007):
The Challenges and Pitfalls of Arabic Romanization and Arabization (.pdf file, 293K) was presented at the Second Workshop on Computational Approaches to Arabic Script-based Languages (CAASL2), held at Stanford University. This paper focuses on the linguistic issues encountered in developing systems for the automatic romanization of Arabic names and the arabization of non-Arabic names, including systems that can arabize CJK names directly.
Japanese Information Processing
-
The Role of Lexical Resources in CJK Natural Language Processing (.pdf file, 358K) A linguistic description of the principal challenges to be overcome by developers of CJK NLP applications, this paper was presented at workshops of COLING/ACL 2006 in Sydney as well as other conferences. It appears in various proceedings and journals, such as Lecture Notes in Computer Science.
-
The Challenges of Japanese Speech Technology A linguistic description of the principal challenges to be overcome by developers of Japanese speech technology and the role of phonological
databases.
-
Lexicon-based Orthographic Disambiguation in CJK
Intelligent Information Retrieval
Presented at
COLING 2002 (Taipei, August 2002), this paper analyzes the
linguistic issues of CJK orthographic variation, including Japanese,
and discusses why lexical databases should play a central role in NLP.
-
The Challenges of Intelligent Japanese
Searching
This paper analyzes in detail the linguistic issues related to
orthographic variation in Japanese, and discusses advanced
information retrieval technologies such as cross-script and
cross-orthographic searching for use in intelligent IR.
-
Orthographic Variation in Japanese
The highly irregular orthography and morphological complexity of
Japanese pose formidable challenges to software developers. This
report focuses on orthographic variation and analyzes the linguistic
issues in developing Japanese linguistic tools.
-
The Complexities of Japanese Homophones
Explains the subtle distinctions between the numerous homophones in
Japanese, and shows why homophone processing deserves special
attention in Japanese information retrieval.
-
Cross-Synonym and Cross-Language Searching in Japanese
Describes the linguistic issues to be addressed by advanced Japanese
information retrieval technologies, focusing on cross-language
and cross-synonym searching.
-
Morphological Attributes in Japanese
Describes the derivational affixes and binding valency
in our Japanese lexical database, particularly useful for
disambiguating Japanese lexemes in such applications
as search engine query processing.
-
Japanese Lexical Resources.
A guide to our comprehensive Japanese lexical databases
consisting of over three million entries and other resources.
The Japanese Language
-
Outline of Japanese Writing System
A fairly detailed introduction to the Japanese writing system,
including the birth of the Chinese characters, the function of
kanji in Japanese, and a description of the various scripts used in
Japanese.
-
Building a Comprehensive Chinese Character Database
Presented at Euralex '94, an international congress on lexicography in
Amsterdam, this paper describes how we began to develop DESK, our comprehensive
CJK lexical databases, on the basis of the
New Japanese-English Character Dictionary.
-
Kana and Romanization
A detailed introduction to the hiragana, katakana, and romaji scripts,
which together with kanji constitute the complex Japanese writing
system.
-
A Brief Introduction to Japanese Morphology
Describes the principal word-formation processes in Japanese, with
special emphasis on the function of kanji as word elements and bound
affixes.
Chinese Information Processing
-
The Role of Lexical Resources in CJK Natural Language Processing (.pdf file, 358K) A linguistic description of the principal challenges to be overcome by developers of Chinese NLP applications.
-
Lexicon-based Orthographic Disambiguation in CJK
Intelligent Information Retrieval
This paper analyzes the
linguistic issues of CJK orthographic variation, and discusses why
lexical databases should play a central role in disambiguation.
-
The Pitfalls and Complexities of Chinese to Chinese Conversion
Presented at several international conferences, this academic paper
presents an in-depth analysis of the linguistic and technical issues
related to converting Simplified Chinese to/from Traditional Chinese.
-
Orthographic Variation in Chinese
This report focuses on the complexities of orthographic variation
in Chinese, analyzes the linguistic issues in developing Chinese
linguistic tools, and describes the major differences between
Traditional and Simplified Chinese.
-
Variation in Traditional Chinese Orthography
Traditional Chinese does not have a stable orthography. This short
document describes the various types of character form variants and how
they relate to each other.
-
Chinese Lexical Resources.
A guide to our comprehensive Chinese lexical database
consisting of about three million Simplified and Traditional Chinese entries
and other resources.
Korean Information Processing
-
Lexicon-based Orthographic Disambiguation in CJK
Intelligent Information Retrieval
This paper analyzes the linguistic issues of CJK orthographic variation,
including Korean, and discusses why lexical databases should play a central role
in NLP.
- Orthographic Variation in Korean
This report focuses on Korean orthographic variation and analyzes the
linguistic issues to be addressed when developing Korean linguistic
tools, especially intelligent information retrieval tools.
-
Korean Lexical Resources.
A guide to our Korean lexical database and other resources.
Other languages
-
The Challenges and Pitfalls of Arabic Romanization and Arabization (.pdf file, 293K) This paper focuses on the linguistic issues encountered in developing unique systems for the automatic romanization of Arabic names and the
arabization of non-Arabic names that can arabize CJK names directly.
-
Is English Segmentation Trivial?
Describes the principal word-formation processes in English, and
demonstrates that word segmentation in English, contrary to popular
belief, is far from trivial.
-
Criteria for Inclusion of Multiword Lexical Units in Dictionaries
Coming Soon.
-
European and Semitic languages
Coming Soon. A series of reports describing the features of the major
European and Semitic languages, focusing on orthographic variation, and
describing the linguistic issues to be addressed in developing linguistic
tools.