BUILDING A COMPREHENSIVE CHINESE CHARACTER DATABASE

Summary of paper presented at Euralex '94, an international congress for lexicographers held at the Free University Amsterdam from August 30 to September 3, 1994, plus appendixes for reference.

Showa Women's University, Institute of Modern Culture
KANJI DICTIONARY PUBLISHING SOCIETY
漢 英 字 典 刊 行 会
1-3-502 3-Chome Niiza
Niiza-shi, Saitama 352 JAPAN
PHONE: +81-48-481-3103 FAX: +81-48-479-1323

JACK HALPERN
Research Fellow at Institute of Modern Culture
Editor in Chief of Kanji Integrated Tools Project
Editor in Chief of New Japanese-English Character Dictionary

MASAAKI NOMURA
Professor of Japanese
Center for Japanese Language,
Waseda University

ATSUSHI FUKADA
Assistant Professor of Applied Linguistics
Center for Linguistic and Cultural Research,
Nagoya University

Index

A B S T R A C T
1. BACKGROUND
2. PROJECT AIMS
3. PROJECT OUTLINE
4. LEXICAL SEMANTICS AND COMBINATORICS
5. COMPUTATIONAL LEXICOGRAPHY
6. DATA AND CODE CONVERSION
7. SYSTEM ANALYSIS AND DATABASE DESIGN
8. DEVELOPMENT OF DATABASE SYSTEM
9. DEVELOPMENT OF KIT APPLICATIONS
REFERENCES
APPENDIX A
APPENDIX B
APPENDIX C

A B S T R A C T

The New Japanese-English Character Dictionary (NJECD) was designed to provide an in-depth understanding of how kanji are used in contemporary Japanese. One aim of this project is to use NJECD to build a comprehensive database with detailed information on how Chinese characters are used in Chinese, Japanese and Korean, including printed/calligraphic forms, in-depth semantics, phonemics, encoding methods, indexing schemes, synonyms, homophones, and voluminous reference data. A second aim is to use this database to compile about forty applications and spinoff products for pedagogical and research purposes, including learner's dictionaries, reference manuals, and CALL software, by integrating lexical semantics and combinatorics with computational lexicography.

1. BACKGROUND

Although Japanese has been the subject of various linguistic studies, little attention has been given to the systematic analysis of its writing system. Kanji (Chinese characters as used in Japanese) are combined with each other to generate countless compound words, and function as a network of interrelated parts. Though this is vaguely recognized by educators, it has been largely disregarded in the compilation of character dictionaries. The demand for effective tools for mastering the Japanese script has been growing at an unprecedented pace. Learners are in urgent need of dictionaries that systematically address the special problems of non-Japanese students.

The New Japanese-English Character Dictionary (NJECD) (Halpern 1990, 1993) was compiled with the aim of creating a lookup tool that provides an in-depth understanding of the meanings and functions of high-frequency characters in contemporary Japanese. The dictionary departs from traditional kanji lexicography in several ways: (1) the *core meaning* defines the dominant character sense; (2) detailed meanings show how single-character morphemes generate numerous compounds; (3) psychologistic ordering reveals the logical/hierarchical interrelatedness between senses; (4) the System of Kanji Indexing by Patterns (SKIP) offers a new method for rapid retrieval of entries; and (5) precise distinctions are drawn between synonyms, homophones, and orthographic variants (for further details, see Halpern 1990, EURALEX '90 Proceedings).

2. PROJECT AIMS

This project aims to contribute to Sino-Japanese studies in general, and to Japanese language studies in particular, in the following four areas:

  1. To use NJECD as a basis for creating a comprehensive kanji information database system, which will be referred to as DESK (Database System for Kanji). This database contains detailed information on the use of Chinese characters in Chinese, Japanese and Korean (CJK languages).
  2. To use DESK as a basis for compiling about forty applications and spinoff products for pedagogical and research purposes.
  3. To provide a comprehensive source of reference data on Chinese characters for pedagogical, linguistic and lexicological research. Some of these data will be made available on the Internet, with certain restrictions to avoid copyright violations.
  4. To promote basic research on computational lexicography by establishing methodology for building integrated dictionary databases, especially multilingual databases for storing lexicographic data in a CJK environment.

3. PROJECT OUTLINE

To achieve these aims, the Kanji Dictionary Publishing Society was established in late 1993 as a part of the Institute of Modern Culture at Showa Women's University. The Society is directed by the Editorial Committee, which consists of renowned experts in Japanese linguistics, and is financed by the University and various foundations (1994 budget about US$250,000).

The DESK database is being used for compiling about forty computer-edited applications and spinoff products, including teaching and learning aids such as learner's dictionaries and reference manuals, foreign-language editions such as a German edition of NJECD, software packages such as CAI/CAL courseware, electronic books and learning machines, and so on. This series of products will be referred to as KIT, which stands for Kanji Integrated Tools.

During the initial phase of the project, which will be completed in mid-1994, the framework and principal components of DESK will be created, and the electronic book (EB) edition of NJECD will be published. Concurrently, a pilot system for a pocket edition of NJECD is being built; it will also be completed in mid-1994.

The following KIT applications will be either published or finalized for publication over a period of two to three years:

  1. New Japanese-English Character Dictionary: Electronic Book Edition
  2. New Kanji-English Pocket Dictionary
  3. New Kanji-English Learner's Dictionary
  4. Kanji Input System Based on System of Kanji Indexing by Patterns
  5. Comparative Study of Sino-Japanese Lexical Items
  6. Kanji Cards
  7. Japanese-English Dictionary of Kanji Synonyms
  8. Japanese-English Dictionary of Kanji Usage

The EB edition of NJECD is scheduled for publication in the summer of 1994 in time for presentation at Euralex '94. This is the first kanji-English dictionary based on CD-ROM technology. It incorporates all the features of NJECD, including core meanings, independent words, homophone/synonym discrimination, compounds, radicals, a kanji thesaurus, and much more. A hierarchical menu system enables the user to retrieve information easily by specifying single or multiple keywords, such as readings, radicals, core meanings, SKIP patterns and stroke counts, in normal or word-end searches. This, combined with a comprehensive cross-reference network, provides the user with multiple search paths to access information with maximum speed and facility.
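The kind of multi-key retrieval described above can be illustrated with a minimal Python sketch. It is not the EB edition's actual search engine: the `Entry` record, the `search` function and the three sample records are assumptions made for illustration, and only readings and stroke counts are modelled.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Entry:
    kanji: str
    readings: Tuple[str, ...]  # romanized readings
    strokes: int


# Illustrative records only; they are not taken from the NJECD data files.
ENTRIES = [
    Entry("暖", ("dan", "atatakai"), 13),
    Entry("温", ("on", "atatakai"), 12),
    Entry("謡", ("yoo", "utai"), 16),
]


def search(entries, reading=None, word_end=False, strokes=None):
    """Return the kanji that satisfy every criterion that is not None.

    With word_end=True the reading key matches the end of a reading
    instead of the whole reading.
    """
    hits = []
    for e in entries:
        if reading is not None:
            if word_end:
                ok = any(r.endswith(reading) for r in e.readings)
            else:
                ok = reading in e.readings
            if not ok:
                continue
        if strokes is not None and e.strokes != strokes:
            continue
        hits.append(e.kanji)
    return hits
```

Combining keys narrows the result: `search(ENTRIES, reading="atatakai")` finds both 暖 and 温, while adding `strokes=13` leaves only 暖.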

4. LEXICAL SEMANTICS AND COMBINATORICS

The principal semantic component of DESK was compiled by submitting single-character morphemes to an exhaustive semantic analysis. The meanings were analyzed by such techniques as componential analysis and an in-depth examination of the differences and similarities between near-synonyms, which served as a powerful technique for establishing precise character meanings.

Each meaning was analyzed into its single senses, and its relationships to other members of the same synonym group were examined and compared. That is, the denotation, connotation, and range of application of each sense were carefully studied in contrast with those of their near-synonym counterparts, with emphasis on how the single senses of word-forming elements are influenced not only by normal syntagmatic relations, but also by often subtle semantic/functional distinctions dependent on the morphophonemic context. For example, whereas the Chinese-derived (*on*) bound morpheme 謡 yoo means 'popular song' in such compounds as 民謡 minyoo 'folk song', the native Japanese (*kun*) form 謡 utai refers to the chanting of a noh text.

5. COMPUTATIONAL LEXICOGRAPHY

Although every phase of the compilation and editing of NJECD was computerized, we faced great difficulties in the initial stages. MS-DOS and database management systems were not yet in widespread use, and the level of PC technology was hardly up to the task. Nevertheless, the lack of funds and technical expertise led us to select Fujitsu's FACOM-9450 series, the most advanced PC on the market at the time, rather than mini-computers.

To compile, process, and proofread the data for NJECD, we wrote about 700 programs in BASIC and used spreadsheets and other software packages from the mid-eighties, and had to resort to a series of ingenious tricks to force the hardware and software to perform tasks they were not designed for. An inevitable consequence of this was data files of complex structure, quite unlike the logically organized relational database files of today.

To produce KIT applications in a short period with maximum efficiency, it was essential to integrate state-of-the-art computer technology with such disciplines as computational lexicography and lexical semantics to restructure the data into a rationally-organized database system (DESK), and to write software for developing applications drawing data from the database. The work of building the database and application development is outlined below.

6. DATA AND CODE CONVERSION

The character set of the computers used to compile NJECD, Fujitsu's now obsolete FACOM-9450 series, supported only Level 1 characters of JIS C 6226-1978. Since hundreds of characters were missing from this set, we were forced to customize it by creating hundreds of user-defined characters and remapping hundreds of JIS Level 2 characters to JIS Level 1 codes. This resulted in a character set basically incompatible with current character set standards, national or corporate.

To ensure easy portability to a wide range of hardware and software platforms, we converted the data to the Shift-JIS code system and updated it to JIS X 0208-1990. In addition, we restored the remapped codes and either recreated or remapped user-defined characters not present in JIS X 0208-1990, if necessary by mapping into the supplemental character set JIS X 0212-1990, or the ISO 10646/Unicode character set, in that order. This approach, although complex, yielded excellent results by keeping user-defined characters to a bare minimum and ensuring maximum portability. It was suggested by Ken Lunde, an expert on Japanese encoding methods, who has written a definitive work on the subject (Lunde 1993).
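The fallback order described above (JIS X 0208 first, then JIS X 0212, then ISO 10646/Unicode) can be sketched in Python. This is not the project's conversion software: the `classify` function is a hypothetical helper, and it assumes that Python's built-in `shift_jis` codec stands in for JIS X 0208 and that the `euc_jp` codec marks JIS X 0212 characters with the lead byte 0x8F.

```python
def classify(ch: str) -> str:
    """Name the first character set, in the fallback order, that contains ch.

    Intended for single kanji. Python's shift_jis codec also accepts ASCII
    and half-width katakana, which are reported here as "JIS X 0208" too.
    """
    try:
        ch.encode("shift_jis")  # succeeds for JIS X 0208 characters
        return "JIS X 0208"
    except UnicodeEncodeError:
        pass
    try:
        # euc_jp writes a JIS X 0212 character as 0x8F plus two bytes
        if ch.encode("euc_jp")[0] == 0x8F:
            return "JIS X 0212"
    except UnicodeEncodeError:
        pass
    # In neither JIS set: only the Unicode code point is available
    return "Unicode"
```

Running each character that is missing from the target set through such a check indicates the earliest standard set it can be mapped into, so that only the residue has to be kept as user-defined characters.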

7. SYSTEM ANALYSIS AND DATABASE DESIGN

Each entry character is associated with numerous attributes, such as a core meaning, various readings, multiple senses for each reading, and stylistic labels, and is also a member of various cross-reference networks. For example, 暖 and 温 share the *kun* reading *atatakai* but have slightly different connotations when used as free morphemes. On the other hand, 煖 and 暖 share the same meanings and *on* reading *dan* as word elements, e.g. as a verb 'to warm', but the free form 煖かい *atatakai* 'warm' is not normally used.
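The attributes listed above can be pictured as a nested record. The following Python sketch is only an illustration, not the DESK record layout: the class and field names are assumptions, and the sample data are the *yoo*/*utai* readings of 謡 cited in Section 4. Attributes the paper does not state, such as the core meaning of 謡, are left empty.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sense:
    gloss: str
    label: Optional[str] = None  # stylistic label, if any


@dataclass
class Reading:
    romaji: str
    kind: str                     # "on" or "kun"
    bound: Optional[bool] = None  # True if used only as a word element
    senses: List[Sense] = field(default_factory=list)


@dataclass
class Entry:
    kanji: str
    core_meaning: Optional[str] = None
    readings: List[Reading] = field(default_factory=list)
    cross_refs: List[str] = field(default_factory=list)

    def senses_for(self, kind: str) -> List[str]:
        """Collect the glosses of all readings of the given kind."""
        return [s.gloss for r in self.readings if r.kind == kind
                for s in r.senses]


# Hypothetical record built from the example in Section 4.
entry = Entry(
    kanji="謡",
    readings=[
        Reading("yoo", "on", bound=True, senses=[Sense("popular song")]),
        Reading("utai", "kun", senses=[Sense("chanting of a noh text")]),
    ],
)
```

Here `entry.senses_for("on")` yields the sense attached to the *on* reading only, which mirrors the point that a sense depends on the reading under which the character appears.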

The entry characters and their attributes thus form an inherently complex network of semantic, orthographic and phonologic relations and subrelations, often interrelated in highly complex hierarchical structures that do not easily lend themselves to representation by traditional one-to-many and many-to-many relations. Expressing such intricate interrelations in a manner conducive to their effective extraction and analysis approaches the limit of relational databases, and ideally requires a network database design. To do so within the limits of RDB systems requires a thorough analysis aimed at discovering the most effective constructs that will, on the one hand, capture and represent the relations between entry characters, compounds, and their respective attributes, and, on the other, allow easy manipulation of the data with a view to efficiently generating a wide range of applications.
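A many-to-many relation of the simplest kind, characters that share readings, can be sketched with Python's built-in `sqlite3` module. SQLite is used here only as a convenient stand-in for a relational system; it is not the platform used by the project, and the table names, column names and `sharing` helper are assumptions. The four sample rows restate the 暖/温 and 煖/暖 example given above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE kanji   (id INTEGER PRIMARY KEY, form TEXT UNIQUE);
CREATE TABLE reading (id INTEGER PRIMARY KEY, romaji TEXT, kind TEXT,
                      UNIQUE (romaji, kind));
-- junction table: one row per (character, reading) pair
CREATE TABLE kanji_reading (
    kanji_id   INTEGER REFERENCES kanji(id),
    reading_id INTEGER REFERENCES reading(id),
    PRIMARY KEY (kanji_id, reading_id)
);
""")

# Sample pairs taken from the example in the text
pairs = [("暖", "atatakai", "kun"), ("温", "atatakai", "kun"),
         ("暖", "dan", "on"), ("煖", "dan", "on")]
for form, romaji, kind in pairs:
    con.execute("INSERT OR IGNORE INTO kanji (form) VALUES (?)", (form,))
    con.execute("INSERT OR IGNORE INTO reading (romaji, kind) VALUES (?, ?)",
                (romaji, kind))
    con.execute("""INSERT INTO kanji_reading
                   SELECT k.id, r.id FROM kanji k, reading r
                   WHERE k.form = ? AND r.romaji = ? AND r.kind = ?""",
                (form, romaji, kind))


def sharing(romaji, kind):
    """List the characters linked to the given reading, in insertion order."""
    rows = con.execute("""SELECT k.form FROM kanji k
                          JOIN kanji_reading kr ON kr.kanji_id = k.id
                          JOIN reading r ON r.id = kr.reading_id
                          WHERE r.romaji = ? AND r.kind = ?
                          ORDER BY k.id""", (romaji, kind))
    return [form for (form,) in rows]
```

The junction table handles shared readings cleanly, but distinctions such as "shares the reading yet is not normally used as a free form" need further attributes on the link itself, which is where the hierarchical subrelations begin to strain the flat relational model.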

In spite of these limitations, we have chosen to adopt dBASE IV, a relational database management system, for a number of reasons, especially its universal availability, ease of manipulating data and developing applications using the Xbase language, and easy portability to other systems. We are also using Perl, a powerful language for text processing and string manipulation.

8. DEVELOPMENT OF DATABASE SYSTEM

The DESK database contains (or will contain) detailed information on every important aspect of Chinese characters as used in CJK languages and the principal Chinese dialects. This includes printed and calligraphic forms, in-depth semantics, phonemics, encoding methods, indexing schemes, synonyms and homophones, character etymology (based on Halpern 1987) and a wealth of other reference data.

The development of software for building the DESK database and the feeding of data to the system is being implemented in six stages.

  1. Developing software for restructuring the old format of NJECD's data to a rationally-structured relational database system on a dBASE platform.
  2. Defining structures and developing software for building a system that is (a) sufficiently flexible to integrate the NJECD database into the broader framework of a comprehensive CJK database system (DESK) and (b) sufficiently open-ended to accommodate large-scale expansion.
  3. Developing software and a menu-driven user interface for querying, searching, sorting, and otherwise manipulating the database system.
  4. Thorough testing, revision, and maintenance of the system.
  5. Building a pilot system for generating data for the New Kanji-English Pocket Dictionary in order to verify that the system is sufficiently robust to cope with dictionary compilation under field conditions.
  6. Feeding large volumes of data to the database from various sources, including NJECD and its German edition, character meanings, compounds and their equivalents, frequency statistics, CJK character readings, character codes, calligraphic styles, etymology, stroke-order diagrams, etc. The system will grow organically through the addition of data from new sources, the compilation of new dictionaries, and the expansion of existing ones.

9. DEVELOPMENT OF KIT APPLICATIONS

The development and compilation of KIT applications and products is being carried out in three stages:

  1. designing the system for each application by (a) performing an in-depth analysis of its special features, such as the range of coverage, ordering scheme, entry layout, appendixes and indexes, and by (b) drawing up software specifications for each application.
  2. building a system for each application by developing application-specific software.
  3. thorough testing, revision, and maintenance of software.

The production of KIT printed products is being carried out in four stages:

  1. adding new data (such as German core meanings)
  2. editing the data generated by each application-specific system, and repeatedly checking the data until it is error-free
  3. developing software to process the data prior to computerized photocomposition
  4. preparing camera-ready mechanicals by DTP and/or computerized photocomposition, to be followed by printing and binding.

NOTE

Lexicography is not yet a recognized discipline in Japan. By building a comprehensive CJK database and using it for compiling numerous lexicographic works, this project will make a significant contribution to the advancement and eventual establishment of lexicography as a branch of learning in Japan, and to the promotion of the study and research of CJK languages.

REFERENCES

APPENDIX A: LIST OF KIT APPLICATIONS

Below is a list of the principal dictionaries, reference works and learning tools (KIT applications) that could be compiled on the basis of the DESK database. (The asterisk indicates that more detailed information is available for that item.)

1. GENERAL CHARACTER DICTIONARIES 一般漢英字典

  1. * NTC's New Japanese-English Character Dictionary (NTC, 1993)
  2. * New Kanji-English Pocket Dictionary 新漢英小字典
  3. * New Kanji-English Learner's Dictionary 新漢英学習字典
  4. * Japanese-English Dictionary of Kanji Synonyms 類義漢字和英辞典
  5. Pocket Kanji Thesaurus 類義漢字和英小辞典
  6. * Japanese-English Dictionary of Kanji Usage 同訓使い分け和英辞典
  7. Japanese-English Kanji Compounds Dictionary 実用漢英熟語字典・一般編
  8. * New Japanese-German Character Dictionary 新漢独字典
  9. New Japanese-Spanish Character Dictionary 新漢西字典
  10. New Japanese-French Character Dictionary 新漢仏字典

2. SPECIAL-PURPOSE DICTIONARIES/REFERENCE WORKS 特殊漢字字典・参考書

  1. Introduction to Kanji 漢字入門
  2. * Kanji-English Dictionary for Business and Economics 実用漢英熟語字典・経済編
  3. Kanji-English Dictionary for the Arts and Humanities 実用漢英熟語字典・文化編
  4. Kanji-English Dictionary for Science and Technology 実用漢英熟語字典・科学技術編
  5. Introduction to Kanji Compound Formation 漢字熟語成立ち入門
  6. Japanese-English Dictionary of Prefixes and Suffixes 漢字接辞和英辞典
  7. Japanese-English Dictionary for Counters and Units 単位・助数詞和英辞典
  8. Kanji Reference Handbook 漢英参考情報便覧
  9. Japanese-English Dictionary of Character Etymology 漢英字源字典
  10. Introduction to the Radical System 漢字部首入門
  11. Introduction to Written Japanese 日本語書き方入門
  12. * Comparative Study of Sino-Japanese Lexical Items 漢語語彙比較研究

3. ELECTRONIC DICTIONARIES, OTHERS 電子字典・その他

  1. Kanji Learner's Electronic Dictionary 電子漢字学習機
  2. Kanji Learner's Courseware 漢字学習コースウェア
  3. * Kanji Input System Based on System of Kanji Indexing by Patterns 字型検字法による漢字入力方式
  4. Kanji Games Software Kit 漢字学習ゲームソフト
  5. JIS Kanji Index Based on System of Kanji Indexing by Patterns 字型検字法によるＪＩＳ漢字索引
  6. * New Japanese-English Character Dictionary: Electronic Book Edition 新漢英字典電子ブック版
  7. New Japanese-English Character Dictionary: CD-ROM Edition 新漢英字典ＣＤ－ＲＯＭ版
  8. Kanji Learner's Wall Chart 漢字学習貼紙表
  9. * Kanji Cards 漢字学習カード
  10. Introduction to Kanji: Video Edition 漢字学習ビデオ
  11. Train and Subway Kanji Guide 電車・列車漢字案内
  12. Restaurant Kanji Guide レストラン漢字案内

4. DICTIONARIES AND AIDS FOR JAPANESE USERS 日本人対象の字典・教材

  1. Dictionary of Kanji Synonyms 類義漢字辞典
  2. Pocket Kanji Thesaurus 類義漢字小辞典
  3. Dictionary of Kanji Usage 同訓使い分け辞典
  4. Kanji Learner's Dictionary for Elementary Schoolchildren 小学生用漢字学習字典
  5. Dictionary of Kanji Compound Formation 漢字熟語構成辞典
  6. Kanji Learner's Courseware 漢字学習コースウェア
  7. Kanji Learner's Dictionary: Electronic Book Edition 漢字学習字典電子ブック版
  8. Introduction to Kanji Compound Formation 漢字熟語成立ち入門
  9. Kanji Learner's Graded Wall Chart 学年別漢字学習貼紙表

APPENDIX B: EDITORIAL COMMITTEE OF KANJI DICTIONARY PUBLISHING SOCIETY

KUSUO HITOMI
President of Showa Women's University
Director General and President of KDPS
Chairman of KDPS Editorial Committee

OKI HAYASHI
President of the Society for Teaching Japanese as a Foreign Language
Formerly President of the National Language Research Institute
Consultant to KDPS Editorial Committee

OSAMU MIZUTANI
Director General of the National Language Research Institute
Councilor of the Society for Teaching Japanese as a Foreign Language
Consultant to KDPS Editorial Committee

SHIGEHIKO TOYAMA
Professor at the Graduate School of Literature, Showa Women's University
Member of KDPS Editorial Committee

TAKASHI TAKAMIZAWA
Professor/Director of the Course of Japanese Literature, Showa Women's University
Member of KDPS Editorial Committee

CHIKASADA HARADA
Professor of Japanese Literature, Showa Women's University
Member of KDPS Editorial Committee

TOMOKO KANEKO
Professor of English and American Literature, Showa Women's University
Member of KDPS Editorial Committee

KEN LUNDE
Project Manager of Japanese Font Production at Adobe Systems, Inc.
Technical Consultant to KDPS

YOSHIAKI TAKEBE
Formerly Professor at Waseda University
Member of KDPS Editorial Committee

MASAAKI NOMURA
Professor of Japanese at Center for Japanese Language, Waseda University
Member of KDPS Editorial Committee

ATSUSHI FUKADA
Assistant Professor of Applied Linguistics at Center for Linguistic and Cultural Research, Nagoya University
Member of KDPS Editorial Committee

YOICHIRO YAMAMURA
President of Brain Brigade Systems, Ltd.
Production and Marketing Consultant to KDPS

JACK HALPERN
Research Fellow at Institute of Modern Culture, Showa Women's University
Editor in Chief of New Japanese-English Character Dictionary
Editor in Chief of Kanji Integrated Tools Project

APPENDIX C: OVERVIEW OF PRINCIPAL FEATURES

Listed below are the principal features of DESK-KIT applications and products. The presence or absence of a specific feature depends on the item in question. For more information, see the individual descriptions for each project (available on request) and "Features of This Dictionary" on page 61 of NJECD.