
This report provides a brief overview of some linguistic issues related to two text conversion procedures known as transliteration and transcription. A related technology, called transcripting (Chinese-to-Chinese conversion), is described in detail in The Pitfalls and Complexities of Chinese to Chinese Conversion.
The orthographic complexity of some of the major languages using non-Roman scripts, such as Chinese, Japanese, and Arabic, poses formidable challenges to information processing applications. Some factors contributing to this include the large number of characters used in Japanese and Chinese and their complex forms, the lack of vowels in Arabic and other Semitic scripts (known as abjads), and the presence of a large number of orthographic variants. From an information processing point of view there are many complex issues, such as morphological analysis, incompatible character sets, text retrieval, a plethora of input methods, and many others. These are beyond the scope of this brief report, and are mostly covered in a series of papers and articles found at www.cjk.org/cjk/reference.
There is much confusion surrounding the terminology related to the general process of representing the characters of one script in those of another (such as writing Japanese or Arabic in the Roman alphabet), which includes various procedures such as transliteration, transcription, romanization, transcribing, and technography. Sometimes the term transliteration or transcription is used as a generic term for all these processes, which is quite misleading since it does not distinguish orthographical transliteration (one-to-one graphemic mapping) from transcription (essentially one-to-many phonemic mapping).
Note that though the terminology used here is "theoretically correct," and is used by linguists, especially grammatologists, these terms are not standardized. In this report, transliteration is used in the strict sense of orthographical transliteration.
The aim of transliteration is to represent the script of a source language by using the letters or symbols of another script, usually in accordance with the orthographical conventions of the target language. Let us take Bin Ladin as an example. In Arabic this is written, from right to left, using the following six letters:
| 6 | 5 | 4 | 3 | 2 | 1 | |
|---|---|---|---|---|---|---|
| ن | د | ا | ل | ن | ب | |
| n | d | ' | l | n | b | |
| nun | dal | alif | lam | nun | ba | |
In Arabic, the independent shapes of some letters (graphemes) undergo form transformations (allographs) depending on their position in the word (see the charts at the end of this report). Thus in actual Arabic script Bin Ladin is written: بن لادن
The essence of transliteration is that each letter (more precisely, each grapheme) is represented by one character or sometimes multiple characters (digraphs or trigraphs). Thus in the table above b corresponds to the letter ba (ب) and n to the letter nun (ن). In good transliteration systems, there is always full one-to-one correspondence to ensure round-trip conversion.
The word بن is actually pronounced bin, but transliteration does not attempt to represent this. It merely maps source script graphemes to target graphemes and is thus graphemic in nature.
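The graphemic mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not a complete system: the table covers only the five distinct letters from the example, and the names `TRANSLIT`, `transliterate`, and `back_transliterate` are hypothetical. Because each source letter maps to exactly one target character, the table can be inverted and the original spelling recovered.

```python
# Hypothetical one-to-one table covering only the letters of the example;
# a real system would cover the whole script.
TRANSLIT = {
    "\u0628": "b",  # ba   (ب)
    "\u0646": "n",  # nun  (ن)
    "\u0644": "l",  # lam  (ل)
    "\u0627": "'",  # alif (ا)
    "\u062F": "d",  # dal  (د)
}

# Inverting the table is only safe because the mapping is one-to-one.
REVERSE = {latin: arabic for arabic, latin in TRANSLIT.items()}


def transliterate(text):
    """Map each source grapheme to its target character; pass others through."""
    return "".join(TRANSLIT.get(ch, ch) for ch in text)


def back_transliterate(text):
    """Recover the source spelling from the transliterated text."""
    return "".join(REVERSE.get(ch, ch) for ch in text)


source = "\u0628\u0646 \u0644\u0627\u062F\u0646"  # the six letters of the table
latin = transliterate(source)
print(latin)  # bn l'dn
print(back_transliterate(latin) == source)  # True
```

Note that the output is `bn l'dn`, not `bin ladin`: the vowels are not written in the source, so a purely graphemic mapping has nothing to map them from.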
Transcription is the representation of the source script of a language in the target script in a manner that reflects the pronunciation of the original, often ignoring graphemic (character-to-character) correspondence. This can be a phonetic transcription, which uses a phonetic alphabet such as IPA to represent the actual speech sounds of the source language (including allophones), or a phonemic transcription, which uses scientific or conventional orthography to represent the phonemes of the source language (ignoring allophones), such as in the romanization of Arabic or Japanese.
Using our example of Bin Ladin, in Arabic this is actually pronounced [bin ladin], and the transcription Bin Ladin reflects this rather accurately, whereas such variants as Bin Laden and Ben Laden do not. As is well known, Arabic is normally written without vowel signs and thus there is no direct way to know the vowels associated with each consonant. With the vowel signs, Bin Ladin is written as follows: بِن لَادِن
Your browser may not render this properly but you should notice several diacritics above and below some letters which indicate vowels. For example, under the letter dal (د) there is a diagonal line that indicates that the consonant + vowel combination is pronounced [di]. As can be seen, phonemic transcription, which represents the phonemes of the source language, is extremely difficult to achieve in Arabic because the vowel information is missing in normal unvoweled Arabic.
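The dependence of phonemic transcription on vowel signs can be illustrated with a small sketch. Several things here are assumptions made for illustration: the tables cover only the letters of the example, the voweled spelling is a reconstruction from the pronunciation [bin ladin] given above, alif is treated as a mere lengthener of the preceding vowel (length is not marked, matching the spelling "Ladin"), and the function name `transcribe` is hypothetical.

```python
# Hypothetical tables covering only the letters and vowel signs of the example.
CONSONANTS = {"\u0628": "b", "\u0646": "n", "\u0644": "l", "\u062F": "d"}
VOWEL_SIGNS = {
    "\u064E": "a",  # fatha
    "\u0650": "i",  # kasra
    "\u064F": "u",  # damma
    "\u0652": "",   # sukun: explicitly no vowel
}
ALIF = "\u0627"


def transcribe(text):
    """Naive transcription: consonants plus whatever vowel signs are present."""
    out = []
    for ch in text:
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])
        elif ch == ALIF:
            # Simplifying assumption: alif only lengthens the preceding
            # vowel, and vowel length is not marked in the output.
            continue
        else:
            out.append(ch)
    return "".join(out)


# Assumed voweled spelling: kasra under ba and dal, fatha over lam.
voweled = "\u0628\u0650\u0646 \u0644\u064E\u0627\u062F\u0650\u0646"
unvoweled = "\u0628\u0646 \u0644\u0627\u062F\u0646"

print(transcribe(voweled))    # bin ladin
print(transcribe(unvoweled))  # bn ldn
```

With the vowel signs present the result is readable; without them only the consonant skeleton survives, and the missing vowels would have to come from a lexicon or from linguistic rules rather than from the text itself.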
Listed below are some applications of transcription and transliteration technology. There are various other possibilities.
Good transcription/transliteration software should have the following features:
Both transcription and transliteration technologies have useful roles to play, but transcription is often far more difficult to achieve. Though phonemic transcription does not easily lend itself to round-trip conversion, it is quite useful in a variety of applications since the results are written in a human-friendly conventional orthography. But this does not mean that transliteration is not useful. On the contrary, properly transliterated texts are very easy to manipulate and store in any computer application, even without OS support for the source script.
The CJK Dictionary Institute has developed a transliteration/transcription tool, provisionally called TRANS. This is a generic tool that works on any language pair and can handle complex orthographies like Arabic, Hebrew, Japanese and Chinese. It is a sophisticated tool with numerous features and options that allow the conversion to be fine-tuned to specific requirements, and it uses script-specific mapping tables and rule tables (some very complex). We use it for converting Arabic, Russian, Simplified <> Traditional Chinese, and other scripts. In principle, TRANS can not only perform round-trip transliteration with 100% accuracy, but can even perform strict phonemic and phonetic transcription of such complex languages as unvoweled Arabic, Japanese, and Korean.
For the tool to work correctly, precise and complete mapping tables and, in the case of phonemic transcription, complex rules using regular expressions are required. We have completed some of these tables (such as for Arabic transliteration and Chinese transcription, among others) and others are under development. Our team of linguists has in-depth knowledge, especially of CJK and Semitic languages, and we are confident in our ability to build mapping tables and rule tables for any language pair.
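To give a rough idea of what a regular-expression rule table might look like, here is a generic sketch. It does not show how TRANS itself is implemented: the names `RULES` and `apply_rules` are hypothetical, and the single rule is a simplified version of a well-known Arabic pronunciation rule, in which the l of the definite article al- assimilates to a following "sun letter" (written here in plain romanization without diacritics).

```python
import re

# Simplified set of "sun letters" in plain romanization; digraphs come
# first so that "sh" is not matched as a bare "s".
SUN_LETTER = r"(sh|th|dh|[tdrzsnl])"

# A rule table is an ordered list of (pattern, replacement) pairs.
# This illustrative table has a single entry.
RULES = [
    # al- + sun letter: the l assimilates to the following consonant.
    (re.compile(r"\bal-" + SUN_LETTER),
     lambda m: "a" + m.group(1) + "-" + m.group(1)),
]


def apply_rules(text, rules=RULES):
    """Apply each rule in order to the whole text."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


print(apply_rules("al-shams"))  # ash-shams
print(apply_rules("al-nur"))    # an-nur
print(apply_rules("al-qamar"))  # al-qamar (q is not a sun letter)
```

Keeping the rules in an ordered list rather than hard-coding them means that linguists can add or reorder rules per language pair without touching the conversion code.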
See the following links for more information:
A chart showing various transcription and transliteration systems for Arabic can be viewed at Transcriptions_a.doc. This is an MS Word document with various special fonts that are difficult to display in HTML and even in Word. If you cannot display it properly in Word, please contact the author.