Introduction to i18n
--------------------

Tomohiro KUBOTA
2 March 2008

-------------------------------------------------------------------------------

Abstract
--------

This document describes the basic concepts of i18n (internationalization), how to write internationalized software, and how to modify existing software to internationalize it. Handling of characters is discussed in detail. There are a few case studies in which the author internationalized software such as TWM.

Copyright Notice
----------------

Copyright (C) 1999-2001 Tomohiro KUBOTA.

Chapters and sections whose original author is not KUBOTA are copyrighted by their authors. Their names are written at the top of the chapter or the section.

This manual is free software; you may redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2, or (at your option) any later version.

This is distributed in the hope that it will be useful, but _without any warranty_; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU General Public License for more details.

A copy of the GNU General Public License is available as `/usr/share/common-licenses/GPL' in the Debian GNU/Linux distribution or on the World Wide Web at http://www.gnu.org/copyleft/gpl.html. You can also obtain it by writing to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.

-------------------------------------------------------------------------------

Contents
--------

1. About This Document
1.1. Scope
1.2. New Versions of This Document
1.3. Feedback and Contributions

2. Introduction
2.1. General Concepts
2.2. Organization

3. Important Concepts for Character Coding Systems
3.1. Basic Terminology
3.2. Stateless and Stateful
3.3. Multibyte encodings
3.4. Number of Bytes, Number of Characters, and Number of Columns

4. Coded Character Sets And Encodings in the World
4.1. ASCII and ISO 646
4.2. ISO 8859
4.3. ISO 2022
4.3.1. EUC (Extended Unix Code)
4.3.2. ISO 2022-compliant Character Sets
4.3.3. ISO 2022-compliant Encodings
4.4. ISO 10646 and Unicode
4.4.1. UCS as a Coded Character Set
4.4.2. UTF as Character Encoding Schemes
4.4.3. Problems on Unicode
4.5. Other Character Sets and Encodings
4.5.1. Big5
4.5.2. UHC
4.5.3. Johab
4.5.4. HZ, aka HZ-GB-2312
4.5.5. GBK
4.5.6. GB18030
4.5.7. GCCS
4.5.8. HKSCS
4.5.9. Shift-JIS
4.5.10. VISCII
4.5.11. TRON
4.5.12. Mojikyo

5. Characters in Each Country
5.1. Japanese language / used in Japan
5.1.1. Characters used in Japanese
5.1.2. Character Sets
5.1.3. Encodings
5.1.4. How These Encodings Are Used --- Information for Programmers
5.1.5. Columns
5.1.6. Writing Direction and Combined Characters
5.1.7. Layout of Characters
5.1.8. LANG variable
5.1.9. Input from Keyboard
5.1.10. More Detailed Discussions
5.2. Spanish language / used in Spain, most of America and Equatorial Guinea
5.2.1. Characters used in Spanish
5.2.2. Character Sets
5.2.3. Codesets
5.2.4. How These Codesets Are Used --- Information for Programmers
5.2.5. Columns
5.2.6. Writing Direction
5.2.7. Layout of Characters
5.2.8. LANG variable
5.2.9. Input from Keyboard
5.2.10. More Detailed Discussions
5.3. Languages with Cyrillic script

6. LOCALE technology
6.1. Locale Categories and `setlocale()'
6.2. Locale Names
6.3. Multibyte Characters and Wide Characters
6.4. Unicode and LOCALE technology
6.5. `nl_langinfo()' and `iconv()'
6.6. Limit of Locale technology

7. Output to Display
7.1. Console Softwares
7.1.1. Encoding
7.1.2. Number of Columns
7.2. X Clients
7.2.1. Xlib programming
7.2.2. Athena widgets
7.2.3. Gtk and Gnome
7.2.4. Qt and KDE

8. Input from Keyboard
8.1. Non-X Softwares
8.2. X Softwares
8.2.1. Developing XIM clients
8.2.2. Examples of XIM softwares
8.2.3. Using XIM softwares
8.3. Emacsen

9. Internal Processing and File I/O
9.1. Stream I/O of Characters
9.2. Character Classification
9.3. Length of String
9.4. Extraction of Characters

10. the Internet
10.1. Mail/News
10.2. WWW

11. Libraries and Components
11.1. Gettext and Translation
11.1.1. Gettext-ization of A Software
11.1.2. Translation
11.2. Readline Library
11.3. Ncurses Library

12. Softwares Written in Other than C/C++
12.1. Fortran
12.2. Pascal
12.3. Perl
12.4. Python
12.5. Ruby
12.6. Tcl/Tk
12.7. Java
12.8. Shell Script
12.9. Lisp

13. Examples of I18N
13.1. TWM -- usage of XFontSet instead of XFontStruct
13.1.1. Introduction
13.1.2. Locale Setting - A Routine Work
13.1.3. Font Preparation
13.1.4. Automatic Font Guessing
13.1.5. Font Preparation (continued)
13.1.6. Drawing Text using `MyFont'
13.1.7. Getting Size of Texts
13.1.8. Getting Window Titles
13.1.9. Getting Icon Names
13.1.10. Configuration File Parser
13.2. 8bit-clean-ize of Minicom
13.2.1. 8bit-clean-ize
13.2.2. Not to break continuity of multibyte characters
13.2.3. Catalog in EUC-JP and SHIFT-JIS
13.3. user-ja -- two sets of messages in ASCII and native codeset in the same language
13.3.1. Introduction
13.3.2. Strategy
13.3.3. Implementation
13.4. A Quasi-Wrapper to Internationalize Text Output of X Clients
13.4.1. Introduction
13.4.2. Strategy
13.4.3. Usage of the wrapper
13.4.4. The Header File of the Wrapper
13.4.5. The Source File of the Wrapper

14. References

-------------------------------------------------------------------------------

1. About This Document
----------------------

1.1. Scope
----------

This document describes the basic ideas of I18N. It is written for programmers and package maintainers of Debian GNU/Linux and other UNIX-like platforms. Its aim is to offer an introduction to the basic concepts, to character codes, and to the points where care should be taken when writing I18N-ed software or an I18N patch for existing software. There is a large body of know-how and many case studies on the internationalization of software.
This document also tries to introduce the current state of, and the existing problems for, each language and country.

The document stresses minimum requirements: characters should be displayed with fonts of the proper charset (users of the software must at least be able to guess what is written), it must be possible to input characters from the keyboard, and software must not destroy characters. I am trying to describe a HOWTO to satisfy these requirements.

This document is strongly related to programming languages such as C and to standardized I18N methods such as locales and `gettext'.

1.2. New Versions of This Document
----------------------------------

The current version of this document is available at the DDP (Debian Documentation Project) page (http://www.debian.org/doc/ddp).

Note that the author rewrote this document in November 2000.

1.3. Feedback and Contributions
-------------------------------

This document needs contributions, especially for the chapter on each language (Chapter 5, `Characters in Each Country') and the chapter on instances of I18N (Chapter 13, `Examples of I18N'). These chapters consist of contributions. Without them, this would be a document only on Japanization, because the original author, Tomohiro KUBOTA, speaks Japanese and lives in Japan.

Section 5.2, `Spanish language / used in Spain, most of America and Equatorial Guinea' is written by Eusebio C Rufian-Zilbermann.

Discussions are held on the `debian-devel@lists.debian.org' mailing list. (Maybe `debian-doc' or `debian-i18n' would be more suitable?)

-------------------------------------------------------------------------------

2. Introduction
---------------

2.1. General Concepts
---------------------

Debian includes many pieces of software. Though many of them can process, input, and output text data, some of these programs assume that text is written in English (ASCII). For people who use non-English languages, these programs are barely usable.
Moreover, though many programs can handle not only ASCII but also ISO-8859-1, some of them cannot handle multibyte characters for CJK (Chinese, Japanese, and Korean) languages, nor combined characters for Thai.

So far, people who use non-English languages have given up using their native languages and have accepted computers as they were. However, we should now abandon that mistaken idea. It is absurd that a person who wants to use a computer has to learn English in advance.

I18N is needed in the following places.

* Displaying characters for the users' native languages.
* Inputting characters for the users' native languages.
* Handling files written in popular encodings [1] that are used for the users' native languages.
* Using characters from the users' native languages for file names and other items.
* Printing out characters from the users' native languages.
* Displaying messages by the program in the users' native languages.
* Formatting input and output of numbers, dates, money, etc., in a way that obeys the customs of the users' native cultures.
* Classifying and sorting characters in a way that obeys the customs of the users' native cultures.
* Using typesetting and hyphenation rules appropriate for the users' native languages.

This document puts emphasis on the first three items, because they are the basis for the other items. Another reason is that you cannot use software lacking the first three items at all, while you can use software lacking the other items, albeit inconveniently. This document will also mention translation of messages (item 6), which is often called 'I18N'. Note that the author regards the use of the term 'I18N' for translation and `gettext'ization as completely wrong. The reason is reflected in the fact that the author did not include translation and `gettext'ization in the important first three items.
Imagine a word processor which can display error and help messages in your native language but cannot process text in your native language. You will easily understand that such a word processor is not usable. On the other hand, a word processor which can process your native language, but only displays error and help messages in English, is usable, though it is not convenient. Before we think of developing convenient software, we have to think of developing usable software.

The following terminology is widely used.

* I18N (internationalization) means modification of a piece of software or related technologies so that the software can potentially handle multiple languages, customs, and so on in the world.
* L10N (localization) means implementation of a specific language for already internationalized software.

However, this terminology is valid only for one specific model out of a few models which we should consider for I18N. Now I will introduce a few models other than this I18N-L10N model.

a. _L10N_ (localization) model
     This model supports two languages or character codes: English (ASCII) and one other specific language. Examples of software developed using this model are Nemacs (Nihongo Emacs, an ancestor of MULE, MULtilingual Emacs), a text editor which can input and output Japanese text files, and Hanterm, an X terminal emulator which can display and input Korean characters via a few Korean encodings. Since each programmer has his or her own mother tongue, there are numerous L10N patches and L10N programs written to satisfy the programmer's own needs.

b. _I18N_ (internationalization) model
     This model supports many languages, but only two of them, English (ASCII) and one other, at the same time. One has to specify the other language, usually with the `LANG' environment variable. The above I18N-L10N model can be regarded as a part of this I18N model. `gettext'ization is categorized into the I18N model.

c. _M17N_ (multilingualization) model
     This model supports many languages at the same time. For example, Mule (MULtilingual Enhancement to GNU Emacs) can handle a text file which contains multiple languages, for example a paper on differences between Korean and Chinese whose main text is written in Finnish. GNU Emacs 20 and XEmacs now include Mule. Note that the M17N model can only be applied in character-related instances. For example, it is nonsense to display a message like 'file not found' in many languages at the same time. Unicode and UTF-8 are technologies which can be used for this model. [2]

Generally speaking, the M17N model is the best and the I18N model is the second-best. The L10N model is the worst, and you should not use it except for a few fields where the I18N and M17N models are very difficult, such as DTP and X terminal emulators. In other words, it is better for text-processing software to handle many languages at the same time than to handle two (English and another language).

Now let me classify approaches to supporting non-English languages from another viewpoint.

A. Implementation _without_ knowledge of each language
     This approach utilizes standardized methods supplied by the kernel or libraries. The most important one is _locale_ technology, which includes _locale categories_, conversion between _multibyte_ and _wide characters_ (`wchar_t'), and so on. Another important technology is `gettext'. The advantages of this approach are (1) that when the kernel or libraries are upgraded, the software will automatically support newly added languages, (2) that programmers need not know each language, and (3) that a user can switch the behavior of software with a common method, such as the `LANG' variable. The disadvantage is that there are categories or fields where a standardized method is not available. For example, there are no standardized methods for text typesetting rules such as line-breaking and hyphenation.

B. Implementation using knowledge of each language
     This approach directly implements information about each language, based on the knowledge of programmers and contributors. L10N almost always uses this approach. The advantage of this approach is that a detailed and strict implementation is possible beyond the fields where standardized methods are available, such as auto-detection of the encodings of text files to be read. Language-specific problems can be solved completely (of course, this depends on the skill of the programmer). The disadvantages are (1) that the number of supported languages is restricted by the skill or the interest of the programmers or the contributors, (2) that labor which should be united and concentrated on upgrading the kernel or libraries is dispersed over many programs, that is, the wheel is re-invented, and (3) that a user has to learn how to configure each program, such as the `LESSCHARSET' variable, the `.emacs' file, and other methods. This approach can cause problems: for example, GNU roff (before version 1.16) assumes `0xad' to be a hyphen character, which is valid only for ISO-8859-1. However, a majestic M17N program such as Mule can be built using this approach.

Using this classification, let me consider the L10N, I18N, and M17N models from the programmer's point of view.

The L10N model can be realized using only the programmer's own knowledge of his or her language (i.e. approach B). Since the motivation of L10N is usually to satisfy the programmer's own needs, extensibility to other languages is often ignored. Though L10N-ed software is primarily useful for people who speak the same language as the programmer, it is sometimes useful for other people whose coding system is similar to the programmer's. For example, software which doesn't recognize EUC-JP but doesn't break EUC-JP will not break EUC-KR either.

The main part of the I18N model is, in the case of a C program, achieved using standardized locale technology and `gettext'.
A locale approach is classified as I18N because locale-related functions change their behavior according to the current locales for six categories, which are set by `setlocale()'. Namely, approach A is emphasized for I18N. For fields where standardized methods are not available, however, approach B cannot be avoided. Even in such a case, the developers should take care that support for new languages can easily be added later, even by other developers.

The M17N model can be achieved using international encodings such as ISO 2022 and Unicode. Though you can hard-code these encodings in your software (i.e. approach B), I recommend using standardized locale technology. However, using international encodings is not sufficient to achieve the M17N model. You will have to prepare a mechanism to switch _input methods_. You will also want to prepare an encoding-guessing mechanism for input files, such as `jless' and `emacs' have. Mule is the best software which has achieved M17N (though it does not use locale technology).

[1] There are a few terms related to character codes, such as character set, character code, charset, encoding, codeset, and so on. These words are explained later.

[2] I recommend not implementing Unicode and UTF-8 directly. Instead, use locale technology, and your software will support not only UTF-8 but also many other encodings in the world. If you implement UTF-8 directly, your software can handle UTF-8 only. Such software is not convenient.

2.2. Organization
-----------------

Let's preview the contents of each chapter in this document.

As I wrote, this document puts stress on the correct handling of characters and character codes for users' native languages. To achieve this purpose, I will start the real contents of this document by discussing basic important concepts on characters in Chapter 3, `Important Concepts for Character Coding Systems'. Since this chapter introduces many terms, all of you will need to read it.

The next chapter, Chapter 4, `Coded Character Sets And Encodings in the World', introduces many national and international standards of _coded character sets_ and _encodings_. I think most of you can do without reading this chapter, since _LOCALE_ technology enables us to develop international software without knowledge of these character sets and encodings. However, knowing about these standards will help you to understand the merit and necessity of LOCALE technology.

The following chapter, Chapter 5, `Characters in Each Country', describes detailed information for each language. This information will help people who develop high-quality text-processing software such as DTP software and Web browsers.

Chapter 6, `LOCALE technology', describes the most important concept for I18N. Not only concepts but also many important C functions are introduced in this chapter.

The next few chapters, Chapter 7, `Output to Display', Chapter 8, `Input from Keyboard', Chapter 9, `Internal Processing and File I/O', and Chapter 10, `the Internet', cover important and frequent applications of LOCALE technology. You can find solutions for typical I18N problems in these chapters.

You may need to develop software using some special libraries or languages other than C/C++. Chapter 11, `Libraries and Components', and Chapter 12, `Softwares Written in Other than C/C++', are written for such purposes.

Chapter 13, `Examples of I18N', is a collection of case studies. Both generic and special technologies are discussed. You can also contribute by writing a section for this chapter.

You may want to study more; the last chapter, Chapter 14, `References', is supplied for this purpose. Some of the references listed in that chapter are very important.

-------------------------------------------------------------------------------

3. Important Concepts for Character Coding Systems
--------------------------------------------------

The character coding system is one of the fundamental elements of software and information processing. Without proper handling of character codes, your software is far from being internationalized. Thus the author begins this document with the story of character codes.

In this chapter, basic concepts such as _coded character set_ and _encoding_ are introduced. These terms will be needed to read this document and other documents on internationalization and character codes, including Unicode.

3.1. Basic Terminology
----------------------

I begin this chapter by defining a few very important words.

As many people point out, there is confusion about terminology, since words are used in various different ways. The author does not want to add new terminology to an already confusing ocean of terms. Therefore, the terminology of RFC 2130 (http://www.faqs.org/rfcs/rfc2130.html) is adopted in this document, with the one exception of the word 'character set'.

_Character_
     A character is an individual unit of which sentences and text consist. A character is an abstract notion.

_Glyph_
     A glyph is a specific instance of a character. _Character_ and _glyph_ form a pair of words. Sometimes a character has multiple glyphs (for example, '$' may have one or two vertical bars; Arabic characters have up to four glyphs each; some CJK ideograms have many glyphs). Sometimes two or more characters form one glyph (for example, the ligature 'fi'). In most cases, text data, which is intended to contain not visual information but abstract ideas, doesn't need information on glyphs, since differences between glyphs do not affect the meaning of the text. However, the distinction between different glyphs for a single CJK ideogram may sometimes be important for proper nouns such as names of persons and places.
     However, there is so far no standardized method for plain text to carry information on glyphs. This means that plain text cannot be used in some special fields such as citizen registration systems, serious DTP such as newspaper systems, and so on.

_Encoding_
     An encoding is a rule by which characters and texts are expressed as combinations of bits or bytes so that computers can handle them. The words _character coding system_, _character code_, _charset_, and so on are used with the same meaning. Basically, an _encoding_ deals with _characters_, not _glyphs_. There are many official and de-facto standards of encodings, such as ASCII, ISO 8859-{1,2,...,15}, ISO 2022-{JP, JP-1, JP-2, KR, CN, CN-EXT, INT-1, INT-2}, EUC-{JP, KR, CN, TW}, Johab, UHC, Shift-JIS, Big5, TIS 620, VISCII, VSCII, so-called 'CodePages', UTF-7, UTF-8, UTF-16LE, UTF-16BE, KOI8-R, and so on. To construct an encoding, we have to consider the following concepts. (Encoding = one or more CCS + one CES).

_Character Set_
     A character set is a set of characters. It determines the range of characters which the encoding can handle. In contrast to _coded character set_, this is often called a _non-coded character set_.

_Coded Character Set (CCS)_
     Coded character set (CCS) is a term defined in RFC 2130 (http://www.faqs.org/rfcs/rfc2130.html) and means a character set in which every character is assigned a unique number by some method. There are many national and international standards for CCS. Many national standards for CCS adopt a way of coding that obeys international standards such as ISO 646 or ISO 2022. ASCII, BS 4730, JISX 0201 Roman, and so on are examples of ISO 646 variants. All ISO 646 variants, ISO 8859-*, JISX 0208, JISX 0212, KSX 1001, GB 2312, CNS 11643, CCCII, TIS 620, TCVN 5712, and so on are examples of ISO 2022-compliant CCS. VISCII and Big5 are examples of non-ISO 2022-compliant CCS. UCS-2 and UCS-4 (ISO 10646) are also examples of CCS.

_Character Encoding Scheme (CES)_
     Character Encoding Scheme is also a term defined in RFC 2130 (http://www.faqs.org/rfcs/rfc2130.html); it refers to a method of constructing an encoding from one or more CCS. This is important when two or more CCS are used to construct an encoding. ISO 2022 is a method of constructing an encoding from one or more ISO 2022-compliant CCS. ISO 2022 is a very complex system, and subsets of ISO 2022 are usually used, such as EUC-JP (ASCII and JISX 0208), ISO-2022-KR (ASCII and KSX 1001), and so on. CES is not important for encodings with only one 8bit CCS. The UTF series (UTF-8, UTF-16LE, UTF-16BE, and so on) can be regarded as CES whose CCS is Unicode or ISO 10646.

Some other words are commonly used in relation to character codes.

_Character code_ is a widely-used word meaning _encoding_. It is a primitive and crude word for the way a computer handles characters by assigning numbers to them. For example, _character code_ can refer to an _encoding_ and can also refer to a _coded character set_. Thus this word can be used only when both can be regarded as belonging to the same category. This word should be avoided in serious discussions. This document will not use this word hereafter.

_Codeset_ is a word for _encoding_ or _character encoding scheme_. [1]

_charset_ is also a frequently used word. It is used very widely, for example, in MIME (as in `Content-Type: text/plain; charset=iso-8859-1'), in XLFD (X Logical Font Description) font names (the CharSetRegistry and CharSetEncoding fields), and so on. Note that _charset_ in MIME means _encoding_, while _charset_ in an XLFD font name means _coded character set_. This is very confusing. In this document, _charset_ and _character set_ are used in the XLFD meaning, since I think _character set_ should mean a set of characters, not an encoding.

Ken Lunde's "CJKV Information Processing" uses the term _encoding method_. He says that ISO-2022, EUC, Big5, and Shift-JIS are examples of _encoding methods_.
It seems that his _encoding method_ is _CES_ in this document. However, we should notice that Big5 and Shift-JIS are encodings while ISO-2022 and EUC are not. [2]

Character Encoding Model, Unicode Technical Report #17 (http://www.unicode.org/unicode/reports/tr17/) (hereafter, _"the Report"_) suggests a five-level model.

* ACR: Abstract Character Repertoire
* CCS: Coded Character Set
* CEF: Character Encoding Form
* CES: Character Encoding Scheme
* TES: Transfer Encoding Syntax

_TES_ is also suggested in RFC 2130 (http://www.faqs.org/rfcs/rfc2130.html). Some examples of TES are _base64_, _uuencode_, _BinHex_, _quoted-printable_, _gzip_, and so on. TES means a transform of encoded data which may (or may not) include textual data. Thus, TES is not a part of character encoding. However, TES is important in Internet data exchange.

When using a computer, we rarely have to face _ACR_. It is true that CJK people have their national standards of ACR (for example, standards for ideograms which can be used for personal names), and some of us may need to handle these ACR with computers (for example, in a citizen registration system). However, this is too heavy a theme for this document, because there are no standardized or encouraged methods to handle these ACR. You may have to build the whole system for such purposes. Good luck!

_CCS_ in _"the Report"_ is the same as what I described in this document. It has concrete examples: ASCII, ISO 8859-{1,2,...,15}, JISX 0201, JISX 0208, JISX 0212, KSX 1001, KSX 1002, GB 2312, Big5, CNS 11643, TIS 620, VISCII, TCVN 5712, UCS-2, UCS-4, and so on. Some of them are national standards, some are international standards, and others are de-facto standards.

_CEF_ and _CES_ in _"the Report"_ correspond to _CES_ in this document. This document will not distinguish these two, since I think this causes no inconvenience. An encoding with a significant CEF doesn't have a significant CES (in the meaning of _"the Report"_), and vice versa.
So why should we distinguish these two? The only exception is the UTF-16 series. In the UTF-16 series, UTF-16 is a CEF and UTF-16BE is a CES. This is the only case where we need the distinction between CEF and CES.

Now, _CES_ is a concrete concept with concrete examples: ASCII, ISO 8859-{1,2,...,15}, EUC-JP, EUC-KR, ISO 2022-JP, ISO 2022-JP-1, ISO 2022-JP-2, ISO 2022-CN, ISO 2022-CN-EXT, ISO 2022-KR, ISO 2022, VISCII, UTF-7, UTF-8, UTF-16LE, UTF-16BE, and so on. These are now encodings themselves.

The most important concept in this section is the distinction between _coded character set_ and _encoding_. A _coded character set_ is a component of an _encoding_. Text data are expressed in an _encoding_, not in a _coded character set_.

[1] Before November 2000, this document used the word _codeset_ to mean _encoding_. I changed the terminology since I could not find the word _codeset_ in documents written in English (I adopted this word from a book in Japanese). _encoding_ seems more popular.

[2] During I18N programming, we will frequently meet EUC-JP or EUC-KR, while we will rarely meet EUC. I think it is not appropriate to stress EUC, a class of encodings, over EUC-JP, EUC-KR, and so on, which are concrete encodings. It is just like regarding ISO 8859 as a concrete encoding, though ISO 8859 is a class of encodings, ISO 8859-{1,2,...,15}.

3.2. Stateless and Stateful
---------------------------

To construct an encoding from two or more CCS, a CES has to supply a method to avoid collisions between these CCS. There are two ways to do that. One is to make all characters in all the CCS have unique code points. The other is to allow characters from different CCS to have the same code point, and to have a code such as an escape sequence to switch the _SHIFT STATE_, that is, to select one character set.

An encoding with shift states is called _STATEFUL_ and one without shift states is called _STATELESS_.

Examples of stateful encodings are ISO 2022-JP, ISO 2022-KR, ISO 2022-INT-1, ISO 2022-INT-2, and so on.

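
As an illustration of a shift state, here is a minimal C sketch which counts the characters in an ISO 2022-JP byte sequence. It is not taken from any existing program. It assumes only two of the escape sequences defined in RFC 1468 (`ESC ( B' to select ASCII and `ESC $ B' to select JISX 0208), the names `shift_state' and `iso2022jp_count_chars' are hypothetical, and malformed input is not handled.

```c
#include <assert.h>
#include <stddef.h>

/* Shift states of this sketch: which coded character set is selected. */
enum shift_state { STATE_ASCII, STATE_JISX0208 };

/*
 * Count the characters in an ISO 2022-JP byte sequence of n bytes.
 * Assumption: only two escape sequences from RFC 1468 are recognized:
 *   ESC ( B  (0x1b 0x28 0x42) selects ASCII, one byte per character
 *   ESC $ B  (0x1b 0x24 0x42) selects JISX 0208, two bytes per character
 * Any other escape sequence is counted like ordinary bytes.
 */
size_t iso2022jp_count_chars(const unsigned char *s, size_t n)
{
    enum shift_state state = STATE_ASCII;   /* the text starts in ASCII */
    size_t i = 0, chars = 0;

    assert(s != NULL);
    while (i < n) {
        if (s[i] == 0x1b && i + 2 < n) {
            if (s[i + 1] == 0x28 && s[i + 2] == 0x42) {
                state = STATE_ASCII;         /* escape sequence: no character */
                i += 3;
                continue;
            }
            if (s[i + 1] == 0x24 && s[i + 2] == 0x42) {
                state = STATE_JISX0208;      /* escape sequence: no character */
                i += 3;
                continue;
            }
        }
        /* The same byte values mean different things in each state. */
        i += (state == STATE_JISX0208) ? 2 : 1;
        chars++;
    }
    return chars;
}
```

For the ten-byte input `ESC $ B 0x24 0x2c ESC ( B 0x24 0x2c', this sketch counts three characters: the first pair of bytes is read as one JISX 0208 character, and the second pair as two ASCII characters. Real programs should rather rely on the conversion functions of the C library than on such hand-written state tracking.
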
For example, in ISO 2022-JP, the two bytes `0x24 0x2c' may mean the Japanese Hiragana character 'GA' or the two ASCII characters '$' and ',', according to the shift state.

3.3. Multibyte encodings
------------------------

Encodings are classified into multibyte ones and the others, according to the relationship between the number of characters and the number of bytes in the encoding.

In a non-multibyte encoding, one character is always expressed by one byte. On the other hand, one character may be expressed by one or more bytes in a multibyte encoding. Note that the number of bytes per character is not fixed even within a single encoding.

Examples of multibyte encodings are EUC-JP, EUC-KR, ISO 2022-JP, Shift-JIS, Big5, UHC, UTF-8, and so on. Note that all of UTF-* are multibyte.

Examples of non-multibyte encodings are ISO 8859-1, ISO 8859-2, TIS 620, VISCII, and so on.

Note that even in a non-multibyte encoding, the number of characters and the number of bytes may differ if the encoding is stateful.

Ken Lunde's "CJKV Information Processing" [1] classifies encoding methods into the following three categories:

* modal
* non-modal
* fixed-length

_Modal_ corresponds to _stateful_ in this document. The other two are _stateless_, where _non-modal_ is _multibyte_ and _fixed-length_ is _non-multibyte_. However, I think _stateful_ - _stateless_ and _multibyte_ - _non-multibyte_ are independent concepts. [2]

[1] ISBN 1-56592-224-7, O'Reilly, 1999

[2] Though there are no existing encodings which are stateful and non-multibyte.

3.4. Number of Bytes, Number of Characters, and Number of Columns
-----------------------------------------------------------------

One ASCII character is always expressed by one byte and occupies one column on the console or in X terminal emulators (with a fixed-width font for X). One must not make such an assumption in I18N programming and has to clearly distinguish the number of bytes, the number of characters, and the number of columns.

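
The three quantities can be measured separately with standard library functions. The following C sketch is only an illustration under some assumptions: the struct `text_size' and the function `measure_text' are hypothetical names, the string is assumed to be in the encoding of the current `LC_CTYPE' locale, and the POSIX function `wcwidth()' is assumed to be available.

```c
#define _XOPEN_SOURCE 700   /* assumption: needed on some systems to declare wcwidth() */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical result type: the three quantities to be kept apart. */
struct text_size {
    size_t bytes;     /* size in memory or in a file */
    size_t chars;     /* number of characters */
    int columns;      /* width on a fixed-width display */
};

/*
 * Measure a multibyte string in the encoding of the current LC_CTYPE locale.
 * Returns 0 on success and -1 if the string contains an invalid or
 * incomplete multibyte sequence.
 */
int measure_text(const char *s, struct text_size *out)
{
    mbstate_t st;
    size_t left;

    assert(s != NULL && out != NULL);
    memset(&st, 0, sizeof st);          /* start in the initial shift state */
    out->bytes = strlen(s);
    out->chars = 0;
    out->columns = 0;
    left = out->bytes;

    while (left > 0) {
        wchar_t wc;
        size_t k = mbrtowc(&wc, s, left, &st);   /* bytes used by one character */
        int w;

        if (k == (size_t)-1 || k == (size_t)-2)
            return -1;
        if (k == 0)                     /* defensive: embedded NUL */
            break;
        w = wcwidth(wc);
        if (w > 0)                      /* nonprintable characters add no columns */
            out->columns += w;
        out->chars++;
        s += k;
        left -= k;
    }
    return 0;
}
```

A program would call `setlocale(LC_ALL, "")' once at startup before using such a function. For a plain ASCII string the three numbers are equal; in a Japanese locale, one Hiragana character typically counts as two or three bytes (depending on the encoding), one character, and two columns.
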
As for the relationship between characters and bytes: in multibyte encodings, two or more bytes may be needed to express one character. In stateful encodings, escape sequences do not correspond to any characters.

The number of columns is not defined in any standard. However, CJK ideograms, Japanese Hiragana and Katakana, and Korean Hangul usually occupy two columns on the console or in X terminal emulators. Note that 'Full-width forms' in the UCS-2 and UCS-4 coded character sets occupy two columns and 'Half-width forms' occupy one column. Combining characters used for Thai and so on can be regarded as zero-column characters. Though there are no standards, you can use `wcwidth()' and `wcswidth()' for this purpose. See Section 7.1.2, `Number of Columns' for details.

-------------------------------------------------------------------------------

4. Coded Character Sets And Encodings in the World
--------------------------------------------------

Here the major coded character sets and encodings are introduced. Note that you don't have to know the details of these character codes if you use LOCALE and `wchar_t' technology.

However, this knowledge will help you to understand why the numbers of bytes, characters, and columns should be counted separately, why `strchr()' and so on should not be used, why you should use LOCALE and `wchar_t' technology instead of hard-coded processing of existing character codes, and so on.

This variety of character sets and encodings will tell you about the struggles of people around the world to handle their own languages with computers. CJK people, in particular, had no choice but to work out various technologies to use large numbers of characters within ASCII-based computer systems.

If you are planning to develop text-processing software beyond the fields which LOCALE technology covers, you will have to understand the following descriptions very well.
These fields include automatic detection of the encoding used for the input file (most Japanese-capable text viewers such as `jless' and `lv' have this mechanism) and so on.

4.1. ASCII and ISO 646
----------------------

_ASCII_ is a CCS and also an encoding at the same time. ASCII is 7bit and contains 94 printable characters which are encoded in the region of `0x21'-`0x7e'.

_ISO 646_ is the international standard of ASCII. The following 12 characters of

* 0x23 (number),
* 0x24 (dollar),
* 0x40 (at),
* 0x5b (left square bracket),
* 0x5c (backslash),
* 0x5d (right square bracket),
* 0x5e (caret),
* 0x60 (backquote),
* 0x7b (left curly brace),
* 0x7c (vertical line),
* 0x7d (right curly brace), and
* 0x7e (tilde)

are called _IRV_ (International Reference Version) and the other 82 (94 - 12 = 82) characters are called _BCT_ (Basic Code Table). Characters at IRV can be different between countries. Here are a few examples of versions of ISO 646.

* US version (ASCII)
* UK version (BS 4730): 0x23 is pound currency mark, and so on.
* Japanese version (JISX 0201 Roman): 0x5c is yen currency mark, and so on.
* Italian version (UNI 0204-70): 0x7b is 'a' with grave accent, and so on.
* French version (NF Z 62-010): 0x7b is 'e' with acute accent, and so on.

As far as I know, all encodings (besides EBCDIC) in the world are compatible with ISO 646.

Characters in 0x00 - 0x1f, 0x20, and 0x7f are control characters.

Nowadays usage of encodings incompatible with ASCII is not encouraged and thus ISO 646-* (other than the US version) should not be used. One of the reasons is that when a string is converted into Unicode, the converter doesn't know whether IRVs should be converted into characters with the same shapes or characters with the same codes. Another reason is that source code is written in ASCII. Source code must be readable anywhere.

4.2. ISO 8859
-------------

_ISO 8859_ is both a series of CCS and a series of encodings. It is an expansion of ASCII using all 8 bits.
An additional 96 printable characters encoded in 0xa0 - 0xff are available besides the 94 ASCII printable characters.

There were 10 variants of ISO 8859 in 1997; ISO-8859-11 was added in 2001.

ISO-8859-1
     Latin alphabet No.1 (1987)
     characters for western European languages

ISO-8859-2
     Latin alphabet No.2 (1987)
     characters for central European languages

ISO-8859-3
     Latin alphabet No.3 (1988)

ISO-8859-4
     Latin alphabet No.4 (1988)
     characters for northern European languages

ISO-8859-5
     Latin/Cyrillic alphabet (1988)

ISO-8859-6
     Latin/Arabic alphabet (1987)

ISO-8859-7
     Latin/Greek alphabet (1987)

ISO-8859-8
     Latin/Hebrew alphabet (1988)

ISO-8859-9
     Latin alphabet No.5 (1989)
     same as ISO-8859-1 except for Turkish instead of Icelandic

ISO-8859-10
     Latin alphabet No.6 (1993)
     Adds Inuit (Greenlandic) and Sami (Lappish) letters to ISO-8859-4

ISO-8859-11
     Latin/Thai alphabet (2001)
     same as TIS-620 Thai national standard

A detailed explanation is found at http://park.kiev.ua/mutliling/ml-docs/iso-8859.html.

4.3. ISO 2022
-------------

Using ASCII and ISO 646, we can use 94 characters at most. Using ISO 8859, the number increases to 190 (= 94 + 96). However, we may want to use many more characters. Or, we may want to use several of these character sets, not just one. One answer is ISO 2022.

_ISO 2022_ is an international standard of CES. ISO 2022 determines a few requirements for a CCS to be a member of ISO 2022-based encodings. It also defines very extensive (and complex) rules to combine these CCS into one encoding. Many encodings such as EUC-*, ISO 2022-*, compound text, [1] and so on can be regarded as subsets of ISO 2022. ISO 2022 is so complex that you may not be able to understand all of it. That is OK; what is important here is the concept of ISO 2022 of building an encoding by switching between various (ISO 2022-compliant) coded character sets.

The sixth edition of ECMA-35 is fully identical with ISO 2022:1994 and you can find the official document at http://www.ecma.ch/ecma1/stand/ECMA-035.HTM.
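
The idea of switching coded character sets inside one byte stream can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: it handles just the two escape sequences that ISO 2022-JP (RFC 1468) uses for ASCII and JISX 0208-1983, it assumes that the stream starts in ASCII, and the function name `split_runs' is made up for this example. The general rules for these escape sequences are described later in this section.

```python
# Escape sequences used by ISO 2022-JP (RFC 1468); only two are handled here.
DESIGNATIONS = {
    b"\x1b(B": "ASCII",
    b"\x1b$B": "JIS X 0208-1983",
}

def split_runs(data):
    """Split a byte string into (character set, bytes) runs by following
    the escape sequences. The initial state is assumed to be ASCII."""
    current = "ASCII"
    runs = []
    i = 0
    while i < len(data):
        sequence = data[i:i + 3]
        if sequence in DESIGNATIONS:
            current = DESIGNATIONS[sequence]  # switch the shift state
            i += 3
            continue
        width = 2 if current == "JIS X 0208-1983" else 1
        chunk = data[i:i + width]
        if runs and runs[-1][0] == current:
            runs[-1] = (current, runs[-1][1] + chunk)
        else:
            runs.append((current, chunk))
        i += width
    return runs

# '$' and ',' in ASCII, then Hiragana GA, then '$' and ',' again.
encoded = "$,\u304c$,".encode("iso2022_jp")
print(encoded)
print(split_runs(encoded))
```

The same two bytes `0x24 0x2c' appear three times in the encoded stream. Only the preceding escape sequence tells whether they are two ASCII characters or one JISX 0208 character.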
ISO 2022 has two versions, 7bit and 8bit. The 8bit version is explained first. The 7bit version is a subset of the 8bit version.

The 8bit code space is divided into four regions,

* 0x00 - 0x1f: C0 (Control Characters 0),
* 0x20 - 0x7f: GL (Graphic Characters Left),
* 0x80 - 0x9f: C1 (Control Characters 1), and
* 0xa0 - 0xff: GR (Graphic Characters Right).

GL and GR are the spaces where (printable) character sets are mapped.

Next, all character sets, for example, ASCII, ISO 646-UK, and JIS X 0208, are classified into the following four categories,

* (1) character set with 1-byte 94-character,
* (2) character set with 1-byte 96-character,
* (3) character set with multibyte 94-character, and
* (4) character set with multibyte 96-character.

Characters in 94-character sets are mapped into 0x21 - 0x7e. Characters in 96-character sets are mapped into 0x20 - 0x7f.

For example, ASCII, ISO 646-UK, and JISX 0201 Katakana are classified into (1), JISX 0208 Japanese Kanji, KSX 1001 Korean, and GB 2312-80 Chinese are classified into (3), and ISO 8859-* are classified into (2).

The mechanism to map these character sets into GL and GR is a bit complex. There are four buffers, G0, G1, G2, and G3. A character set is _designated_ into one of these buffers and then a buffer is _invoked_ into GL or GR.

Control sequences to 'designate' a character set into a buffer are determined as below.

* A sequence to designate a character set with 1-byte 94-character
  * into G0 set is: ESC 0x28 F,
  * into G1 set is: ESC 0x29 F,
  * into G2 set is: ESC 0x2a F, and
  * into G3 set is: ESC 0x2b F.
* A sequence to designate a character set with 1-byte 96-character
  * into G1 set is: ESC 0x2d F,
  * into G2 set is: ESC 0x2e F, and
  * into G3 set is: ESC 0x2f F.
* A sequence to designate a character set with multibyte 94-character
  * into G0 set is: ESC 0x24 0x28 F (exception: 'ESC 0x24 F' for F = 0x40, 0x41, 0x42.),
  * into G1 set is: ESC 0x24 0x29 F,
  * into G2 set is: ESC 0x24 0x2a F, and
  * into G3 set is: ESC 0x24 0x2b F.
* A sequence to designate a character set with multibyte 96-character
  * into G1 set is: ESC 0x24 0x2d F,
  * into G2 set is: ESC 0x24 0x2e F, and
  * into G3 set is: ESC 0x24 0x2f F.

where 'F' is determined for each character set:

* character set with 1-byte 94-character
  * F=0x40 for ISO 646 IRV: 1983
  * F=0x41 for BS 4730 (UK)
  * F=0x42 for ANSI X3.4-1968 (ASCII)
  * F=0x43 for NATS Primary Set for Finland and Sweden
  * F=0x49 for JIS X 0201 Katakana
  * F=0x4a for JIS X 0201 Roman (Latin)
  * and more
* character set with 1-byte 96-character
  * F=0x41 for ISO 8859-1 Latin-1
  * F=0x42 for ISO 8859-2 Latin-2
  * F=0x43 for ISO 8859-3 Latin-3
  * F=0x44 for ISO 8859-4 Latin-4
  * F=0x46 for ISO 8859-7 Latin/Greek
  * F=0x47 for ISO 8859-6 Latin/Arabic
  * F=0x48 for ISO 8859-8 Latin/Hebrew
  * F=0x4c for ISO 8859-5 Latin/Cyrillic
  * and more
* character set with multibyte 94-character
  * F=0x40 for JISX 0208-1978 Japanese
  * F=0x41 for GB 2312-80 Chinese
  * F=0x42 for JISX 0208-1983 Japanese
  * F=0x43 for KSC 5601 Korean
  * F=0x44 for JISX 0212-1990 Japanese
  * F=0x45 for CCITT Extended GB (ISO-IR-165)
  * F=0x47 for CNS 11643-1992 Set 1 (Taiwan)
  * F=0x48 for CNS 11643-1992 Set 2 (Taiwan)
  * F=0x49 for CNS 11643-1992 Set 3 (Taiwan)
  * F=0x4a for CNS 11643-1992 Set 4 (Taiwan)
  * F=0x4b for CNS 11643-1992 Set 5 (Taiwan)
  * F=0x4c for CNS 11643-1992 Set 6 (Taiwan)
  * F=0x4d for CNS 11643-1992 Set 7 (Taiwan)
  * and more

The complete list of these coded character sets is found at the International Register of Coded Character Sets (http://www.itscj.ipsj.or.jp/ISO-IR/).

Control codes to 'invoke' one of G{0123} into GL or GR are determined as below.
* A control code to invoke G0 into GL is: SI (Shift In), also called LS0 (Locking Shift 0)
* A control code to invoke G1 into GL is: SO (Shift Out), also called LS1 (Locking Shift 1)
* A control code to invoke G2 into GL is: LS2 (Locking Shift 2)
* A control code to invoke G3 into GL is: LS3 (Locking Shift 3)
* A control code to invoke one character in G2 into GL is: SS2 (Single Shift 2)
* A control code to invoke one character in G3 into GL is: SS3 (Single Shift 3)
* A control code to invoke G1 into GR is: LS1R (Locking Shift 1 Right)
* A control code to invoke G2 into GR is: LS2R (Locking Shift 2 Right)
* A control code to invoke G3 into GR is: LS3R (Locking Shift 3 Right) [2]

Note that a code in a character set invoked into GR is or-ed with 0x80.

ISO 2022 also determines _announcer_ codes. For example, 'ESC 0x20 0x41' means 'Only G0 buffer is used. G0 is already invoked into GL'. This simplifies the coding system. Even this announcer can be omitted if the people who exchange data agree.

The 7bit version of ISO 2022 is a subset of the 8bit version. It does not use C1 and GR.

Explanation of C0 and C1 is omitted here.

[1] Compound text is a standard for text exchange between X clients.

[2] SI is 0x0f and SO is 0x0e. In the 8bit version, SS2 is 0x8e and SS3 is 0x8f. The other locking shifts are escape sequences: LS2 is ESC 0x6e, LS3 is ESC 0x6f, LS1R is ESC 0x7e, LS2R is ESC 0x7d, and LS3R is ESC 0x7c.

4.3.1. EUC (Extended Unix Code)
-------------------------------

_EUC_ is a CES which is a subset of the 8bit version of ISO 2022 except for the usage of the SS2 and SS3 codes. Though these codes are used to invoke G2 and G3 into GL in ISO 2022, they invoke G2 and G3 into GR in EUC. _EUC-JP_, _EUC-KR_, _EUC-CN_, and _EUC-TW_ are widely used encodings which use EUC as CES.

EUC is stateless.

EUC can contain 4 CCS by using G0, G1, G2, and G3. Though there is no requirement that ASCII is designated to G0, I don't know of any EUC codeset in which ASCII is not designated to G0.

For EUC with G0-ASCII, all codes other than ASCII are encoded in 0x80 - 0xff and this is upward compatible with ASCII.

Expressions for characters in G0, G1, G2, and G3 character sets are described below in binary:

* G0: 0???????
* G1: 1???????
[1??????? [...]]
* G2: SS2 1??????? [1??????? [...]]
* G3: SS3 1??????? [1??????? [...]]

where SS2 is 0x8e and SS3 is 0x8f.

4.3.2. ISO 2022-compliant Character Sets
----------------------------------------

There are many national and international standards of coded character sets (CCS). Some of them are ISO 2022-compliant and can be used in ISO 2022 encodings.

ISO 2022-compliant CCS are classified into one of the following:

* 94 characters
* 96 characters
* 94x94x94x... characters

The most famous 94 character set is US-ASCII. Also, all ISO 646 variants are ISO 2022-compliant 94 character sets.

All ISO 8859-* character sets are ISO 2022-compliant 96 character sets.

There are many 94x94 character sets. All of them are related to CJK ideograms.

_JISX 0208_ (aka JIS C 6226)
     National standard of Japan. The 1978 version contains 6802 characters including Kanji (ideogram), Hiragana, Katakana, Latin, Greek, Cyrillic, numeric, and other symbols. The current (1997) version contains 7102 characters.

_JISX 0212_
     National standard of Japan. 6067 characters (almost all of them are Kanji). This character set is intended to be used in addition to JISX 0208.

_JISX 0213_
     Japanese national standard. Released in 2000. This includes JISX 0208 characters and thousands of additional characters. Thus, this is intended to be an extension and a replacement of JISX 0208. This has two 94x94 character sets; one of them includes JISX 0208 plus about 2000 characters and the other includes about 2400 characters. Strictly speaking, JISX 0213 is not a simple superset of JISX 0208 because a few tens of Kanji variants which are unified and share the same code points in JISX 0208 are dis-unified and have separate code points in JISX 0213. It shares many characters with JISX 0212.

_KSX 1001_ (aka KSC 5601)
     National standard of South Korea. 8224 characters including 2350 Hangul, Hanja (ideogram), Hiragana, Katakana, Latin, Greek, Cyrillic, and other symbols.
     Hanja are ordered by reading and Hanja with multiple readings are coded multiple times.

_KSX 1002_
     National standard of South Korea. 7659 characters including Hangul and Hanja. Intended to be used in addition to KSX 1001.

_KPS 9566_
     National standard of North Korea. Similar to KSX 1001.

_GB 2312_
     National standard of China. 7445 characters including 6763 Hanzi (ideogram), Latin, Greek, Cyrillic, Hiragana, Katakana, and other symbols.

_GB 7589_ (aka GB2)
     National standard of China. 7237 Hanzi. Intended to be used in addition to GB 2312.

_GB 7590_ (aka GB4)
     National standard of China. 7039 Hanzi. Intended to be used in addition to GB 2312 and GB 7589.

_GB 12345_ (aka GB/T 12345, GB1 or GBF)
     National standard of China. 7583 characters. Traditional characters version which corresponds to the GB 2312 simplified characters.

_GB 13131_ (aka GB3)
     National standard of China. Traditional characters version which corresponds to the GB 7589 simplified characters.

_GB 13132_ (aka GB5)
     National standard of China. Traditional characters version which corresponds to the GB 7590 simplified characters.

_CNS 11643_
     National standard of Taiwan. Has 7 planes. Planes 1 and 2 include all characters included in Big5. Plane 1 includes 6085 characters including Hanzi (ideogram), Latin, Greek, and other symbols. Plane 2 includes 7650. The number of characters for plane 3 is 6184, plane 4 is 7298, plane 5 is 8603, plane 6 is 6388, and plane 7 is 6539.

There is a 94x94x94 character set. This is _CCCII_. This is a national standard of Taiwan. Now 73400 characters are included. (The number is increasing.)

Non-ISO 2022-compliant character sets are introduced later in Section 4.5, `Other Character Sets and Encodings'.

4.3.3. ISO 2022-compliant Encodings
-----------------------------------

There are many ISO 2022-compliant encodings which are subsets of ISO 2022.

_Compound Text_
     This is used for X clients to communicate with each other, for example, for copy-paste.
_EUC-JP_
     An EUC encoding with ASCII, JISX 0208, JISX 0201 Kana, and JISX 0212 coded character sets. There are many systems which do not support JISX 0201 Kana and JISX 0212. Widely used in Japan for POSIX systems.

_EUC-KR_
     An EUC encoding with ASCII and KSX 1001.

_CN-GB_ (aka EUC-CN)
     An EUC encoding with ASCII and GB 2312. The most popular encoding in P. R. China. This encoding is sometimes referred to simply as 'GB'.

_EUC-TW_
     An extended EUC encoding with ASCII, CNS 11643 plane 1, and the other (2-7) planes of CNS 11643.

_ISO 2022-JP_
     Described in RFC 1468 (http://www.faqs.org/rfcs/rfc1468.html).

     ***** Not written yet *****

_ISO 2022-JP-1_ (upward compatible with ISO 2022-JP)
     Described in RFC 2237 (http://www.faqs.org/rfcs/rfc2237.html).

     ***** Not written yet *****

_ISO 2022-JP-2_ (upward compatible with ISO 2022-JP-1)
     Described in RFC 1554 (http://www.faqs.org/rfcs/rfc1554.html).

     ***** Not written yet *****

_ISO 2022-KR_
     aka Wansung. Described in RFC 1557 (http://www.faqs.org/rfcs/rfc1557.html).

     ***** Not written yet *****

_ISO 2022-CN_
     Described in RFC 1922 (http://www.faqs.org/rfcs/rfc1922.html).

     ***** Not written yet *****

Non-ISO 2022-compliant encodings are introduced later in Section 4.5, `Other Character Sets and Encodings'.

4.4. ISO 10646 and Unicode
--------------------------

ISO 10646 and Unicode are another standard intended to let us develop international software easily. The special features of this new standard are:

* A united single CCS which intends to include all characters in the world. (ISO 2022 consists of multiple CCS.)
* The character set intends to cover all conventional (or _legacy_) CCS in the world. [1]
* Compatibility with ASCII and ISO 8859-1 is considered.
* Chinese, Japanese, and Korean ideograms are united. This comes from a limitation of Unicode. This is not a merit.

ISO 10646 is an official international standard. Unicode is developed by the Unicode Consortium (http://www.unicode.org). These two are almost identical.
Indeed, these two are exactly identical at the code points which are available in both standards. Unicode is sometimes updated and the newest version is 3.0.1.

[1] This is obviously not true for CNS 11643 because CNS 11643 contains 48711 characters while Unicode 3.0.1 contains 49194 characters, only 483 more than CNS 11643.

4.4.1. UCS as a Coded Character Set
-----------------------------------

ISO 10646 defines two CCS (coded character sets), _UCS-2_ and _UCS-4_. UCS-2 is a subset of UCS-4.

UCS-4 is a 31bit CCS. These 31 bits are divided into 7, 8, 8, and 8 bits and each of them has a special term.

* The top 7 bits are called _Group_.
* Next 8 bits are called _Plane_.
* Next 8 bits are _Row_.
* The lowest 8 bits are _Cell_.

The first plane (Group = 0, Plane = 0) is called _BMP_ (Basic Multilingual Plane) and UCS-2 is the same as BMP. Thus, UCS-2 is a 16bit CCS.

Code points in UCS are often expressed as _u+`????'_, where `????' is the hexadecimal expression of the code point.

Characters in the range of u+0021 - u+007e are the same as ASCII and characters in the range of u+00a0 - u+00ff are the same as ISO 8859-1. Thus it is very easy to convert between ASCII or ISO 8859-1 and UCS.

Unicode (version 3.0.1) uses a small subset of UCS-4 (17 planes, a little more than 20 bits) as a CCS. [1]

The unique feature of these CCS compared with other CCS is the _open repertoire_. They are still being developed even after they are released. Characters will be added in the future. However, already coded characters will not be changed. Unicode version 3.0.1 includes 49194 distinct coded characters.

[1] Strictly speaking, u+000000 - u+10ffff.

4.4.2. UTF as Character Encoding Schemes
----------------------------------------

A few CES are used to construct encodings which use UCS as a CCS. They are _UTF-7_, _UTF-8_, _UTF-16_, _UTF-16LE_, and _UTF-16BE_. UTF means Unicode (or UCS) Transformation Format. Since these CES always take UCS as the only CCS, they are also names for encodings. [1]

[1] Compare UTF and EUC.
There are a few variants of EUC whose CCS are different (EUC-JP, EUC-KR, and so on). This is why we cannot call EUC an encoding. In other words, the name 'EUC' alone cannot specify an encoding. On the other hand, 'UTF-8' is the name for a specific concrete encoding.

4.4.2.1. UTF-8
--------------

UTF-8 is an encoding whose CCS is UCS-4. UTF-8 is designed to be upward-compatible with ASCII. UTF-8 is multibyte and the number of bytes needed to express one character is from 1 to 6.

Conversion from UCS-4 to UTF-8 is performed using a simple conversion rule.

     UCS-4 (binary)                       UTF-8 (binary)
     00000000 00000000 00000000 0???????  0???????
     00000000 00000000 00000??? ????????  110????? 10??????
     00000000 00000000 ???????? ????????  1110???? 10?????? 10??????
     00000000 000????? ???????? ????????  11110??? 10?????? 10?????? 10??????
     000000?? ???????? ???????? ????????  111110?? 10?????? 10?????? 10?????? 10??????
     0??????? ???????? ???????? ????????  1111110? 10?????? 10?????? 10?????? 10?????? 10??????

Note that the shortest representation must be used, though a longer representation could also express smaller UCS values.

UTF-8 seems to be one of the major candidates for standard codesets in the future. For example, the Linux console and xterm support UTF-8. The Debian package `locales' (version 2.1.97-1) contains the `ko_KR.UTF-8' locale. I think the number of UTF-8 locales will increase.

4.4.2.2. UTF-16
---------------

UTF-16 is an encoding whose CCS is 20bit Unicode.

Characters in BMP are expressed using the 16bit value of the code point in the Unicode CCS. There are two ways to express a 16bit value in an 8bit stream. Some of you may have heard the word _endian_. _Big endian_ means an arrangement of the octets which make up a datum of many bits from the most significant octet to the least significant one. _Little endian_ is the opposite. For example, the 16bit value `0x1234' is expressed as `0x12 0x34' in big endian and `0x34 0x12' in little endian.

UTF-16 supports both endians.
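
The two octet orders can be checked with a short Python sketch. The helper name `octets' is made up for this illustration; `utf-16-be' and `utf-16-le' are the names Python's standard codecs use for UTF-16 with a fixed byte order.

```python
def octets(value, byteorder):
    """Return the two octets of a 16bit value; byteorder is "big" or "little"."""
    return value.to_bytes(2, byteorder)

code_point = 0x1234
big = octets(code_point, "big")
little = octets(code_point, "little")
print(big.hex(), little.hex())

# The codecs with an explicit byte order produce the same octets
# for the character whose code point is 0x1234.
print(chr(code_point).encode("utf-16-be") == big)
print(chr(code_point).encode("utf-16-le") == little)
```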
Thus, the Unicode character `u+1234' can be expressed either as `0x12 0x34' or as `0x34 0x12'. In exchange, UTF-16 texts have to have a _BOM (Byte Order Mark)_ at their beginning. The Unicode character `u+feff' zero width no-break space is called BOM when it is used to indicate the byte order or endian of texts. The mechanism is easy: in big endian, `u+feff' will be `0xfe 0xff' while it will be `0xff 0xfe' in little endian. Thus you can know the endian of the text by reading the first two bytes. [1]

Characters not included in BMP are expressed using a _surrogate pair_. Code points of `u+d800' - `u+dfff' are reserved for this purpose. First, 0x10000 is subtracted from the Unicode code point, which leaves a 20bit value. These 20 bits are divided into two sets of 10 bits. The upper 10 bits are mapped to the 10bit space of `u+d800' - `u+dbff'. The lower 10 bits are mapped to the 10bit space of `u+dc00' - `u+dfff'. Thus UTF-16 can express 20bit Unicode characters.

[1] I heard that BOM is merely a suggestion by a vendor. Read Markus Kuhn's UTF-8 and Unicode FAQ for Unix/Linux (http://www.cl.cam.ac.uk/~mgk25/unicode.html) for detail.

4.4.2.3. UTF-16BE and UTF-16LE
------------------------------

UTF-16BE and UTF-16LE are variants of UTF-16 which are limited to big and little endian, respectively.

4.4.2.4. UTF-7
--------------

UTF-7 is designed so that Unicode can be communicated using a 7bit communication path.

***** Not written yet *****

4.4.2.5. UCS-2 and UCS-4 as encodings
-------------------------------------

Though I introduced UCS-2 and UCS-4 as CCS, they can also be encodings.

In the UCS-2 encoding, each UCS-2 character is expressed in two bytes. In the UCS-4 encoding, each UCS-4 character is expressed in four bytes.

4.4.3. Problems on Unicode
--------------------------

No standard is free from politics and compromise. Though the concept of a united single CCS for all characters in the world is very nice, Unicode had to consider compatibility with preceding international and local standards.
Moreover, unlike the ideal concept, Unicode people considered efficiency too much. IMHO, the surrogate pair is a mess caused by the lack of 16bit code space. I will introduce a few problems on Unicode.

4.4.3.1. Han Unification
------------------------

This is the point on which Unicode is criticized most strongly among many Japanese people.

A region of 0x4e00 - 0x9fff in UCS-2 is used for Eastern-Asian ideographs (Japanese Kanji, Chinese Hanzi, and Korean Hanja). There are similar characters in these four character sets. (There are two sets of Chinese characters, simplified Chinese used in P. R. China and traditional Chinese used in Taiwan.) To reduce the number of these ideograms to be encoded (the region for these characters can contain only 20992 characters while the Taiwan CNS 11643 standard alone contains 48711 characters), these similar characters are assumed to be the same. This is Han Unification.

However, these characters are not exactly the same. If fonts for these characters are made from Chinese ones, Japanese people will regard them as wrong characters, though they may be able to read them. Unicode people think these united characters are the same character with different glyphs.

An example of Han Unification is available at U+9AA8 (http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=9AA8). This is a Kanji character for 'bone'. U+8FCE (http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=8FCE) is another example, a Kanji character for 'welcome'. The part from the left side to the bottom side is the 'run' radical. The 'run' radical is used for many Kanji and all of them have the same problem. U+76F4 (http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=76F4) is another example, a Kanji character for 'straight'. I, a native Japanese speaker, cannot recognize the Chinese version at all.

Unicode font vendors will hesitate over which glyphs to choose for these characters: the simplified Chinese ones, the traditional Chinese ones, the Japanese ones, or the Korean ones.
One method is to supply four fonts: a simplified Chinese version, a traditional Chinese version, a Japanese version, and a Korean version. A commercial OS vendor can release localized versions of their OS --- for example, the Japanese version of MS Windows can include a Japanese version of the Unicode font (this is exactly what they are doing). However, what should XFree86 or Debian do? I don't know... [1] [2]

[1] XFree86 4.0 includes Japanese and Korean versions of ISO 10646-1 fonts.

[2] I heard that Chinese and Korean people don't mind the glyphs of these characters. If this is always true, Japanese glyphs should be the default glyphs for these problematic characters on international systems such as Debian.

4.4.3.2. Cross Mapping Tables
-----------------------------

Unicode intends to be a superset of all major encodings in the world, such as ISO-8859-*, EUC-*, KOI8-*, and so on. The aim of this is to keep round-trip compatibility and to enable smooth migration from other encodings to Unicode.

Only providing a superset is not sufficient. Reliable cross mapping tables between Unicode and other encodings are needed. They are provided by the Unicode Consortium (http://www.unicode.org/Public/MAPPINGS/).

However, tables for East Asian encodings are not provided. They were provided but now are obsolete (http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/).

You may want to use these mapping tables even though they are obsolete, because there are no other mapping tables available. However, you will find a severe problem with these tables. There are multiple different mapping tables for Japanese encodings which include the JIS X 0208 character set. Thus, one and the same character in JIS X 0208 will be mapped into different Unicode characters according to these mapping tables. For example, Microsoft and Sun use different tables, which results in Java on MS Windows sometimes breaking Japanese characters.
Though we Open Source people should respect interoperability, we cannot achieve sufficient interoperability because of this problem. All that we can achieve is interoperability between Open Source software.

GNU libc uses JIS/JIS0208.TXT (http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0208.TXT) with a small modification. The modification is that

* original JIS0208.TXT: 0x815F 0x2140 0x005C # REVERSE SOLIDUS
* modified: 0x815F 0x2140 0xFF3C # FULLWIDTH REVERSE SOLIDUS

The reason for this modification is that the JIS X 0208 character set is almost always used in combination with ASCII in the form of EUC-JP and so on. ASCII 0x5c, not JIS X 0208 0x2140, should be mapped into U+005C. This modified table is found at `/usr/share/i18n/charmaps/EUC-JP.gz' in a Debian system. Of course this mapping table is NOT authorized nor reliable.

I hope the Unicode Consortium will release an authorized, reliable, unique mapping table between Unicode and JIS X 0208. You can read the detail of this problem (http://www.debian.or.jp/~kubota/unicode-symbols.html).

4.4.3.3. Combining Characters
-----------------------------

Unicode has a way to synthesize an accented character by combining an accent symbol and a base character. For example, combining 'a' and '~' makes 'a' with tilde. Two or more accent symbols can be added to a base character.

Languages such as Thai need combining characters. Combining characters are the only method to express characters in these languages.

However, a few problems arise.

Duplicate Encoding
     There are multiple ways to express the same character. For example, u with umlaut can be expressed as `u+00fc' and also as `u+0075' + `u+0308'. How can we implement 'grep' and so on?

Open Repertoire
     The number of expressible characters grows without limit. Non-existing characters can be expressed.

4.4.3.4. Surrogate Pair
-----------------------

The first version of Unicode had only a 16bit code space, though 16bit is obviously insufficient to contain all characters in the world.
[1] Thus the surrogate pair was introduced in Unicode 2.0, to expand the number of characters while keeping compatibility with the former 16bit Unicode.

However, the surrogate pair breaks the principle that all characters are expressed with the same width of bits. This makes Unicode programming more difficult.

Fortunately, Debian and other UNIX-like systems will use UTF-8 (not UTF-16) as the usual encoding for UCS. Thus, we don't need to handle UTF-16 and surrogate pairs very often.

[1] There are a few projects such as Mojikyo (http://www.mojikyo.gr.jp/) (about 90000 characters), TRON project (http://www.tron.org/index-e.html) (about 130000 characters), and so on to develop a CCS which contains sufficient characters for professional usage in the CJK world.

4.4.3.5. ISO 646-* Problem
--------------------------

You will need a codeset converter between your local encodings (for example, ISO 8859-* or ISO 2022-*) and Unicode. For example, the Shift-JIS encoding [1] consists of JISX 0201 Roman (the Japanese version of ISO 646), not ASCII, which encodes the yen currency mark at `0x5c' where backslash is encoded in ASCII.

Then which should your converter convert `0x5c' in Shift-JIS into in Unicode, `u+005c' (backslash) or `u+00a5' (yen currency mark)? You may say the yen currency mark is the right solution. However, backslash (and then the yen mark) is widely used as an escape character. For example, 'new line' is expressed as 'backslash - `n'' in a C string literal and Japanese people use 'yen currency mark - `n''. You may say that program sources must be written in ASCII and that the mistake is trying to convert program source at all. However, there is a lot of source code and so on written in the Shift-JIS encoding.

Now Windows has come to support Unicode and the glyph at `u+005c' in the Japanese version of Windows is the yen currency mark. As you know, backslash (yen currency mark in Japan) is vitally important for Windows, because it is used to separate directory names.
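
The choice a converter has to make can be sketched in Python as follows. The function `decode_roman' and its `keep' parameter are hypothetical names made up for this illustration, and only the one-byte JISX 0201 Roman part is handled. Besides the yen mark at `0x5c', JISX 0201 Roman also has an overline at `0x7e' where ASCII has a tilde.

```python
# Code points where JISX 0201 Roman differs from ASCII.
JISX0201_ROMAN_DIFFERENCES = {0x5C: "\u00a5", 0x7E: "\u203e"}  # yen sign, overline

def decode_roman(data, keep="shape"):
    """Decode one-byte JISX 0201 Roman text (hypothetical helper).

    keep="shape": map to the Unicode character with the same shape (yen sign).
    keep="code":  map to the same code value as in ASCII (backslash).
    """
    if keep not in ("shape", "code"):
        raise ValueError("keep must be 'shape' or 'code'")
    result = []
    for byte in data:
        if byte > 0x7F:
            raise ValueError("not a 7bit byte: 0x%02x" % byte)
        if keep == "shape" and byte in JISX0201_ROMAN_DIFFERENCES:
            result.append(JISX0201_ROMAN_DIFFERENCES[byte])
        else:
            result.append(chr(byte))
    return "".join(result)

source = b'printf("hello\\n");'  # the byte 0x5c followed by 'n'
print(decode_roman(source, keep="shape"))
print(decode_roman(source, keep="code"))
```

With keep="shape" the C escape sequence no longer contains a backslash after conversion, and with keep="code" a price written with the yen mark turns into a backslash. Neither choice is right for all texts.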
Fortunately, EUC-JP, which is widely used for UNIX in Japan, includes ASCII, not the Japanese version of ISO 646. So this is not a problem there because it is clear that `0x5c' is backslash.

Thus no local codeset should use character sets incompatible with ASCII, such as ISO 646-*.

Problems and Solutions for Unicode and User/Vendor Defined Characters (http://www.opengroup.or.jp/jvc/cde/ucs-conv-e.html) discusses this problem.

[1] The standard encoding for Macintosh and MS Windows.

4.5. Other Character Sets and Encodings
---------------------------------------

Besides the ISO 2022-compliant coded character sets and encodings described in Section 4.3.2, `ISO 2022-compliant Character Sets' and Section 4.3.3, `ISO 2022-compliant Encodings', there are many popular encodings which cannot be classified under an international standard (i.e., they are neither ISO 2022-compliant nor Unicode). Internationalized software should support these encodings (again, you don't need to be aware of encodings if you use LOCALE and `wchar_t' technology). Some organizations are developing systems which go farther than the limitations of the current international standards, though these systems may not be very widespread so far.

4.5.1. Big5
-----------

_Big5_ is a de-facto standard encoding for Taiwan (1984) and is upward-compatible with ASCII. It is also a CCS.

In Big5, `0x21' - `0x7e' means ASCII characters. `0xa1' - `0xfe' makes a pair with the following byte (`0x40' - `0x7e' and `0xa1' - `0xfe') and means an ideogram and so on (13461 characters).

Though Taiwan has a new ISO 2022-compliant standard, CNS 11643, Big5 seems to be more popular than CNS 11643. (CNS 11643 is a CCS and there are a few ISO 2022-derived encodings which include CNS 11643.)

4.5.2. UHC
----------

_UHC_ is an encoding which is upward-compatible with _EUC-KR_.
Two-byte characters (the first byte: `0x81' - `0xfe'; the second byte: `0x41' - `0x5a', `0x61' - `0x7a', and `0x81' - `0xfe') include KSX 1001 and other Hangul so that UHC can express all 11172 Hangul.

4.5.3. Johab
------------

_Johab_ is an encoding whose character set is identical with that of _UHC_, i.e., ASCII, KSX 1001, and all other Hangul characters. Johab means combination in Korean. In Johab, the code point of a Hangul can be calculated from the combination of Hangul parts (Jamo).

4.5.4. HZ, aka HZ-GB-2312
-------------------------

_HZ_ is an encoding described in RFC 1842 (http://www.faqs.org/rfcs/rfc1842.html). The CCS (coded character sets) of HZ are ASCII and GB2312. This is a 7bit encoding.

Note that HZ is _not_ upward-compatible with ASCII, since '`~{'' means GB2312 mode, '`~}'' means ASCII mode, and '`~~'' means ASCII '~'.

4.5.5. GBK
----------

_GBK_ is an encoding which is upward-compatible with CN-GB. GBK covers ASCII, GB2312, other Unicode 1.0 ideograms, and a bit more. The range of two-byte characters in GBK is: `0x81' - `0xfe' for the first byte and `0x40' - `0x7e' and `0x80' - `0xfe' for the second byte. 21886 code points out of 23940 in the two-byte region are defined.

GBK is one of the popular encodings in P. R. China.

4.5.6. GB18030
--------------

_GB 18030_ is an encoding which is upward-compatible with GBK and CN-GB. It is a recent national standard (released on 17 March 2000) of China. It adds four-byte characters to GBK. Their range is: `0x81' - `0xfe' for the first byte, `0x30' - `0x39' for the second byte, `0x81' - `0xfe' for the third byte, and `0x30' - `0x39' for the fourth byte.

It includes all characters of Unicode 3.0's Unihan Extension A. Moreover, GB 18030 supplies code space for all used and unused code points of Unicode's plane 0 (BMP) and the 16 additional planes.

A detailed explanation of GB18030 (ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf) is available.

4.5.7.
GCCS ----------- _GCCS_ is a standard of coded character set by Hong Kong (HKSAR: Hong Kong Special Administrative Region). It includes 3049 characters. It is an abbreviation of Government Common Character Set. It is defined as an _additional character set for Big5_. Characters in GCCS are coded in User-Defined Area (just like Private Use Area for UCS) in Big5. 4.5.8. HKSCS ------------ _HKSCS_ is an expansion and amendment of GCCS. It includes 4702 characters. It means Hong Kong Supplementary Character Set. In addition to a usage in User-Defined Area in Big5, HKSCS defines a usage in Private Use Area in Unicode. 4.5.9. Shift-JIS ---------------- _Shift-JIS_ is one of popular encodings in Japan. Its CCS are JISX 0201 Roman, JISX 0201 Kana, and JISX 0208. JISX 0201 Roman is Japanese version of ISO 646. It defines yen currency mark for `0x5c', where ASCII has backslash. `0xa1' - `0xdf' is one-byte character and is JISX 0201 Kana. Two-byte character (the first byte: `0x81' - `0x9f' and `0xe0' - `0xef'; the second byte: `0x40' - `0x7e' and `0x80' - `0xfc') is JISX 0208. Japanese version of MS DOS, MS Windows and Macintosh use this encoding, though this encoding is not often used in POSIX systems. 4.5.10. VISCII -------------- Vietnamese language uses 186 characters (Latin alphabets with accents) and other symbols. It is a bit more than the limit of ISO 8859-like encoding. _VISCII_ is a standard for Vietnamese. It is upward-compatible with ASCII. It is 8bit and stateless, like ISO 8859 series. However, it uses code points of not only `0x21' - `0x7e' and `0xa0' - `0xff' but also `0x02', `0x05', `0x06', `0x14', `0x19', `0x1e', and `0x80' - `0x9f'. This makes VISCII not-ISO 2022-compliant. Vietnam has a new, ISO 2022-compliant character set _TCVN 5712 VN2_ (aka _VSCII_). In TCVN 5712 VN2, accented characters are expressed as a combined character. Note that some of accented characters have their own code points. 4.5.11. 
TRON ------------ TRON (http://www.tron.org/index-e.html) is a project to develop a new operating system, founded as a collaboration of industries and academics in Japan since 1984. The most diffused version of TRON operating system families is ITRON, a real-time OS for embedded systems. However, our interest is not on ITRON now. TRON determines a TRON encoding. TRON's encoding is stateful. Each state is assigned to each language. It has already defined about 130000 characters (January 2000). 4.5.12. Mojikyo --------------- Mojikyo (http://www.mojikyo.gr.jp/) is a project to develop an environment by which a user can use many characters in the world. Mojikyo project has released an application software for MS Windows to display and input about 90000 characters. You can download the software and TrueType, TeX, and CID fonts, though they are not DFSG-free. ------------------------------------------------------------------------------- 5. Characters in Each Country ----------------------------- This chapter describes a specific information for each language. If you are developing a serious DTP software or planning to support detailed I18N, this chapter may help you. Contributions from people speaking each language are welcome. If you are to write a section on your language, please include these points: 1. kinds and number of characters used in the language, 2. explanation on coded character set(s) which is (are) standardized, 3. explanation on encoding(s) which is (are) standardized, 4. usage and popularity for each encoding, 5. de-facto standard, if any, on how many columns characters occupy, 6. writing direction and combined characters, 7. how to layout characters (word wrapping and so on), 8. widely used value for `LANG' environmental variable, 9. the way to input characters from keyboard and whether you want to input yes/no (and so on) in your language or in English, 10. 
a set of information needed for beautiful displaying, for example, where to break a line, hyphenation, word wrapping, and so on, and 11. other topics. Writers whose languages are written in different direction from European languages or needs a combined characters (I heard that is used in Thai) are encouraged to explain how to treat such languages. 5.1. Japanese language / used in Japan -------------------------------------- This section is the text written by Tomohiro KUBOTA . Japanese is the only official language used in Japan. People in Okinawa islands and Ainu ethnic group in Hokkaido region have each language, though they are used among few number of people and they don't have own letters. Japan is the only region where Japanese language is widely used. 5.1.1. Characters used in Japanese ---------------------------------- There are three kinds of characters used in Japan, Hiragana, Katakana, and Kanji. Arabic numerical characters (same as European languages) are widely used in Japanese, though we have Kanji numerical characters. Though Latin alphabets are not a part of Japanese characters, they are widely used for proper nouns for companies and so on. Hiragana and Katakana are phonogram derived from Kanji. Hiragana and Katakana characters have one-to-one correspondence each other like upper and lower case of Latin alphabets. However, `toupper()' and `tolower()' should not convert Hiragana and Katakana each other. Hiragana contains about 100 characters and of course Katakana does. (FYI: about 50 regular characters, 20 characters with voiced consonant symbol, 5 characters with semi-voiced consonant symbol, and 9 small characters.) Kanji is ideogram imported from China roughly about 1 - 2 thousands years ago. Nobody knows the whole number of Kanji and almost all of adult Japanese people know several thousands of Kanji characters. Though the origin of Kanji is Chinese character, shapes are changed from original ancient Chinese Kanji. 
Almost all Kanji have several readings, depending on the word in which the Kanji appears. 5.1.2. Character Sets --------------------- JIS (Japanese Industrial Standards) is the body of national standards that defines the coded character sets (CCS) and encodings used in Japan. The major coded character sets in Japan are: * JIS X 0201-1976 Roman characters (almost the same as ASCII, but 0x5c is the Yen mark instead of backslash and 0x7e is an upper bar instead of tilde), * JIS X 0201-1976 Kana (about 60 KATAKANA characters), * JIS X 0208-1997 1st and 2nd levels (about 7000 characters including symbols, numeric characters, Latin, Cyrillic and Greek alphabets, Japanese HIRAGANA, KATAKANA, and KANJI), * JIS X 0212 (about 6000 characters including KANJI, which are not included in JIS X 0208), and * JIS X 0213:2000 (aka JIS 3rd and 4th levels). _JIS X 0201 Roman_ is the Japanese version of ISO 646. Though JIS X 0201 is included in the SHIFT-JIS encoding (explained later) and widely used for Windows/Macintosh, its usage is not encouraged in UNIX. _JIS X 0201 Kana_ defines about 60 KATAKANA characters. This was widely used by old 8bit computers. Indeed, the SHIFT-JIS encoding was designed to be upward-compatible with the 8-bit encoding of JISX 0201 Roman and JISX 0201 Kana. Note that this CCS is not included in the ISO 2022-JP encoding which is used for e-mail and so on. _JIS X 0212_ is not widely used, probably because it cannot be included in SHIFT-JIS, the standard encoding for the Japanese versions of Windows and Macintosh. Moreover, this CCS may become obsolete when JIS X 0213 becomes popular, since JIS X 0213 has many characters which are included in JIS X 0212. However, the advantage of JIS X 0212 over JIS X 0213 is that all characters in JIS X 0212 are included in the current Unicode (version 3.0.1) while not all characters in JIS X 0213 are. _JIS X 0208_ (aka JIS C 6226) is the main standard for Japanese characters. Strictly speaking, it was originally defined in 1978 and revised in 1983, 1990, and 1997.
Though the 1997 version has 77 more characters than the original 1978 version and the shapes of more than 200 characters have changed, almost no software has to care about the difference between them. However, note that the ISO-2022-JP encoding (explained below) contains both JIS X 0208-1978 and JIS X 0208-1983. The 1978 version is called 'old JIS' and the later versions are called 'new JIS'. Characters in JIS X 0208 are divided into two levels, 1st and 2nd. Old 8bit computers rarely implemented the 2nd level. Usage of the numeric characters and Latin alphabets in JIS X 0208 is not encouraged because these characters are also included in ASCII and JIS X 0201 Roman, one of which is included in every encoding. When converting into Unicode, these characters are mapped into 'fullwidth forms'. All of these coded character sets (except for JIS X 0213) are included in Unicode 3.0.1. A part of the JIS X 0213 characters are not included in Unicode 3.0.1. There are a few different tables for conversion between non-letter characters in JIS X 0208 and Unicode. This is a problem because it may break 'round-trip compatibility'. Problems and Solutions for Unicode and User/Vendor Defined Characters (http://www.opengroup.or.jp/jvc/cde/ucs-conv-e.html) discusses this problem in detail. 5.1.3. Encodings ---------------- There are three popular encodings widely used in Japan. * ISO-2022-JP (aka JIS code or JUNET code) * stateful * subset of 7bit version of ISO-2022, where ASCII, JIS X 0201-1976 Roman, JIS X 0208-1978, and JIS X 0208-1983 are supported. * 7bit, which means the most significant bit (MSB) of each byte is always zero. * used for e-mail and net-news and preferred for HTML. * Defined in RFC 1468. * EUC-JP (Japanese version of Extended UNIX Code) * stateless * an implementation of EUC where G0, G1, G2, and G3 are ASCII, JIS X 0208, JIS X 0201 Kana, and JIS X 0212 respectively. There are many implementations which cannot use JIS X 0201 Kana and JIS X 0212. * 8bit * preferred encoding for UNIX.
For example, almost all Japanese message catalogs for gettext are written in EUC-JP. * Japanese code is mapped in `0xa0' - `0xff'. This is important for programmers because one doesn't need to worry about fake '\' or '/' bytes (which can be treated in a special way in various contexts) inside the Japanese code. * SHIFT-JIS (aka Microsoft Kanji Code) * stateless * NOT subset of ISO-2022 * 8bit * JIS X 0201 Roman, JIS X 0201 Kana, and JIS X 0208 can be expressed, but JIS X 0212 cannot. * The standard encoding for Windows/Macintosh. This makes SHIFT-JIS the most popular encoding in Japan. Though MS is thinking about a transition to UNICODE, it is doubtful that it can be done successfully. _ISO-2022-JP_ is a subset of the 7bit version of ISO 2022, where only G0 is used and G0 is assumed to be invoked into GL. Character sets included in ISO-2022-JP are: * ASCII (ESC 0x28 0x42), * JIS X 0201-1976 Roman (ESC 0x28 0x4a), * JIS X 0208-1978 (old JIS) (ESC 0x24 0x40), and * JIS X 0208-1983 (new JIS) (ESC 0x24 0x42). Note that JIS X 0208-1978 and JIS X 0208-1983 are almost identical and ASCII and JIS X 0201-1976 Roman are also almost identical. A line (the stream of bytes between 'newline' control codes) must start in the ASCII state and end in the ASCII state. See Section 4.3, `ISO 2022' for detail. _ISO-2022-JP-2_ (RFC 1554) is a subset of the 7bit version of ISO 2022 and a superset of ISO-2022-JP. The difference between ISO-2022-JP and ISO-2022-JP-2 is that ISO-2022-JP-2 has more coded character sets than ISO-2022-JP. Character sets included in ISO-2022-JP-2 are: * ASCII (ESC 0x28 0x42), * JIS X 0201-1976 Roman (ESC 0x28 0x4a), * JIS X 0208-1978 (old JIS) (ESC 0x24 0x40), * JIS X 0208-1983 (new JIS) (ESC 0x24 0x42), * GB2312-80 (simplified Chinese) (ESC 0x24 0x41), * KS C 5601 (Korean) (ESC 0x24 0x28 0x43), * JIS X 0212-1990 (ESC 0x24 0x28 0x44), * ISO 8859-1 (Latin-1) (ESC 0x2e 0x41), and * ISO 8859-7 (Greek) (ESC 0x2e 0x46). Though JIS X 0212-1990 may sometimes be used, ISO-2022-JP-2 is rarely used.
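The stateful nature of ISO-2022-JP can be illustrated with a small sketch. The following Python fragment is only an illustration, not part of any library: the function name `split_iso2022jp` and the table names are hypothetical, and only the four escape sequences listed above for ISO-2022-JP are recognized. It walks one line, starts in the ASCII state, switches state at each escape sequence, and reads two bytes per character while a JIS X 0208 set is designated.

```python
# Escape sequences of ISO-2022-JP as listed in the text above.
ESCAPES = {
    b"\x1b\x28\x42": "ASCII",
    b"\x1b\x28\x4a": "JIS X 0201-1976 Roman",
    b"\x1b\x24\x40": "JIS X 0208-1978",
    b"\x1b\x24\x42": "JIS X 0208-1983",
}
# Character sets whose characters occupy two bytes.
DOUBLE_BYTE = {"JIS X 0208-1978", "JIS X 0208-1983"}


def split_iso2022jp(line: bytes):
    """Split one line into (charset, bytes) segments.

    Returns the list of segments and the state at the end of the line.
    Hypothetical helper for illustration only.
    """
    state = "ASCII"  # a line must start in the ASCII state
    segments = []
    buf = bytearray()
    i = 0
    while i < len(line):
        if line[i] == 0x1B:
            seq = line[i:i + 3]
            if seq not in ESCAPES:
                raise ValueError(f"unsupported escape sequence at offset {i}")
            if buf:
                segments.append((state, bytes(buf)))
                buf = bytearray()
            state = ESCAPES[seq]
            i += 3
        else:
            step = 2 if state in DOUBLE_BYTE else 1
            chunk = line[i:i + step]
            if len(chunk) < step:
                raise ValueError("truncated double-byte character")
            buf += chunk
            i += step
    if buf:
        segments.append((state, bytes(buf)))
    return segments, state


# "abc", one JIS X 0208-1983 character (0x30 0x21), then "xyz".
sample = b"abc" + b"\x1b\x24\x42" + b"\x30\x21" + b"\x1b\x28\x42" + b"xyz"
segments, final_state = split_iso2022jp(sample)
for charset, data in segments:
    print(charset, data.hex())
print("line ends in ASCII state:", final_state == "ASCII")
```

A real program would normally not parse the escape sequences by hand; the LOCALE and `wchar_t' technology or a conversion library takes care of them. The sketch only shows why the meaning of a byte such as `0x30' depends on the escape sequences seen before it.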
_ISO-2022-INT-1_ is a superset of ISO-2022-JP-2 which has CNS 11643-1986-1 and CNS 11643-1986-2 (traditional Chinese). _EUC-JP_ is a version of EUC, where G0 is ASCII, G1 is JIS X 0208, G2 is JIS X 0201 Kana, and G3 is JIS X 0212. G2 and G3 are sometimes not implemented. This is the most popular encoding for Linux/Unix. See Section 4.3.1, `EUC (Extended Unix Code)' for detail. _SHIFT-JIS_ is designed to be a superset of the encodings for old 8bit computers, which include JIS X 0201 Roman and JIS X 0201 Kana. `0x20' - `0x7f' is JIS X 0201 Roman and `0xa0' - `0xdf' is JIS X 0201 Kana. `0x80' - `0x9f' and `0xe0' - `0xff' are the first byte of doublebyte characters. The second byte is `0x40' - `0x7e' and `0x80' - `0xfc'. This code space is used for JIS X 0208. UNICODE is not popular in Japan at all, probably because conversion from these codes into Unicode is a bit difficult. However, MS Windows uses Unicode in limited fields, for example, as the internal code for file names. I guess more and more software will come to support Unicode in the future. You can convert files written in these encodings into one another using the `nkf' or `kcc' package. Using the options `-j', `-s', and `-e', `nkf' converts a file into ISO-2022-JP (aka JIS), SHIFT-JIS (aka MS-KANJI), and EUC-JP, respectively. Note that the difference between JIS X 0201 Roman and ASCII is ignored. Though `nkf' can guess the encoding of the input file, you can also specify the encoding by a command option. This is because there is no algorithm to completely distinguish EUC-JP and SHIFT-JIS, though `nkf' usually guesses correctly. `tcs' can also convert these encodings, though without guessing the input encoding. Conversion between these encodings can be done with a simple algorithm since all of them are based on the same character sets. You need a table for code conversion between these encodings and Unicode. 5.1.4.
How These Encodings Are Used --- Information for Programmers ------------------------------------------------------------------- Since EUC-JP is widely used for UNIX, EUC-JP should be supported. Exceptions are shown below. Of course, hard-coding knowledge of EUC-JP is not encouraged. If you can implement your software without that knowledge by use of `wchar_t' and so on, you should do so. * The body of mail and news messages must be written in ISO-2022-JP. * The de-facto standard of ICQ is SHIFT-JIS. * WWW browsers must recognize all encodings. * Software which communicates with Windows/Macintosh should use SHIFT-JIS. * SHIFT-JIS is widely used for BBS. (BBS is a service like Compuserve). * File names for Joliet-format CD-ROMs used for Windows are written in Unicode. 5.1.5. Columns -------------- In consoles which are able to display Japanese characters (kon, jfbterm, kterm, krxvt, and so on), characters in JIS X 0201 (Roman and Kana) occupy 1 column and characters in JIS X 0208, JIS X 0212, and JIS X 0213 occupy 2 columns. 5.1.6. Writing Direction and Combined Characters ------------------------------------------------ Japanese can be written in the vertical direction. A line goes downward and the row of lines goes from right to left. This direction is the traditional style. For example, most Japanese books, magazines and newspapers, except for those in the field of natural science (or ones containing many Latin words or equations), are written in the vertical direction. Thus a word processor is strongly recommended to support this direction. DTP systems which don't support this direction are almost useless. Japanese can also be written in the same direction as Latin languages. Japanese books and magazines on science and technology are written in this direction. It is enough for most ordinary software to support this direction only. A few Japanese characters have to have different fonts for vertical direction.
They are reasonable characters --- parentheses and 'long syllable' symbol whose shape is like dash in English or mathematical 'minus' sign. Symbols equivalent to period and comma also have different style for horizontal and vertical direction. In Japan, Arabic numerical characters are widely used, like European languages, though we have Kanji (ideogram) numerical characters. Latin characters can also appear in Japanese texts. If a row of 1 - 3 (or 4) characters of Arabic and Latin appear in Japanese vertical text, these characters can be crowded into one column. If more characters appear (large numbers or long words), the paper is rotated 90 degree in anticlockwise and the characters are written in European way. Sometimes Latin characters which appears in vertical text are written in the same way as Japanese character, i.e., vertical direction. This is not so strong custom. Arabic and Latin characters can always be written in both normal and rotated way in vertical text. [1] DTP system should support all of them. A version of Japanized TeX (developed by ASCII, a publishing company in Japan) can use vertical direction. This can also treat a page containing both vertical and horizontal texts. [1] I HAVE TO SHOW EXAMPLE USING GRAPHICS. 5.1.7. Layout of Characters --------------------------- In Japanese language, words are not separated by space and a line can be broken anywhere, with a few exceptions, unlike European languages. Thus hyphenation is not needed for Japanese. Characters like open parentheses cannot come to the end of a line. Characters like close parentheses and sorts of sentence-separating marks such as period and comma cannot come to the top of a line. This rule and processing is called 'kinsoku' in Japanese. In European languages, a break of line is equivalent to a space. In Japanese language, a break of line should be neglected. 
For example, when rendering an HTML file, a line-breaking character in the HTML source should not be converted into whitespace. 5.1.8. LANG variable -------------------- Different values of `LANG' are used for different encodings. The following values are used for EUC-JP. * `LANG=ja_JP.eucJP' (major for Linux and *BSD) * `LANG=ja_JP.ujis' (used to be major for Linux) * `LANG=ja_JP' (because EUC-JP is the de-facto standard for UNIX; not recommended) * `LANG=ja' (because EUC-JP is the de-facto standard for UNIX; not recommended) `LANG=ja_JP.jis' is used for ISO-2022-JP (aka JIS code or JUNET code). `LANG=ja_JP.sjis' is used for SHIFT-JIS (aka Microsoft Kanji Code). Setting LANG is not sufficient for a Japanese user who has just installed Linux to get a minimal Japanese environment. There are several books on establishing a Japanese environment on Linux/BSD, and magazines on Linux often have feature articles on how to establish a Japanese environment. Nowadays many Japanized Linux distributions, which are optimized so that many basic programs can display and input Japanese, are popular. Debian GNU/Linux has the `user-ja' (for potato) and `language-env' (for woody and following versions) packages to establish a basic Japanese environment. 5.1.9. Input from Keyboard -------------------------- Since Japanese characters cannot be input directly from a keyboard, software is needed to convert ASCII characters into Japanese. `WNN', `Canna', and `SKK' are popular free programs to input Japanese. Though `T-Code' is also available, it is difficult to use. Since these adopt a server/client model and implement their own protocols, we cannot input Japanese only with `wnn', `canna', or `skk' (and the packages they depend on). In the X Window System environment, the `kinput2-*' and `skkinput' packages connect these protocols and XIM, which is the standard input protocol for X. Kinput2 also has an original protocol and `kterm' and so on can be a client of kinput2 protocol.
Kinput2 protocol was developed before international standards such as XIM (or Ximp or Xsi) became available. On console, there are no standard and each software has to support wnn and/or canna protocol. For example, `jvim-canna', `xemacs21-mule-canna', and emacs20 with `emacs-dl-canna' or `emacs-dl-wnn'. Thus the ways to operate are different between softwares. `skkfep' provides a general way to input Japanese on console. Then the way to input Japanese is explained. Since almost Hiraganas and Katakanas represents a pair of a vowel and a consonant with one character, we can input one Hiragana or one Katakana with two Latin alphabets. A few Hiraganas and Katakanas need one or three alphabets. Kanji is obtained by converting from Hiragana. There are many Japanese words which are expressed by two or more Kanjis and almost recent converting softwares can convert such words at a time. (Old softwares can convert one Kanji at a time. You must be patient to use this way.) Softwares with good grammar/context analyzer and large dictionary can convert longer phrases or even a whole sentence at a time. However, we usually have to select one Kanji or word from candidates the software shows, because Japanese language has many homophones. For example, 61 Kanjis whose readings are 'KAN' and 6 words whose readings are 'KOUKOU' are registered in dictionary of `canna'. (Today, 2 Oct 1999, I saw a TV advertisement film of Japanese word processor which insists the software can correctly convert an input into 'a cafe which opened today', not 'a cafe which rotated today'. Though Japanese word 'KAITEN' means both 'open (a shop)' and 'rotate', the software knows it is more usual for a cafe to open than to rotate.) The conversion from Hiragana to Kanji needs a large dictionary which contains the Kanji spelling and readings of Japanese major words and conjugation or inflection. Thus proprietary softwares tend to efficiently convert. They usually have dictionaries larger than few megabytes. 
Some of these recent proprietary programs even analyze the topic or meaning of the input Hiragana sentence and choose the most appropriate homophone, though they often choose wrong ones. Nowadays several proprietary conversion programs such as ATOK, WNN6, and VJE for Linux are sold in Japan. Since it is complex and hard work for users to input Japanese characters, we don't want to input Y (for YES) or N (for NO) in Japanese. We prefer learning such basic English words to inputting Japanese words by invoking conversion software, inputting the Latin alphabetic expression of Japanese, converting it into Hiragana, converting it into Kanji, choosing the correct Kanji, determining the correct Kanji, and ending the conversion software each time we need to input yes or no or similar words. 5.1.10. More Detailed Discussions --------------------------------- 5.1.10.1. Width of Characters ----------------------------- Unlike in European languages, Japanese characters should be written in a fixed width. Exceptions arise when two symbols such as parentheses, periods and commas appear in succession. Kerning should be done for such cases if the software is a word processor. A text editor need not do so. 5.1.10.2. Ruby -------------- _Ruby_ is small characters (usually 1/2 in length and 1/4 in area or a bit smaller) written above (in horizontal direction) or at the right side (in vertical direction) of the main text. This is usually used to show the reading of a difficult Kanji. Japanized TeX can use ruby by using an extra macro. Word processors should support Ruby. 5.1.10.3. Upper And Lower Cases ------------------------------- Japanese characters do not have upper and lower case, although there are two sets of phonograms, Hiragana and Katakana. Thus `tolower()' and `toupper()' should not convert between Hiragana and Katakana. Hiragana is used for usual text.
Katakana is used mainly to express foreign or imported words, for example, KONPYU-TA for computer, MAIKUROSOFUTO for Microsoft, and AINSYUTAIN for Einstein. 5.1.10.4. Sorting ----------------- Phonograms (Hiragana and Katakana) have a sorting order. The order is the same as that defined in JIS X 0208, with a few exceptions. Sorting ideograms (Kanji) is difficult. They should be sorted by their reading, but almost all Kanji have a few readings depending on the context. So if you want to sort Japanese text, you will need a dictionary of all Japanese Kanji words. Moreover, a few Japanese words written in Kanji have different readings for exactly the same series of Kanji; this occurs especially for names of persons. So it is usual that addressbook databases have two 'name' columns, one for the Kanji expression and the other for Hiragana. I know of no software, free or proprietary, which can sort Japanese words perfectly. 5.1.10.5. Ro-ma ji (Alphabetic expression of Japanese) ------------------------------------------------------ We have a phonetic alphabetic expression of Japanese, Ro-ma ji. It has an almost one-to-one correspondence to Japanese phonograms. It can be used to display Japanese text on a Linux console and so on. Since Japanese has many homophones, text in this expression can be hard to read. There are several variants of Ro-ma ji. The first distinguishing point is the handling of long syllables. For example, a long syllable of 'E' is expressed as: * 'E' with caret, * 'E' with upper bar, * only 'E' in which long syllable mark is ignored, * 'EE', * and so on. The second distinguishing point is some special pairs of vowel and consonant. For example, the Hiragana character for the combination of 'T' and 'I' is pronounced like 'CHI'. * TI or CHI, as described above, * TU or TSU, * SI or SHI, * HU or FU, * WO or O, * TYA or CHA, and * N or M. 5.2.
Spanish language / used in Spain, most of America and Equatorial Guinea ---------------------------------------------------------------------------- Section written by Eusebio C Rufian-Zilbermann . Spanish is one of the official languages in Spain, the official language in most of the countries in the American continent and the official language in Equatorial Guinea. It is spoken in many other regions where it is not the official language. Other official languages in Spain are Galician, Catalan and Basque. These other languages each have their own specific issues with regards to Localization. They are not described in this section of the document. The Spanish Language derives from the variation spoken in the Castille region. The term Castillian is sometimes used to refer to the Spanish language (particularly when an author wants to stress the fact that there are other languages spoken in Spain). Both Castillian and Spanish language refer to the same language, they are not different things. 5.2.1. Characters used in Spanish --------------------------------- Spanish uses a Latin alphabet. The numerical characters used in Spanish are the Arabic numerals. The character that distinguishes Spanish from other Latin alphabets is the N~ ('N' with tilde), which exists in uppercase and lowercase versions. Vowels in Spanish may have a mark (the accent) on top of them to indicate intensity intonation. This accent is required for orthography (written correctness) on lowercase vowels but it is optional in uppercase vowels. The letter 'u' may have a dieresis (like the German umlaut), both in uppercase and lowercase forms. Some punctuation signs are characteristic of the Spanish language. The opening question mark and the opening exclamation sign look like the English question mark and exclamation sign rotated 180 degrees. The English question mark and exclamation sign are referred to as closing question mark and exclamation sign. 
The small underlined 'a' and 'o' are used mainly for ordinal numbers, similar to the small 'th' in English ordinals. 5.2.2. Character Sets --------------------- UNE (Una Norma Espan~ola) is the National Standards Organization in Spain. UNE is a member of the ISO and standards that have one-to-one correspondence are usually called by their ISO number, rather than their UNE number. ISO 8859-1, also known as ISO Latin-1, contains the characters required for Spanish. 5.2.3. Codesets --------------- The codeset most commonly used for Spanish is ISO 8859-1. The codepage Windows 1252, a.k.a. Windows Latin-1, is a superset of ISO 8859-1 that adds some characters in the range 128 to 159. Other codesets are Unicode, Macintosh Roman (codepage 10000), MS-DOS Latin-1 (codepage 850) or, less frequently, MS-DOS Latin US (codepage 437), which contains accented lowercase characters but not uppercase ones. Some additional Latin codesets are EBCDIC CP500 and CP 1026 (used in IBM mainframes and terminal emulators), Adobe Standard (used as default for Postscript documents), Nextstep Latin, HP Roman 8 (for HPUX and Laserjet resident printer fonts) and the Latin codepage in OS/2. They are all stateless, 8-bit codepages (with the exception of Unicode, which is 16-bit). 5.2.4. How These Codesets Are Used --- Information for Programmers ------------------------------------------------------------------ In most cases it is safe to use ISO 8859-1 characters. Some exceptions are: * WWW browsers should recognize all codesets. * Software which communicates with IBM mainframes, Macintosh, MS-DOS, Nextstep, HPUX, OS/2 should handle the corresponding encoding. * File names for Joliet-format CD-ROMs used for Windows are written in Unicode. * Postscript interpreters should handle the Adobe Standard character set. * Printer filters or drivers for HP printers should handle the Roman-8 character set if using the internal fonts. 5.2.5. Columns -------------- On console displays, each character occupies one column.
Printed text can be equally spaced (one column per character) or proportionally spaced (a character can occupy fractionally more or less than a column, depending on its shape). Note: Even when using Traditional Sorting, ch and ll occupy two columns. See the comment on Traditional sorting in Section 5.2.10.1, `Sorting'. 5.2.6. Writing Direction ------------------------ Spanish is normally written in left to right lines arranged from top to bottom of the page. For artistic purposes it might be written in top to bottom columns arranged left to right within the page. This columnar arrangement would be expected only in graphic and charting programs (e.g., a drawing program, a spreadsheet graph or a page layout program for composing brochures) but regular text editors wouldn't be expected to implement this style. 5.2.7. Layout of Characters --------------------------- In the Spanish language, words are separated by spaces and a line can be broken at a space, a punctuation sign or a hyphenated word. There are several sets of paired characters in Spanish. Unlike English, question marks and exclamation signs are also paired. Other paired characters are the same as in English (parentheses, square brackets, and so forth). Opening characters shouldn't appear at the end of a line. Closing characters and punctuation signs such as period and comma shouldn't appear at the beginning of a line. Words can be broken at a syllable boundary and hyphenated. Unlike English, syllables in Spanish end in a vowel more often than in a consonant. Syllables that end in a consonant letter are typically at the end of a word or followed by a syllable that starts with another consonant. Anyway, the rules are not completely consistent and a hyphenation dictionary has to be used. 5.2.8.
LANG variable --------------------

For Bash

     set meta-flag on        # keep all 8 bits for keyboard input
     set output-meta on      # keep all 8 bits for terminal output
     set convert-meta off    # don't convert escape sequences
     export LC_CTYPE=ISO_8859_1

For Tcsh

     setenv LANG C
     setenv LC_CTYPE "iso_8859_1"

5.2.9. Input from Keyboard -------------------------- For the Spanish keyboard to work correctly, you need the command `loadkeys /usr/lib/kbd/keytables/es.map' in the corresponding startup (rc) file. Most of the Spanish characters are input from the keyboard with a single stroke. A two-key combination is used for accent and dieresis marks above vowels. Traditional typewriters used a 'dead key' system with keys that would strike the paper without advancing the carriage to the next character. Typing on a computer keyboard simulates this behavior: typing the accent or dieresis key does not produce any visible output until a vowel is typed afterwards. Usually, if the accent or dieresis key is followed by a consonant, the accent key is ignored. Accented or dieresis characters cannot be used for shortcut keys for selecting options. The words for Yes and No are Si' (the character next to S is 'i' with acute accent) and No. We would commonly use the S and N keys for a Si'/No choice. Spanish keyboards usually allow for typing not only the Spanish accent signs, but also the accent signs in French and other languages (grave accent, circumflex accent, umlaut on letters other than the u). Another character that is typically available is the cedilla C (which looks like a C with a comma underneath, used for Catalan, Portuguese and French words, for example). There is a Latin-American keyboard layout that does not contain the grave accent and the cedilla C. 5.2.10. More Detailed Discussions --------------------------------- 5.2.10.1. Sorting ----------------- Traditional Spanish considered the combinations CH and LL individual single letters.
For usage in computers, this required additional effort in sorting and character counting algorithms. It was decided that the savings from not requiring special algorithms were significant enough that it would be acceptable to treat them as two separate letters. Some software that had already incorporated the special sorting algorithms now allows for choosing between 'Traditional Spanish Sort' and 'Modern Spanish Sort'.

Accents and dieresis are ignored for sorting purposes. The only exception is the rare case where two words are exactly the same and the accent is the only difference; in that case the word with the unaccented character should be sorted first. E.g., camio'n (c-a-m-i-o with acute accent-n), camionero, este, e'ste (e with acute accent-s-t-e).

The n~ (n with tilde) is always sorted after the n and before the o. It cannot be intermixed with the n.

5.2.10.2. Number format, date and currency symbols
--------------------------------------------------

The use of the dot and the comma as thousands separator and decimal point is usually the opposite of US English. E.g., 1.000,00 instead of 1,000.00. Some Spanish-speaking countries, notably Mexico, follow the same standards as the US. It is desirable that programs can handle both forms as an independent setting.

The usual date format is DD-MM-YYYY rather than MM-DD-YYYY, but again this depends on the specific country. It is desirable to have the date format as a configurable parameter.

The currency symbol can be prepended or appended to the number and it can be one or several characters long. E.g., 100 PTA for Spanish pesetas or N$ 100 for Mexican pesos. It is desirable that the symbol and its position can be individually defined and that currency symbols longer than one character are allowed.

5.2.10.3. Varieties of Spanish
------------------------------

Spanish is spoken by a tremendous variety of people.
Academics throughout the different Spanish-speaking countries realized that this could lead to a dismemberment of the language and founded the Academy of the Spanish Language. This academy has branches in most of the Spanish-speaking countries: there is a Royal Academy of the Spanish Language of Spain, an Academy of the Spanish Language of Mexico, et cetera. The members of this Academy study the local evolution of the language in each country. They meet to maintain a body of knowledge of what should be considered the Standard Spanish Language and what should be considered local or regional terms and slang terms.

In most cases, software can use terms that are within the Standard set by the Academy. When new terms appear (e.g., when a new product is created that has no previous name in the Spanish language) each region typically starts using a new word. When one or two terms become the de-facto standard, the Academy incorporates the new term into the Standard. This is a very slow process and there will be temporary usages in different regions of the Spanish-speaking world that conflict with each other.

Some people speak about Spain-Spanish and American-Spanish but most of the time it doesn't really make sense to make this distinction. First of all, even within America, there are differences between the local varieties that may be greater than the differences with Spain itself. E.g., Spanish as spoken in Mexico, Colombia and Argentina may have as many differences between them as each of them has when compared to how it is spoken in Spain. A computer user in Ecuador may feel more comfortable overall with the terms used in Spain than with the terms used in Mexico (and of course, most comfortable with the terms used in Ecuador itself!).
The options are either to produce one Spanish version of a software product that is an acceptable compromise (maybe not perfect) for all Spanish-speaking countries, or to produce multiple versions to account for all the regional variations.

A plea to all the people who are localizing software into Spanish: Let's use our efforts judiciously and create one Spanish version and not many. Let's strive for a version that conforms to the Standards and that can be as widely accepted as possible for the areas not covered by the Standards. Wouldn't you rather have a new product translated, instead of two versions of a product where one matches your local variety of the language?

5.3. Languages with Cyrillic script
-----------------------------------

Section written by Alexander Voropay.

First of all, there are a lot of languages with Cyrillic script.

Slavic languages: Russian (ru), Ukrainian (uk), Belarusian (be), Bulgarian (bg), Serbian (sr), and Macedonian (mk). Other Slavic languages (Polish (pl), Czech (cs), Croatian (hr)) use Latin script, mainly ISO-8859-2 (Central European).

During the USSR period some non-Slavic languages got their own alphabets, based on modified Cyrillic characters: Azerbaijani (az), Turkmen (tk), Kurdish (ku), Uzbek (uz), Kazakh (kk), Kirghiz (ky), Tajik (tg), Mongolian (mn), Komi (kv), etc.

* http://www.peoples.org.ru/eng_index.html
* http://www-hep.fzu.cz/~piska/
* http://www.srpsko-pismo.org/
* http://www.hr/hrvatska/language/CroLang.html
* http://ftp.fi.muni.cz/pub/localization/charsets/cs-encodings-faq

Unicode has a rich Cyrillic section.

Unfortunately, there are a lot of 8-bit Cyrillic charsets. There is no single universal 8-bit Cyrillic charset because, for example, there are about 260 Cyrillic characters in the Adobe Glyph List (http://partners.adobe.com/asn/developer/PDFS/TN/5013.Cyrillic_Font_Spec.pdf). For an overview, see "The Cyrillic Charset Soup" (http://czyborra.com/charsets/cyrillic.html).
The main problem with Russian: there are at least six live charsets:

* KOI8-R
* Windows-1251
* CP-866
* ISO-8859-5
* MAC-CYRILLIC
* ISO-IR-111

So Russian computers really live in a "charset mix", as Japanese ones do (Shift-JIS, ISO-2022-JP, EUC-JP). You can get e-mail in any charset, so your mail agent should understand all these charsets. Takasiganai.

In a POSIX environment you should set up the FULL locale name (with the .Charset field):

     LANG=ru_RU.KOI8-R
     LANG=ru_RU.ISO_8859-5
     LANG=ru_RU.CP1251

etc., for proper sorting, character classification and readable messages. Any form of abbreviation ("`ru'", "`ru_RU'", etc.) is a source of misunderstanding.

I hope Unicode `LANG=ru_RU.UTF-8' will save us in the near future...

-------------------------------------------------------------------------------

6. LOCALE technology
--------------------

_LOCALE_ is a basic concept introduced into _ISO C_ (ISO/IEC 9899:1990). The standard was expanded in 1995 (ISO 9899:1990 Amendment 1:1995). In the LOCALE model, the behavior of some C functions depends on the LOCALE environment. The LOCALE environment is divided into a few categories and each of these categories can be set independently using `setlocale()'.

_POSIX_ also determines some standards around i18n. Most of the POSIX and ISO C standards are included in the _XPG4_ (X/Open Portability Guide) standard and all of them are included in the XPG5 standard. Note that _XPG5_ is included in the UNIX specifications version 2. Thus support of XPG5 is mandatory to obtain the Unix brand. In other words, all versions of Unix operating systems support XPG5.

The merits of using locale technology over hard-coding of Unicode are:

* The software can be written in an encoding-independent way. This means that the software can support all encodings which the OS supports, including 7bit, 8bit, multibyte, stateful, and stateless encodings such as ASCII, ISO 8859-*, EUC-*, ISO 2022-*, Big5, VISCII, TIS 620, UTF-*, and so on.
* The software provides a common, unified method to configure locale and encoding. This benefits users. Otherwise, users would have to remember the method to enable UTF-8 mode for each piece of software. Some software needs a `-u8' switch, some needs an X resource setting, some needs a `.foobarrc' file, some needs a special environment variable, and some uses UTF-8 by default. It is nonsense!
* The advancement of the OS means the advancement of the software. Thus, you can use a new locale without recompiling your software.

You can read the Unicode support in the Solaris Operating Environment (http://docs.sun.com/ab2/coll.651.1/SOLUNICOSUPPT) whitepaper to understand the merit of this model. Bruno Haible's Unicode HOWTO (ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html) also recommends this model.

6.1. Locale Categories and `setlocale()'
----------------------------------------

In the LOCALE model, the behavior of some C functions depends on the LOCALE environment. The LOCALE environment is divided into six categories and each of these categories can be set independently using `setlocale()'.

The following are the six categories:

_LC_CTYPE_
     Category related to encodings. Characters which are encoded in the LC_CTYPE-dependent encoding are called _multibyte characters_. Note that a multibyte character doesn't need to be multibyte. LC_CTYPE-dependent functions are: character testing functions such as `islower()', multibyte character functions such as `mblen()', multibyte string functions such as `mbstowcs()', and so on.

_LC_COLLATE_
     Category related to sorting. `strcoll()' and so on are LC_COLLATE-dependent.

_LC_MESSAGES_
     Category related to the language of the messages the software outputs. This category is used by `gettext'.

_LC_MONETARY_
     Category related to the format for monetary numbers, for example, currency mark, comma or period, columns, and so on. `localeconv()' is the only function which is LC_MONETARY-dependent.
_LC_NUMERIC_
     Category related to the format for general numbers, for example, the character for the decimal point. Formatted I/O functions such as `printf()', string conversion functions such as `atof()', and so on are LC_NUMERIC-dependent.

_LC_TIME_
     Category related to the format for time and date, such as names of months and days of the week, order of day, month, and year, and so on. `strftime()' and so on are LC_TIME-dependent.

`setlocale()' is the function to set LOCALE. Its prototype is `char *setlocale(int category, const char *locale);'. The header file `locale.h' is needed for the prototype declaration and the definitions of the macros for category names. For example, `setlocale(LC_TIME, "de_DE");'.

For _category_, the following macros can be used: LC_CTYPE, LC_COLLATE, LC_MONETARY, LC_NUMERIC, LC_TIME, and LC_ALL (POSIX systems also define LC_MESSAGES). For _locale_, a specific locale name, `NULL', or `""' can be specified.

Giving `NULL' for _locale_ returns the current value of the specified locale category without changing it. Otherwise, `setlocale()' returns the newly set locale name, or `NULL' on error.

Given `""' for _locale_, `setlocale()' determines the locale name in the following manner:

* First, consult the `LC_ALL' environment variable.
* If `LC_ALL' is not available, consult the environment variable with the same name as the locale category. For example, `LC_COLLATE'.
* If neither of them is available, consult the `LANG' environment variable.

This is why a user is expected to set the `LANG' variable. In other words, all a user has to do is to set the `LANG' variable so that all locale-compliant software works in the desired way.

Thus, I strongly recommend calling `setlocale(LC_ALL, "");' at the beginning of your software, if the software is to be international.

6.2. Locale Names
-----------------

We can specify locale names for these six locale categories. Then, which name should we specify?
The syntax to build a locale name is determined as follows:

     language[_territory][.codeset][@modifier]

where _language_ is two lowercase letters described in ISO 639, such as `en' for English, `eo' for Esperanto, and `zh' for Chinese, and _territory_ is two uppercase letters described in ISO 3166, such as `GB' for United Kingdom, `KR' for Republic of Korea (South Korea), and `CN' for China. There is no standard for _codeset_ and _modifier_. GNU libc uses `ISO-8859-1', `ISO-8859-13', `eucJP', `SJIS', `UTF8', and so on for _codeset_, and `euro' for _modifier_.

However, which locale names are valid depends on the system. In other words, you have to install the _locale database_ for the locale you want to use. Type `locale -a' to display all supported locale names on the system.

Note that the locale names `"C"' and `"POSIX"' are defined as the names for the default behavior. For example, when your software needs to parse the output of `date(1)', you had better make sure `date(1)' runs in the `"C"' locale, for example by setting `LC_ALL=C' in its environment before invoking it (calling `setlocale(LC_TIME, "C");' changes only the locale of your own process).

6.3. Multibyte Characters and Wide Characters
---------------------------------------------

Now we will concentrate on LC_CTYPE, which is the most important of the six locale categories.

Many encodings such as ASCII, ISO 8859-*, KOI8-R, EUC-*, ISO 2022-*, TIS 620, UTF-8, and so on are used widely in the world. It is inefficient and a cause of bugs, even if not impossible, for every piece of software to implement all these encodings. Fortunately, we can use LOCALE technology to solve this problem. [1]

_Multibyte characters_ is a term for characters encoded in the locale-specific encoding. It is nothing special. It is merely a name for our daily encodings. In an ISO 8859-1 locale, ISO 8859-1 is the multibyte character encoding. In an EUC-JP locale, EUC-JP is the multibyte character encoding. In a UTF-8 locale, UTF-8 is the multibyte character encoding. In short, the multibyte character encoding is defined by the `LC_CTYPE' locale category.
Multibyte characters are used whenever your software inputs or outputs text data from/to anywhere outside your software, for example, standard input/output, display, keyboard, file, and so on, as you do every day. [2]

You can handle multibyte characters using the ordinary `char' or `unsigned char' types and the ordinary character- and string-oriented functions. It is just like what you used to do for ASCII and 8bit encodings.

Then why do we call them by the special term _multibyte characters_? The answer is that ISO C specifies a set of functions which can handle multibyte characters properly. On the other hand, it is obvious that usual C functions such as `strlen()' cannot handle multibyte characters properly.

Then what are these functions which can handle multibyte characters properly? Please wait a minute. A multibyte character encoding may be stateful or stateless and multibyte or non-multibyte, since the term includes all encodings that have ever been used and will be used on the earth. Thus it is not convenient for internal processing. It needs a complex algorithm even for, for example, extracting a character from a string, concatenating and splitting strings, or counting the number of characters in a string. Thus, _wide characters_ should be used for internal processing. And the main part of the C functions which can handle multibyte characters are functions for interconversion between multibyte characters and wide characters. These functions are introduced later. Note that you may be able to do without these functions, since ISO C supplies I/O functions with conversion.

ISO C defines wide characters so that

* all characters are expressed in a fixed width of bits.
* they are stateless, i.e., they don't have shift states.

There are two types for wide characters: `wchar_t' and `wint_t'. `wchar_t' is a type which can contain one wide character, just as the `char' type can contain one character. `wint_t' can contain one wide character or `WEOF', a substitute for `EOF'.
A string of wide characters is represented by an array of `wchar_t', just as a string of characters is represented by an array of `char'.

There are functions for `wchar_t' which substitute for the functions for `char'.

* `strcat()', `strncat()' -> `wcscat()', `wcsncat()'
* `strcpy()', `strncpy()' -> `wcscpy()', `wcsncpy()'
* `strcmp()', `strncmp()' -> `wcscmp()', `wcsncmp()'
* `strcasecmp()', `strncasecmp()' -> `wcscasecmp()', `wcsncasecmp()'
* `strcoll()', `strxfrm()' -> `wcscoll()', `wcsxfrm()'
* `strchr()', `strrchr()' -> `wcschr()', `wcsrchr()'
* `strstr()', `strpbrk()' -> `wcsstr()', `wcspbrk()'
* `strtok()', `strspn()', `strcspn()' -> `wcstok()', `wcsspn()', `wcscspn()'
* `strtol()', `strtoul()', `strtod()' -> `wcstol()', `wcstoul()', `wcstod()'
* `strftime()' -> `wcsftime()'
* `strlen()' -> `wcslen()'
* `toupper()', `tolower()' -> `towupper()', `towlower()'
* `isalnum()', `isalpha()', `isblank()', `iscntrl()', `isdigit()', `isgraph()', `islower()', `isprint()', `ispunct()', `isspace()', `isupper()', `isxdigit()' -> `iswalnum()', `iswalpha()', `iswblank()', `iswcntrl()', `iswdigit()', `iswgraph()', `iswlower()', `iswprint()', `iswpunct()', `iswspace()', `iswupper()', `iswxdigit()' (`isascii()' doesn't have a wide character version).
* `memset()', `memcpy()', `memcmp()', `memmove()', `memchr()' -> `wmemset()', `wmemcpy()', `wmemcmp()', `wmemmove()', `wmemchr()'

There are additional functions for `wchar_t'.

* `wcwidth()', `wcswidth()'
* `wctrans()', `towctrans()'

You cannot assume anything on the concrete value of `wchar_t', besides `0x21' - `0x7e' are identical to ASCI