Developer's Guide to Internationalization

Support for Internationalization

3

This chapter describes features of Solaris that provide the foundation for internationalization support. Chapter 4 gives examples of international coding practices, and Chapter 5 discusses window system specifics.
Here are some of the ways that the SunOS operating system supports international software applications:
  • Data paths are 8-bit clean in order to support ISO 8859 code sets, and so that multi-byte characters can survive intact.
  • Keyboard drivers and mapping tables are provided for a variety of code sets. This allows software to cope with many European and Asian languages.
  • Printers, modems, and terminals are supported that can handle ASCII, ISO Latin-1, and EUC code sets.
  • System locales are included for French, German, Italian, Swedish, and Japanese, offering level 2 internationalization for these locales.
  • Standard C library routines provide support for writing software that can be easily localized.
  • Wide character library routines provide programming support for Asian language applications.
  • Applications can use either the X/Open or the gettext() message system for access to text that needs translation.
Note that only dynamically linked libraries provide international support. Many of the features above will not work for statically linked programs.
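As a rough illustration of the X/Open message interface named in the list above, the following sketch looks up one message with catopen() and catgets() and falls back to a built-in English string when no catalog is installed. The helper name lookup_greeting(), the set and message numbers, and the catalog name passed by the caller are all hypothetical.

```c
#include <nl_types.h>

/* Hypothetical identifiers for one message in a hypothetical catalog. */
#define DEMO_SET    1
#define DEMO_MSG_HI 1

/*
 * Return the translated greeting if the named catalog can be opened,
 * or the built-in default text otherwise.
 */
const char *lookup_greeting(const char *catalog_name)
{
    const char *fallback = "Hello, world";
    nl_catd cat = catopen(catalog_name, 0);

    if (cat == (nl_catd)-1)
        return fallback;        /* no catalog installed for this locale */

    /*
     * catgets() returns its last argument when the message is missing.
     * The catalog is deliberately left open so that the returned
     * pointer stays valid for the caller.
     */
    return catgets(cat, DEMO_SET, DEMO_MSG_HI, fallback);
}
```

A program would normally call setlocale(LC_ALL, "") before the first lookup so that the catalog matching the user's locale is chosen.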

Keyboards and Peripherals

Keyboards

The Type-5 and Type-4 keyboards are currently available in 18 versions: 15 for Roman alphabets and three for Asian languages:
Table 3-1
Belgium/France         Canada            Canada (French)
Denmark                Germany           Italy
Netherlands            Norway            Portugal
Spain                  Sweden/Finland    Switzerland (French)
Switzerland (German)   United Kingdom    United States
Japan                  Korea             Taiwan
International keyboards are normally delivered with country kits, but are also available separately.
The PC-AT101 and PC-AT102 keyboards are currently available in:
Table 3-2
Belgium/France    Canada                 Denmark
Germany           Italy                  Netherlands
Norway            Portugal               Spain
Sweden/Finland    Switzerland (French)   Switzerland (German)
United Kingdom    United States
The lower half of the ISO Latin-1 code set contains all characters from the ASCII code set, while the upper half includes accented and special characters, shown in the figure below.
Figure 3-1 Upper Half of ISO 8859-1


The PROM monitor and OpenWindows 3.3 provide fonts for western Europe, standardized as ISO 8859-1, also called ISO Latin-1. These ISO fonts use the full 8-bit address space of a byte. The system is supposed to come up in 8-bit mode by default, but in case it does not, simply type stty cs8 -istrip in a shell or command window.
When SunOS boots on a SPARC system, it automatically recognizes the keyboard type. If you plug an alternate keyboard into a running system, SPARC hardware generates an interrupt and returns you to the PROM monitor. At the PROM monitor's > or ok prompt, type c or go to continue. Then choose "Refresh" from the OpenWindows Workspace Utilities menu, and execute the loadkeys command in a shell or command window.
When SunOS boots on an x86 system, the system is supposed to come up in 8-bit mode by default. If it does not, simply type stty pass8 in a shell or command window.
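A program that needs an 8-bit clean terminal can make the same adjustment itself. The sketch below does through the POSIX termios interface roughly what stty cs8 -istrip does; the function names make_8bit_clean() and set_terminal_8bit() are hypothetical, and this is only one possible approach.

```c
#include <termios.h>

/*
 * Adjust a termios structure the way "stty cs8 -istrip" would:
 * select 8-bit characters and stop stripping the eighth bit on input.
 */
void make_8bit_clean(struct termios *t)
{
    t->c_cflag &= ~CSIZE;   /* clear the character-size bits */
    t->c_cflag |= CS8;      /* 8 bits per character */
    t->c_iflag &= ~ISTRIP;  /* keep the high bit of input bytes */
}

/* Apply the change to an open terminal; returns 0 on success, -1 on error. */
int set_terminal_8bit(int fd)
{
    struct termios t;

    if (tcgetattr(fd, &t) == -1)
        return -1;
    make_8bit_clean(&t);
    return tcsetattr(fd, TCSANOW, &t);
}
```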

Native Language Keyboards

The SPARC Type-5 and Type-4 keyboards generate 8-bit characters, or events. System translation tables generate appropriate character codes based on these events. The SunOS translation tables are in /usr/share/lib/keytables. OpenWindows keyboard tables are in $OPENWINHOME/etc/keytables.
The PC-AT101 and PC-AT102 keyboards generate 8-bit characters, or events. System translation tables generate appropriate character codes based on these events. The SunOS translation tables are in /usr/share/lib/keyboards. OpenWindows keyboard tables are in $OPENWINHOME/etc/keytables.
If at all possible, use the operating system or window system's keyboard translation tables. Applications that read events directly from the keyboard must perform their own key mappings, which isn't easy. Moreover, such applications need to be revised every time new keyboards are devised.

Generating Characters Not on a U.S. Keyboard

Although non-English characters like the German ä or the French ê are not present on a keyboard designed for use in American English, most of these characters can be generated. This allows users to write French letters on American systems, for example. There are three ways to generate characters for which there are no keycaps (explicit symbols on the keyboard):
  • Deadkeys (x86 systems only)
  • Compose sequences (SPARC and x86 systems)
  • The decimal representation of the character (x86 systems only)

Deadkeys (x86)

The deadkey was invented by typewriter manufacturers. For example, imagine you need the French character ê. A French typewriter does not have a key for this character, but it has keys for both e and ^. When the key ^ is pressed, a circumflex is printed but the typewriter carriage does not move. When the e key is then pressed, the letter "e" is printed on the same spot as the circumflex and an ê is formed. This technique works very similarly on a terminal. The only difference is that when ^ is pressed, nothing happens until e is pressed, after which the character ê appears on the screen.
In text mode, the pcmapkeys utility can be used to assign deadkeys. This utility does everything discussed in this section. To define ^ as a deadkey and try the examples below, type the command:
pcmapkeys -f /usr/share/lib/keyboards/dead/circumflex

Now when you press ^, nothing appears on the screen. When an e is typed next, the letter ê appears. To use the ^ character alone, press ^ first and then the spacebar. If you type a two-character sequence that does not make sense, no character is sent to the current application, and the machine beeps to indicate an erroneous combination.
In OpenWindows, the xmodmap utility may be used to remap keys. See the xmodmap(1) manual page for additional information.

Using the Compose Key (SPARC and x86)

Compose Key on SPARC

The regular SPARC Type-5 or Type-4 keyboard can produce all characters in the standard ISO 8859-1 (Latin-1) code set by means of the Compose key. Such characters are typically composite characters that include diacritical marks. To produce a composite character on the US-English keyboard, first press the Compose key. Next, press the key for the desired diacritical mark, and then the key for the desired alphabetical character. You may type the diacritical mark and the alphabetical character in either order.
For example, to produce à, press the Compose key, then type a and ` (in either order). When testing software, make sure to try all these combinations on your keyboard. If your software manages the keyboard directly, you should also try all special keys on the 13 European SPARC keyboards and perhaps the Asian SPARC keyboards as well.

Compose Key on x86

On x86 systems, the default COMPOSE key sequence for Solaris is CTRL SHIFT F1. (Many MS-DOS(R) (DOS) users will be familiar with it.) In COMPOSE mode, the system expects the user to type two more characters to generate a character. Press CTRL SHIFT F1 followed by n ~ to produce the Spanish ñ (the n in mañana) on the screen. If you press the COMPOSE key sequence and then press ! twice, an inverted exclamation mark appears on the screen.
In text mode, both the value of the COMPOSE key and the list of COMPOSE key sequences and the characters they generate can be specified in a file that is then processed by the pcmapkeys command (see pcmapkeys (1) ). In text mode, the following tables are only valid for the ISO-8859-1 codeset and not for the optional IBM DOS 437 codeset. In OpenWindows, these tables are valid because OpenWindows only supports the ISO-8859-1 codeset.

Compose Key Sequences

The following table shows how to produce special ISO Latin-1 characters using Compose key sequences.
Table 3-3
Compose Key Sequence   Result   Description
space space                     no-break space
!!                     ¡        inverted exclamation
c/                     ¢        cents
l-                     £        pounds sterling
ox                     ¤        currency symbol
y-                     ¥        yen
||                     ¦        broken bar
so                     §        section
""                     ¨        umlaut/diaeresis
co                     ©        copyright
-a                     ª        feminine ordinal
<<                     «        left guillemet
-|                     ¬        not sign
--                     -        soft hyphen
ro                     ®        registered
^-                     ¯        macron
^0                     °        degree
+-                     ±        plus-minus
^2                     ²        superscript 2
^3                     ³        superscript 3
\\                     ´        prime/acute accent
/u                     µ        mu/micro
P!                     ¶        pilcrow/paragraph
,,                     ¸        cedilla
^.                     ·        middle dot
^1                     ¹        superscript 1
_o                     º        masculine ordinal
>>                     »        right guillemet
14                     ¼        quarter
12                     ½        half
34                     ¾        three quarters
??                     ¿        inverted question
A`                     À        A grave
A'                     Á        A acute
A^                     Â        A circumflex
A~                     Ã        A tilde
A"                     Ä        A umlaut
A*                     Å        A angstrom
AE                     Æ        AE ligature
C,                     Ç        C cedilla
E`                     È        E grave
E'                     É        E acute
E^                     Ê        E circumflex
E"                     Ë        E umlaut
I`                     Ì        I grave
I'                     Í        I acute
I^                     Î        I circumflex
I"                     Ï        I umlaut
D-                     Ð        Eth
N~                     Ñ        N tilde
O`                     Ò        O grave
O'                     Ó        O acute
O^                     Ô        O circumflex
O~                     Õ        O tilde
O"                     Ö        O umlaut
xx                     ×        multiply
O/                     Ø        O slash
U`                     Ù        U grave
U'                     Ú        U acute
U^                     Û        U circumflex
U"                     Ü        U umlaut
Y'                     Ý        Y acute
TH                     Þ        Thorn
ss                     ß        ess zed/digraph s
a`                     à        a grave
a'                     á        a acute
a^                     â        a circumflex
a~                     ã        a tilde
a"                     ä        a umlaut
a*                     å        a angstrom
ae                     æ        ae ligature
c,                     ç        c cedilla
e`                     è        e grave
e'                     é        e acute
e^                     ê        e circumflex
e"                     ë        e umlaut
i`                     ì        i grave
i'                     í        i acute
i^                     î        i circumflex
i"                     ï        i umlaut
d-                     ð        eth
n~                     ñ        n tilde
o`                     ò        o grave
o'                     ó        o acute
o^                     ô        o circumflex
o~                     õ        o tilde
o"                     ö        o umlaut
-:                     ÷        divide
o/                     ø        o slash
u`                     ù        u grave
u'                     ú        u acute
u^                     û        u circumflex
u"                     ü        u umlaut
y'                     ý        y acute
th                     þ        thorn
y"                     ÿ        y umlaut

Decimal Representation

A third method of generating characters is using their decimal representation. Every character corresponds to a unique number. Up to 256 different characters can be used (although some terminals only support 128). When the COMPOSE key is used, followed by three digits, the character that is internally represented by the three-digit number (in decimal) is generated. This feature is also derived from the DOS system. Press the COMPOSE key sequence, followed by 065, and an A appears on the screen; 65 is the decimal value used by computers to store the uppercase letter A. Press the COMPOSE key sequence followed by 136 and the letter ê appears (136 is the code for ê in the IBM DOS 437 code set; in ISO 8859-1, ê is 234). If you type:
pcmapkeys -d

all deadkeys and compose sequences are disabled.

Using the Floating Accent Keys

On some keyboards, certain keys appear with an empty box (·) underneath the diacritical mark. These are referred to as floating accent keys. They allow you to type in a composite character without using the Compose key. Type the floating accent key first, followed by the key for the letter to be accented.

Modems

Modems set to 8-bit no-parity mode will work with 8-bit data.

Dumb Terminals

Dumb terminals set to 8-bit space-parity mode will work with 8-bit data, although they are unlikely to display the proper characters.

Printers

Support in the SunOS system for native language printing includes:
  • Transmission of 8-bit characters by lp. The serial line must also be 8-bit clean, and the printer must support the ISO Latin-1 character set, for the characters to come out properly.
  • The SunOS lp subsystem will spool PostScript(R) files. It can also translate the following formats to PostScript for spooling and printing:

    · troff to PostScript

    · TeX to PostScript

    · regular text to PostScript

    · Tektronix 4014 to PostScript

    · Diablo 630 to PostScript

    · plot(5) to PostScript

All these conversions are 8-bit clean. Be aware that standard paper sizes vary widely around the world. Internationalized applications should not assume any particular set of page sizes. SunOS provides no support for tracking the output page size; this is the responsibility of the application program itself.

Character and Code Sets

The ISO Latin-1 character set is used to represent European characters. ISO Latin-1 uses eight bits (one byte) to represent each character, allowing for 256 code values. It is compatible with the 7-bit ASCII code set: every ASCII character has the same encoding in ISO Latin-1, with the most significant bit set to 0. ISO Latin-1 can therefore be thought of as a superset of ASCII for purposes of text representation.
East Asian characters, however, cannot fit into a single byte. Consequently Chinese, Japanese, and Korean all require multi-byte code sets for language processing. SunOS uses the EUC encoding scheme to represent multi-byte characters. With EUC, one to four bytes may be used to store a character.
EUC characters are not necessarily convenient to process by standard functions because of their variable length nature, so SunOS provides functions to convert EUC characters to wide characters. In SunOS, wide characters are four bytes long, and can be processed using wide character library routines. The next sections examine EUC and wide characters in detail.

Extended UNIX Code (EUC)

SunOS has adopted the EUC from USL's Multi-National Language Supplement (MNLS). EUC is used primarily for storing data in files.
EUC comprises four code sets, three of which may be multi-byte, and two of which must be announced by 8-bit control codes known as single-shift characters. EUC's primary code set (code set 0) is used for ASCII. The three supplementary code sets (code sets 1, 2, and 3) can be assigned to different code sets by the locale administrator.
Code set 0 is single byte, with the most significant bit set to zero. The supplementary code sets can be single- or multi-byte, with the most significant bit set to one. Code sets 2 and 3 have a preceding single-shift character, known as SS2 and SS3 respectively, where SS2 = 0x8E (10001110) and SS3 = 0x8F (10001111). There is no SS1.
Differentiating between code sets is done as follows: If the high bit is 0, the code set is ASCII. If the high bit is 1, the byte is checked for SS2 or SS3 to determine code set. The length (in bytes) of characters from that code set is retrieved from the LC_CTYPE locale database governing character classification associated with the current locale.
Code set   EUC Representation
slot 0     0xxxxxxx
slot 1     1xxxxxxx or
           1xxxxxxx 1xxxxxxx or
           1xxxxxxx 1xxxxxxx 1xxxxxxx
slot 2     SS2 1xxxxxxx or
           SS2 1xxxxxxx 1xxxxxxx or
           SS2 1xxxxxxx 1xxxxxxx 1xxxxxxx
slot 3     SS3 1xxxxxxx or
           SS3 1xxxxxxx 1xxxxxxx or
           SS3 1xxxxxxx 1xxxxxxx 1xxxxxxx
Whether code sets 1, 2, and 3 are single-byte, double-byte, or triple-byte depends on the locale. Code set 1 could be used to represent ISO Latin-1, but this is not the usual practice in East Asian locales.
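The rule for telling the code sets apart can be sketched in a few lines of C. The function name euc_codeset() is hypothetical. As a simplification, the sketch treats every high-bit byte other than SS2 and SS3 as the start of a code set 1 character, and it does not look up the character length, which the text says comes from the LC_CTYPE database of the current locale.

```c
/* Single-shift bytes that announce code sets 2 and 3. */
#define EUC_SS2 0x8E
#define EUC_SS3 0x8F

/* Return the EUC code set (0 to 3) indicated by the first byte of a character. */
int euc_codeset(unsigned char first)
{
    if ((first & 0x80) == 0)
        return 0;               /* high bit clear: ASCII */
    if (first == EUC_SS2)
        return 2;               /* announced by single-shift 2 */
    if (first == EUC_SS3)
        return 3;               /* announced by single-shift 3 */
    return 1;                   /* any other high-bit byte */
}
```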
EUC divides the code set space into graphic and control characters. Graphic characters are those that can be displayed. Special characters include control characters, unassigned characters, and the space and delete characters. Control characters are characters other than graphic characters, whose occurrence may initiate, modify, or stop a control operation. The following table indicates the single-byte special characters.
Special Character               EUC Representation
Space                           00100000
Delete                          01111111
Control codes (Primary)         000xxxxx
Control codes (Supplementary)   100xxxxx
SS2 and SS3 are examples of supplementary control codes.
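The bit patterns for the single-byte special characters translate directly into mask tests. The following is a minimal sketch; the enumeration and the function name euc_special_class() are hypothetical, and bytes that match none of the listed patterns are simply reported as EUC_OTHER.

```c
/* Classes of single-byte special characters, plus a catch-all. */
enum euc_special {
    EUC_OTHER,              /* matches none of the listed patterns */
    EUC_SPACE,              /* 00100000 */
    EUC_DELETE,             /* 01111111 */
    EUC_CTRL_PRIMARY,       /* 000xxxxx */
    EUC_CTRL_SUPPLEMENTARY  /* 100xxxxx */
};

/* Classify one byte by comparing its top three bits with each pattern. */
enum euc_special euc_special_class(unsigned char b)
{
    if (b == 0x20)
        return EUC_SPACE;
    if (b == 0x7F)
        return EUC_DELETE;
    if ((b & 0xE0) == 0x00)
        return EUC_CTRL_PRIMARY;
    if ((b & 0xE0) == 0x80)
        return EUC_CTRL_SUPPLEMENTARY;
    return EUC_OTHER;
}
```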

Wide Characters

EUC is intended primarily for external data storage, and its encoding schemes provide reasonably compact representations for data storage. However, EUC is not very convenient for internal processing; its variable length nature complicates constructing homogeneous character arrays, for example. To assist convenient internal processing, SunOS provides a wide character format, plus a collection of library functions for operating on wide characters. Additional functions are available to convert from EUC format to wide characters and from wide characters back to EUC format.
Wide characters are the ANSI C data type wchar_t, defined in SunOS as typedef long. EUC code sets with one, two, or three bytes get mapped to wide characters as shown below. Four bytes are enough to represent the entire Chinese character set defined by the Chinese National Standard CNS 11643-86, with space left over for user-defined characters.
Table 3-4
Code set   EUC Representation               Wide Character Representation
0          0xxxxxxx                         00000000 00000000 00000000 0xxxxxxx
1          1xxxxxxx 1yyyyyyy                00110000 00000000 00xxxxxx xyyyyyyy
2          SS2 1xxxxxxx 1yyyyyyy 1zzzzzzz   00010000 000xxxxx xyyyyyyy yzzzzzzz
3          SS3 1xxxxxxx 1yyyyyyy 1zzzzzzz   00100000 000xxxxx xyyyyyyy yzzzzzzz
Wide characters provide a standard character size, and are useful for indexing, interprocess communication, memory management, and other tasks that use character counts and known array sizes. Wide characters are stateless and unambiguous within a given locale.
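The bit layout in Table 3-4 amounts to dropping the high bit of each EUC byte, concatenating the remaining 7-bit fields, and setting tag bits in the top byte to record the code set. The sketch below illustrates only that arithmetic; it is not the library's implementation. The name euc_to_wide() is hypothetical, and uint32_t stands in for wchar_t so that the result is four bytes wide on any platform.

```c
#include <stdint.h>

/* Tag bits that mark the code set in the top byte (taken from Table 3-4). */
static const uint32_t wide_tag[4] = {
    UINT32_C(0x00000000),   /* code set 0 */
    UINT32_C(0x30000000),   /* code set 1 */
    UINT32_C(0x10000000),   /* code set 2 */
    UINT32_C(0x20000000)    /* code set 3 */
};

/*
 * Pack the bytes of one EUC character into a 32-bit wide value.
 *   codeset : 0 to 3 (not range-checked in this sketch)
 *   bytes   : the character's bytes, without any leading SS2 or SS3
 *   nbytes  : 1 to 3
 */
uint32_t euc_to_wide(int codeset, const unsigned char *bytes, int nbytes)
{
    uint32_t value = 0;
    int i;

    for (i = 0; i < nbytes; i++)
        value = (value << 7) | (uint32_t)(bytes[i] & 0x7F);  /* drop high bit */
    return wide_tag[codeset] | value;
}
```

For instance, the two code set 1 bytes 0xA4 0xA2 carry the 7-bit fields 0x24 and 0x22, which pack to 0x1222 beneath the code set 1 tag.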
Note that with the SunOS model, the single byte ISO Latin-1 character is represented in wide character form as follows:

[Figure: wide character form of a single-byte ISO Latin-1 character]

This is the only way in which a single byte character can lose its sense of "single-bytedness."

Multi-byte Library Routines

SunOS provides four library routines to convert characters and strings from EUC representation to wide character representation and back again. SunOS also provides a set of library routines to perform standard operations on wide characters.
Use mbtowc() to convert EUC representation to a wide character, and wctomb() to convert a wide character to EUC representation. For strings (arrays of characters), use mbstowcs() to convert EUC representation strings to wide character strings and wcstombs() to convert wide character strings to EUC representation strings. All these are in libc.
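A minimal sketch of the string conversions: the hypothetical helper round_trip() converts a multi-byte string to wide characters with mbstowcs() and back with wcstombs(). The fixed 256-element buffer is an arbitrary choice for illustration. Both calls interpret bytes according to the LC_CTYPE category of the current locale, so a real program would call setlocale() first.

```c
#include <stdlib.h>
#include <wchar.h>

#define WIDE_MAX 256    /* arbitrary buffer size for this sketch */

/*
 * Convert a multi-byte string to wide characters and back again.
 * Returns the number of bytes stored in out (excluding the terminator),
 * or (size_t)-1 if a conversion fails or a buffer is too small.
 */
size_t round_trip(const char *in, char *out, size_t outsize)
{
    wchar_t wide[WIDE_MAX];
    size_t nwide, nbytes;

    nwide = mbstowcs(wide, in, WIDE_MAX);
    if (nwide == (size_t)-1 || nwide >= WIDE_MAX)
        return (size_t)-1;      /* invalid sequence, or no room for terminator */

    nbytes = wcstombs(out, wide, outsize);
    if (nbytes == (size_t)-1 || nbytes >= outsize)
        return (size_t)-1;      /* unconvertible character, or out too small */

    return nbytes;
}
```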
SunOS provides library routines in libw to replace or supplement character and string routines in libc. Compile multi-byte programs using the -lw option to the linker.
A wide character standard I/O package is available; its routines have the letter w in front of c (for character-based routines) or s (for string-based routines). There is the wsprintf() function for wide character formatting, and the wsscanf() function for wide character input interpretation. All the is*() functions are duplicated by isw*() functions for wide characters, and all the str*() functions by ws*() functions for wide character string operations.
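The isw*() classification functions are used just like their single-byte counterparts. The sketch below relies only on iswalpha() and towupper() as declared in the ISO C header <wctype.h>; it does not use the SunOS-specific ws*() string routines. The function name upcase_letters() is hypothetical.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Convert every letter of a wide character string to upper case in place.
 * Returns the number of letters found.
 */
size_t upcase_letters(wchar_t *s)
{
    size_t letters = 0;

    for (; *s != L'\0'; s++) {
        if (iswalpha((wint_t)*s)) {
            letters++;
            *s = (wchar_t)towupper((wint_t)*s);
        }
    }
    return letters;
}
```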
When making applications multi-byte capable, you should increase buffer size four-fold in order to preserve efficiency, since wchar_t is four bytes long.
Table 3-5
Library Routine     Description

Locale Management
setlocale()         set or query language or locale
nl_langinfo()       obtain various language or locale information

Character Type
isalnum()           is letter or digit
isalpha()           is a letter
isascii()           is 7-bit ASCII character
iscntrl()           is control code
isdigit()           is a digit
isgraph()           is visible
islower()           is lower-case
isprint()           is printable
ispunct()           is a punctuation mark
isspace()           is white space
isupper()           is upper-case
isxdigit()          is a hexadecimal digit
toascii()           convert to ASCII
tolower()           convert to lower-case
toupper()           convert to upper-case

String Collation
strcoll()           compare two strings
strxfrm()           transform string for comparison

Date and Time
strftime()          convert date and time to string

Formatted Output
printf()            print formatted string
fprintf()           format string to file stream
sprintf()           format string in memory

Formatted Input
scanf()             scan formatted string
fscanf()            scan string from file stream
sscanf()            scan string in memory

Monetary Format
localeconv()        returns structure containing monetary format

SunOS Messaging
bindtextdomain()    associate path name with message domain
textdomain()        open message catalog domain
gettext()           get message from catalog
dgettext()          get message from catalog domain

X/Open Messaging
catopen()           open message catalog (X/Open)
catgets()           get message from catalog (X/Open)
catclose()          close message catalog (X/Open)

Regular Expressions
regexpr(3G)         regular expression handler (is EUC multibyte-capable, but no wide character interface has been provided)

Multi-byte Handling
mblen()             get length of multi-byte character
mbtowc()            multi-byte to wide character
wctomb()            wide character to multi-byte character
mbstowcs()          multi-byte string to wide character string
wcstombs()          wide character string to multi-byte string

Wide Characters
wscat()             concatenate wide char strings
wsncat()            concatenate wide char strings to length n
wsdup()             duplicate wide char string
wscmp()             compare wide char strings
wsncmp()            compare wide char strings to length n
wscpy()             copy wide char strings
wsncpy()            copy wide char strings to length n
wschr()             find character in wide char string
wsrchr()            find character in wide char string from right
wslen()             get length of wide char string
wscol()             return display width of wide char string
wsspn()             return span of one wide char string in another
wscspn()            return span of one wide char string not in another
wspbrk()            return pointer to one wide char string in another
wstok()             move token through wide char string

Wide Formatting
wsprintf()          generate wide char string according to format
wsscanf()           interpret wide char string according to format

Wide Numbers
wstol()             convert wide char string to long integer
wstod()             convert wide char string to double precision

Wide Strings
wscasecmp()         compare wide char strings, ignore case differences
wsncasecmp()        compare wide char strings to length n (ignore case)
wscoll()            collate wide char strings
wsxfrm()            transform wide char string for comparison

Wide Standard I/O
fgetwc()            get multi-byte char from stream, convert to wide char
getwchar()          get multi-byte char from stdin, convert to wide char
fgetws()            get multi-byte string from stream, convert to wide char
getws()             get multi-byte string from stdin, convert to wide char
fputwc()            convert wide char to multi-byte char, put to stream
putwchar()          convert wide char to multi-byte char, put to stdout
fputws()            convert wide char to multi-byte string, put to stream
putws()             convert wide char to multi-byte string, put to stdout
ungetwc()           push a wide char back into input stream

Wide Ctype
iswalpha()          is wide character letter
iswupper()          is wide character upper-case
iswlower()          is wide character lower-case
iswdigit()          is wide character digit
iswxdigit()         is wide character hex digit
iswalnum()          is wide character alphanumeric
iswspace()          is wide character white space
iswpunct()          is wide character punctuation
iswprint()          is wide character printable
iswgraph()          is wide character graphic
iswcntrl()          is wide character control
iswascii()          is wide character ASCII
isphonogram()       is wide character phonogram
isideogram()        is wide character ideogram
isenglish()         is wide char in English alphabet from sup code set
isnumber()          is wide character digit from supplementary code set
isspecial()         is special wide character from sup code set
towupper()          convert wide character to upper-case
towlower()          convert wide character to lower-case

Codeset Info
getwidth()          get code set information on EUC and screen width
euclen()            get EUC byte length
euccol()            get EUC character display width
eucscol()           get EUC string display width
csetlen()           return number of bytes for an EUC code set
csetcol()           return columns needed to display EUC code set

Naming Rules

In SunOS, the following objects must be composed of ASCII characters.
  • User name, group name, and passwords
  • System name
  • Names of printers and special devices
  • Names of terminals (/dev/tty*)
  • Process ID numbers
  • Message queues, semaphores, and shared memory labels
The following may be composed of ISO Latin-1 or EUC characters:
  • File names
  • Directory names
  • Command names
  • Shell variables and environment variable names
  • Mount points for file systems
  • NIS key names and domain names
The names of NFS shared files should be composed of ASCII characters. Although files and directories may have names and contents composed of characters from supplementary code sets, using only the ASCII code set allows NFS mounting across any machine, regardless of localization.
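An application that creates files intended for NFS sharing might check names before using them. The following is a minimal sketch of such a check; the function name is_ascii_name() is hypothetical, and the test is simply that no byte has its most significant bit set.

```c
#include <stdbool.h>

/* Return true if every byte of the name is 7-bit ASCII. */
bool is_ascii_name(const char *name)
{
    const unsigned char *p;

    for (p = (const unsigned char *)name; *p != '\0'; p++) {
        if (*p & 0x80)
            return false;   /* byte from ISO Latin-1 upper half or EUC */
    }
    return true;
}
```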

What Is a Locale?

The key concept for application programs is that of a program's locale. The locale is an explicit model and definition of a native-language environment. The notion of a locale is explicitly defined and included in the library definitions of the ANSI C Language standard.
The locale consists of a number of categories for which there are language-dependent formatting or other specifications. A program's locale defines its code sets, date and time formatting conventions, monetary conventions, decimal formatting conventions, and collation order.
A locale name consists of language, territory, and possibly code set, although territory is dropped when not needed. Code set is usually assumed. For example, German is de, an abbreviation for Deutsch, while Swiss German is de_CH, CH being an abbreviation for Confoederatio Helvetica. See Appendix A for a list of accepted locale names.
Generally the locale name is specified by the LANG environment variable. Locale categories are subordinate to LANG, but may be set separately, in which case they override LANG. If LC_ALL is set, it overrides not only LANG, but all the separate locale categories as well.
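The precedence among these environment variables can be expressed as a short lookup. This sketch only illustrates the ordering; in a real program, setlocale() with an empty locale string performs the lookup itself. The function name effective_locale() is hypothetical, and two details are assumptions rather than statements from this chapter: empty values are treated as unset, and "C" is returned when nothing is set.

```c
#include <stdlib.h>

/*
 * Return the locale name that governs one category.
 * category is the name of a category variable, for example "LC_TIME".
 */
const char *effective_locale(const char *category)
{
    const char *v;

    if ((v = getenv("LC_ALL")) != NULL && *v != '\0')
        return v;               /* overrides everything else */
    if ((v = getenv(category)) != NULL && *v != '\0')
        return v;               /* category set separately: overrides LANG */
    if ((v = getenv("LANG")) != NULL && *v != '\0')
        return v;               /* general setting */
    return "C";                 /* assumed fallback when nothing is set */
}
```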

Locale Categories

The locale categories are as follows:
LC_CTYPE
A directory whose files control the behavior of character handling functions.
The LC_CTYPE/ctype file specifies character types for <ctype.h>.

LC_TIME
A readable file that specifies date and time formats, including month names,
days of the week, and common full and abbreviated representations.

LC_MONETARY
A binary file that specifies monetary formats. Very few SunOS commands or
library routines actually use this database.

LC_NUMERIC
A tiny file that specifies the decimal separator (or radix character) and the
thousands separator.

LC_COLLATE
A directory containing files that specify sorting order for a locale, and string conversions required to attain this ordering.

LC_MESSAGES
A directory containing message catalogs (user message translations). This locale directory would be empty until a localization package containing system message translations is installed. Note that many application packages would have their own separate LC_MESSAGES directories.
All of these locale categories, with the exception of LC_MESSAGES, are defined in both the X/Open and ANSI C standards. LC_MESSAGES is Sun-specific.
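As a small example of a category at work, the sketch below formats an amount with the radix character that localeconv() reports for the current LC_NUMERIC setting. The function name format_amount() is hypothetical. A program would normally call setlocale(LC_ALL, "") at startup so that the categories follow the user's environment; without that call the "C" locale applies.

```c
#include <locale.h>
#include <stdio.h>

/*
 * Write "whole<radix>hundredths" into buf, using the decimal separator
 * of the current locale. Returns the value returned by snprintf().
 */
int format_amount(char *buf, size_t size, long whole, int hundredths)
{
    const struct lconv *lc = localeconv();

    return snprintf(buf, size, "%ld%s%02d",
                    whole, lc->decimal_point, hundredths);
}
```

In the "C" locale the separator is a period; in a locale such as de it would typically be a comma.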