Chapter 2 General Internationalization Features
This chapter discusses several internationalization features contained in the Solaris Operating System. The chapter covers the following topics.
Support for Code Set Independence
EUC is an abbreviation for Extended UNIX® Code. The Solaris Operating System supports non-EUC encodings such as PC-Kanji (better known as Shift_JIS) in Japan, Big5 in Taiwan, and GBK in the People's Republic of China. Because a large part of the computer market demands non-EUC code set support, the current Solaris environment provides a solid framework to enable both EUC and non-EUC code set support. This support is called Code Set Independence, or CSI.
The goal of CSI is to remove dependencies on specific code sets or encoding methods from Solaris Operating System libraries and commands. The CSI architecture enables the Solaris Operating System to support any UNIX file system safe encoding. CSI supports a number of new code sets, such as UTF-8, PC-Kanji, and Big5.
CSI Approach
Code set independence enables application and platform software developers to keep their code independent of any encoding, such as UTF-8. CSI also provides the ability to adopt any new encoding without having to modify the source code. This architectural approach differs from Java internationalization because applications do not have to be UTF-16 dependent.
Many existing internationalized applications (for example, Motif) automatically inherit CSI support from the underlying system. These applications work in the new locales without modification.
CSI is inherently independent of any code set. However, the following assumptions about file code encodings (code sets) still apply to the current Solaris system:
CSI-enabled Commands
This section lists the CSI-enabled commands in the current Solaris environment. The man page for each command includes an attribute section that indicates whether the command is CSI-enabled. All commands are in the /usr/bin directory, unless otherwise noted.
CSI-enabled Libraries
Nearly all functions in libc (/usr/lib/libc.so) are CSI-enabled. However, the following functions in libc are not CSI-enabled and therefore are EUC-dependent functions:
In the current Solaris environment, libgen (/usr/ccs/lib/libgen.a) and libcurses (/usr/ccs/lib/libcurses.a) are internationalized but not CSI-enabled.
Locale Database
The locale database format and structure are private and subject to change in a future release. When you develop internationalized applications, use the internationalization APIs in libc, which are described in Internationalization APIs in libc, rather than accessing the locale database directly.
Note – When you work in the Solaris environment, use the locale databases that are included with the current Solaris release. Do not use locales from previous Solaris versions.
Process Code Format
The process code format, which is also known as wide-character code format in the Solaris Operating System, is private and subject to change in a future release. Therefore, when you develop an international application, do not assume that the process code format stays the same. Instead, use the internationalization APIs in libc described in Internationalization APIs in libc.
Note – The process code for all Unicode locales is in UTF-32 representation. For more detail on UTF-32, refer to the Unicode Standard Annex #19: UTF-32 and Unicode Standard Annex #27: Unicode 3.1 from the Unicode Consortium or http://www.unicode.org/.
Multibyte Support Environment
A multibyte character is a character that cannot be stored in a single byte, such as Chinese, Japanese, or Korean characters. These characters require 2, 3, or 4 bytes of storage. A more precise definition can be found in ISO/IEC 9899:1990 subclause 3.13.
Amendment 1 to ISO/IEC 9899:1990 (ANSI C) added new internationalization features, collectively known as the Multibyte Support Environment (MSE). Amendment 1 defines additional internationalization APIs for multibyte code sets with state and also for better wide-character handling support.
The programming model enables these multibyte characters to be read in as logical units and stored internally as wide characters. These wide characters can be processed by the program as logical entities. Finally, these wide characters can be written out, undergoing appropriate translation, as logical units. This procedure is analogous to the way single-byte characters are read in, manipulated, and written out again. The MSE enables programs to handle multibyte characters using the same programming model that is used for single-byte characters.
Dynamically Linked Applications
You can link applications with the system libraries, such as libc, by using dynamic linking or static linking. Any application that requires internationalization features in the system libraries must be dynamically linked. If the application has been statically linked, any attempt to set the locale to anything other than C or POSIX by using the setlocale function fails. Statically linked applications can operate only in the C and POSIX locales.
By default, the linker program tries to link the application dynamically. If the command-line options to the linker and the compiler include the -Bstatic or -dn specifications, your application might be statically linked. You can check whether an existing application is dynamically linked by using the /usr/bin/ldd command. For example, the response to the following command indicates that the /sbin/sh command is not a dynamically linked program:
The response to the following command indicates that the /usr/bin/ls command has been dynamically linked with two libraries, libc.so.1 and libdl.so.1.
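Because setlocale reports failure by returning a null pointer, an application can detect at startup that the requested locale is unavailable, which is what the text above says happens in a statically linked program. The following is a minimal sketch only; the helper name init_locale is hypothetical and is not part of libc.

```c
#include <locale.h>
#include <stdio.h>

/*
 * Hypothetical helper: try to switch every locale category to `name`.
 * Pass "" to take the locale from the environment (LANG, LC_*).
 * Returns 1 on success, 0 if setlocale() rejected the request.
 */
int init_locale(const char *name)
{
    if (setlocale(LC_ALL, name) == NULL) {
        /* The previous locale stays in effect when setlocale() fails. */
        fprintf(stderr, "setlocale(LC_ALL, \"%s\") failed\n", name);
        return 0;
    }
    return 1;
}
```

A program would typically call init_locale("") once, before any locale-sensitive library call, and continue in the C locale if the call returns 0.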
Changed Interfaces
The libw and libintl interfaces have moved to libc and are no longer implemented in libw and libintl. The remaining libw and libintl shared objects ensure runtime compatibility for existing applications and, together with the archives, provide compilation environment compatibility for building applications. However, you no longer need to build applications against libw or libintl. The following list shows the stub entry points in libw:
The following list shows the stub entry points in libintl:
ctype Macros
Character classification and character transformation macros are defined in /usr/include/ctype.h. The current Solaris environment provides a set of ctype macros that support the character classification and transformation semantics defined by XPG4. For XPG4 and XPG4.2 applications to access the new macros automatically, one of the following conditions must be met:
Because _XOPEN_SOURCE, _XOPEN_VERSION, and _XOPEN_SOURCE_EXTENDED bring in extra XPG4-related features in addition to the new ctype macros, applications that are not XPG4 or XPG4.2 applications should use __XPG4_CHAR_CLASS__ instead. Corresponding ctype functions also exist, and in the current Solaris environment these functions also support XPG4 semantics.
Internationalization APIs in libc
The current Solaris environment offers two sets of APIs:
Wide-character codes are fixed-width units that each represent one logical character. Therefore, when you work with wide characters, you do not have to keep track of character boundaries as you do with multibyte characters. When a program takes input from a file, you can convert the file's multibyte data into wide-character process code directly with input functions like fscanf and fwscanf, or by using conversion functions like mbtowc and mbsrtowcs after the input. To convert output data from wide-character format to multibyte character format, use output functions like fwprintf and fprintf, or apply conversion functions like wctomb and wcsrtombs before the output.
The tables in the remainder of this chapter describe the internationalization APIs included in the current Solaris system.
The following table describes the messaging function APIs in libc.
Table 2–1 Messaging Functions in libc
The following table describes the code conversion function APIs in libc.
Table 2–2 Code Conversion in libc
The following table describes the regular expression APIs in libc.
Table 2–3 Regular Expressions in libc
The following table describes the wide character class function APIs in libc.
Table 2–4 Wide Character Class in libc
The following table lists the APIs in libc that modify and query the locale.
Table 2–5 Modify and Query Locale in libc
The following table lists the APIs in libc that query locale data.
Table 2–6 Query Locale Data in libc
The following table describes the character classification and transliteration function APIs in libc.
Table 2–7 Character Classification and Transliteration in libc
The following table describes the character collation function APIs in libc.
Table 2–8 Character Collation in libc
The following table describes the monetary formatting function APIs in libc.
Table 2–9 Monetary Formatting in libc
The following table describes the date and time formatting function APIs in libc.
Table 2–10 Date and Time Formatting in libc
The following table describes the multibyte handling function APIs in libc.
Table 2–11 Multibyte Handling in libc
The following table describes the wide character and string handling function APIs in libc.
Table 2–12 Wide Character and String Handling in libc
The following table describes the formatted wide-character input and output function APIs in libc.
Table 2–13 Formatted Wide-character Input and Output in libc
The following table describes the wide string function APIs in libc.
Table 2–14 Wide Strings in libc
The following table describes the wide-character input and output function APIs in libc.
Table 2–15 Wide-Character Input and Output in libc
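The multibyte-to-wide-character programming model described earlier in this chapter (read multibyte data, process it as wide characters, write it back as multibyte data) can be sketched as follows. This is an illustrative sketch, not code from the Solaris libraries: the function name upcase_multibyte is hypothetical, and the sketch assumes the POSIX behavior in which mbstowcs and wcstombs return the required length when the destination pointer is null.

```c
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: uppercase a multibyte string in the current locale.
 * The input is converted to wide characters, each character is transformed
 * as one logical unit, and the result is converted back to multibyte form.
 * Returns 0 on success, -1 on an invalid sequence, allocation failure,
 * or an output buffer that is too small.
 */
int upcase_multibyte(const char *in, char *out, size_t outsz)
{
    /* Assumption: a null destination asks only for the length (POSIX). */
    size_t n = mbstowcs(NULL, in, 0);
    if (n == (size_t)-1)
        return -1;

    wchar_t *wbuf = malloc((n + 1) * sizeof(wchar_t));
    if (wbuf == NULL)
        return -1;
    mbstowcs(wbuf, in, n + 1);

    /* Each element is a whole character, so no boundary tracking is needed. */
    for (size_t i = 0; i < n; i++)
        wbuf[i] = (wchar_t)towupper((wint_t)wbuf[i]);

    int rc = -1;
    size_t need = wcstombs(NULL, wbuf, 0);
    if (need != (size_t)-1 && need < outsz) {
        wcstombs(out, wbuf, outsz);   /* writes need bytes plus the NUL */
        rc = 0;
    }
    free(wbuf);
    return rc;
}
```

The code never inspects individual bytes of the input, so the same source works for single-byte and multibyte locales, provided the application has selected the locale with setlocale beforehand.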
genmsg Utility
The new genmsg utility can be used with the catgets() family of functions to create internationalized source message catalogs. The utility examines a source program file for calls to functions in catgets and builds a source message catalog from the information it finds. For example:
    % cat example.c
    ...
    /* NOTE: %s is a file name */
    printf(catgets(catd, 5, 1, "%s cannot be opened."));
    /* NOTE: "Read" is a past participle, not a present tense verb */
    printf(catgets(catd, 5, 2, "Read"));
    ...
    % genmsg -c NOTE example.c
    The following file(s) have been created.
            new msg file = "example.c.msg"
    % cat example.c.msg
    $quote "
    $set 5
    1       "%s cannot be opened"
            /* NOTE: %s is a file name */
    2       "Read"
            /* NOTE: "Read" is a past participle, not a present tense verb */
In this example, genmsg is run on the source file example.c and produces a source message catalog named example.c.msg. The -c option with the argument NOTE causes genmsg to include comments in the catalog. If a comment in the source program contains the specified string, the comment appears in the message catalog after the next string extracted from a call to catgets.
You can use genmsg to number the messages in a message set automatically. For more information, see the genmsg(1) man page.
To generate a formatted message catalog file, use the gencat(1) utility.
For information on the message extraction utility for portable message files (.po files) and also on how to generate message object files (.mo files) from the .po files, see the xgettext(1) and msgfmt(1) man pages.
User-Defined and User-Extensible Code Conversions
You can create user-defined codeset converters by using the geniconvtbl utility. This utility enables user-defined and user-customizable codeset conversions with a standard system utility and interface like iconv(1) and iconv(3C). This feature enhances the ability of an application to deal with incompatible data types, particularly data generated from proprietary or legacy applications.
Modification of existing Solaris codeset conversions is also supported. Sample input source files for the utility are available in the /usr/lib/iconv/geniconvtbl/srcs/ directory. Once the user-defined code conversions are prepared and placed properly, users can use the code conversions from the iconv(1) utility and the iconv(3C) functions in both the 32-bit and 64-bit Solaris Operating System.
Internationalized Domain Name (IDN) Support
Internationalized Domain Name (IDN) enables the use of non-English native language names as host and domain names. To use non-English host and domain names, convert these names into ASCII Compatible Encoding (ACE) names before sending the names to resolver routines, as specified in RFC 3490. System administrators must also use ACE names in system files and in system administration applications that do not support IDNs. See RFC 3490 Internationalizing Domain Names in Applications (IDNA).
The APIs for the Internationalized Domain Name in libidnkit(3EXT) provide convenient conversions between UTF-8 or the application locale's codeset and ACE. If idn_decodename2(3EXT) is used, you can also specify an arbitrary codeset name as the codeset of the input argument.
Figure 2–1 IDN to ACE Conversion
Figure 2–2 ACE to IDN Conversion
The following table shows bilateral iconv code conversions that you can use.
Table 2–16 iconv Code Conversions
The ACE and the ACE-ALLOW-UNASSIGNED iconv code conversion names have the following meanings:
The following example shows a conversion from ACE to UTF-8 with input from the hostnames.txt file. Output goes to standard output.
    system% iconv -f ACE -t UTF-8 hostnames.txt
The dedicated IDN conversion utility idnconv(1) provides IDN conversions with various options. The options control the conversion details.
For information about IDN, the conversion routines, and iconv code conversions, see the libidnkit(3LIB), idn_decodename(3EXT), idn_decodename2(3EXT), idn_encodename(3EXT), and iconv_en_US.UTF-8(5) man pages.
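The same kind of conversion can be driven from a program through the iconv(3C) functions. The following is a minimal sketch under stated assumptions: the helper name convert_codeset is hypothetical, and the codeset names passed to it must be ones that the installed iconv supports. On a Solaris system these could be names such as ACE and UTF-8 from the table above; the usage note below assumes that ISO-8859-1 and UTF-8 are accepted, which varies by platform.

```c
#include <iconv.h>
#include <string.h>

/*
 * Hypothetical helper: convert the NUL-terminated string `in` from codeset
 * `from` to codeset `to`, storing a NUL-terminated result in `out`.
 * Returns 0 on success, -1 if the conversion is unsupported, the input is
 * invalid, or `out` is too small.
 */
int convert_codeset(const char *from, const char *to,
                    const char *in, char *out, size_t outsz)
{
    if (outsz == 0)
        return -1;

    iconv_t cd = iconv_open(to, from);   /* note the order: target first */
    if (cd == (iconv_t)-1)
        return -1;

    /* Some older systems declare the input argument as const char **;
       this cast follows the current POSIX prototype (char **). */
    char *inp = (char *)in;
    size_t inleft = strlen(in);
    char *outp = out;
    size_t outleft = outsz - 1;          /* reserve room for the NUL */

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outp, &outleft);  /* flush shift state */
    iconv_close(cd);

    if (rc == (size_t)-1)
        return -1;
    *outp = '\0';
    return 0;
}
```

For example, convert_codeset("ISO-8859-1", "UTF-8", text, buf, sizeof buf) would fill buf with the UTF-8 form of text on a system that provides that conversion. A production caller would normally inspect errno to tell an undersized buffer (E2BIG) from invalid input (EILSEQ or EINVAL).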