Glossary
- ANSI
-
American National Standards Institute. ANSI proposes standard
definitions for different computing languages. The most recent standard for
the C language, prepared by the ANSI C X3J11 Committee, includes library functions
for computing with multibyte characters for international usage, as well as
a new data type, wchar_t, for dealing with four-byte characters.
This standard is not completed, so it is referred to as the “proposed
ANSI C standard,” or ANSI C-X3J11.
- ASCII
-
American Standard Code for Information Interchange. ASCII
is a seven-bit code containing English uppercase and lowercase letters, punctuation,
numbers and control codes. The eighth bit in each byte is used by different
applications for parity checking, communication and message passing protocols,
compacting data, or other purposes. Applications that are intended to be internationalized
cannot use this bit for such purposes, because multiple code sets and multibyte
characters, and the utilities that handle them, need all eight bits of each byte.
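
The effect of the eighth bit can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical and do not come from any Solaris utility. It shows that clearing the eighth bit leaves ASCII text unchanged but turns a two-byte EUC character into two unrelated ASCII characters.

```python
def is_seven_bit(data: bytes) -> bool:
    """Return True if every byte has its eighth (most significant) bit clear."""
    return all((b & 0x80) == 0 for b in data)


def strip_eighth_bit(data: bytes) -> bytes:
    """Clear the eighth bit of every byte, as a seven-bit-only program might."""
    return bytes(b & 0x7F for b in data)


ascii_text = b"Hello"
multibyte = bytes([0xA1, 0xAA])  # a two-byte character with both high bits set

print(is_seven_bit(ascii_text))          # plain ASCII passes the check
print(is_seven_bit(multibyte))           # multibyte data does not
print(strip_eighth_bit(multibyte).hex()) # the character is no longer the same
```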
- category
-
In the Traditional Chinese Solaris documentation set, category
is related to localization. A category is a portion of a country's language
representation and cultural conventions. For instance, the date is often represented
in the U.S. as month, day, year, while in another country it might be day, month, year.
The date and time can be thought of as one category of a local language. Categories
also refer to the program categories, the environment variables that are related
to categories, and the ANSI localization tables for each category.
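
As a rough illustration of a single category, the sketch below switches only the date-and-time category (`LC_TIME`) and formats a date with the locale's own date representation. It uses Python's standard `locale` module, which wraps the C library `setlocale` call. The helper name `format_date_in_c_locale` is hypothetical, and the output for other locales depends on what is installed on the system.

```python
import datetime
import locale


def format_date_in_c_locale(day: datetime.date) -> str:
    """Format a date using the C locale's date representation (%x)."""
    previous = locale.setlocale(locale.LC_TIME)  # query the current setting
    try:
        # Change only the date/time category; other categories are untouched.
        locale.setlocale(locale.LC_TIME, "C")
        return day.strftime("%x")
    finally:
        locale.setlocale(locale.LC_TIME, previous)  # restore the old setting


print(format_date_in_c_locale(datetime.date(2024, 3, 9)))
```

In the C locale the result is in month, day, year order, which matches the U.S. convention described above.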
- character set
-
A set of elements used for the organization, control, or representation
of data. Character sets may be composed of alphabets, ideograms, or other
units. Character sets may contain other character sets, which blurs the boundaries
between them. For example, the CNS 11643 character set contains English, Greek,
and Chinese character sets in addition to Chinese radicals and many other
characters.
- CNS
-
Taiwan's Chinese National Standard. This standard is the Taiwan
analogue to ASCII. In this document set, CNS refers to the code set defined
by CNS 11643. It contains the Chinese characters, phonetic symbols and radicals,
control codes, punctuation, and western alphabets, including Roman and Greek
characters. Each character is two bytes long, with the highest or most significant
bit of each byte set to zero. In other words, CNS uses the lower seven bits
of each byte. Due to the size of the Taiwan Chinese character set, the character
sets are divided into multiple codeplanes, with the default plane containing
the most commonly used characters. ISO 2022 provides mechanisms for shifting
from one codeplane to another.
After its revision in 1992, CNS
11643 defines about 48,000 characters, which are divided among codeplanes 1-7. Codeplanes
8-16 are undefined but are included in the code set architecture. Codeplanes
1 and 2 (common and rarely used characters) are unaffected by the revision.
Characters that were in codeplane 14, a provisional user-defined plane, have
been standardized into codeplane 3, with the overflow in codeplane 4.
- code set
-
A set of unambiguous rules that establishes a character set
and the one-to-one relationship between each character in the character set
and its bit representation. For example, the English character set, including
punctuation and numbers, can be mapped to the ASCII code set in such a way
that each character corresponds to only one bit code, and no bit code corresponds
to more than one character. A code set is also called a coded character set.
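
The one-to-one requirement can be checked mechanically. The following is a minimal sketch, assuming a code set is modeled as a Python dictionary from characters to integer codes. The names `is_one_to_one`, `ascii_printable`, and `broken` are illustrative only.

```python
def is_one_to_one(code_set: dict) -> bool:
    """Return True if no two characters share the same code.

    Dictionary keys are already unique, so each character has exactly one
    code; it remains to check that each code belongs to only one character.
    """
    codes = list(code_set.values())
    return len(codes) == len(set(codes))


# Printable ASCII: each character maps to its seven-bit code.
ascii_printable = {chr(code): code for code in range(0x20, 0x7F)}

# A deliberately invalid mapping: two characters share one code.
broken = {"A": 0x41, "B": 0x41}

print(is_one_to_one(ascii_printable))
print(is_one_to_one(broken))
```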
- commit
-
To place characters entered in the preedit area into the
text block that is assembled for the application.
- EUC
-
Extended UNIX Code. Describes four code sets modelled on ISO-2022.
Each code set can contain one or more different character sets, like the Hangul
and Hanja character sets in KS C 5601. The four code sets are referred to
as code sets 0, 1, 2, and 3. In this guide, these code sets are sometimes
abbreviated as cs0, cs1, cs2, and cs3. Other internationalization efforts
sometimes call these code sets g0, g1, g2, and g3. Code set 0 is also called
the primary code set, and code sets 1, 2, and 3 are called the supplementary
code sets. In the Korean and Chinese implementations of the EUC codes, the
primary code set (cs0) contains ASCII and begins with a zero in the most significant
bit.
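
A program can usually tell which code set a character belongs to from its first byte. The sketch below is one possible way to do this. The byte values 0x8E and 0x8F for the single shift bytes SS2 and SS3, and the range 0xA1-0xFE for code set 1, are assumptions taken from common EUC practice rather than from this glossary, and the function name `euc_code_set` is hypothetical.

```python
SS2 = 0x8E  # assumed single-shift byte that introduces code set 2
SS3 = 0x8F  # assumed single-shift byte that introduces code set 3


def euc_code_set(lead_byte: int) -> int:
    """Return the EUC code set (0-3) selected by the first byte of a character."""
    if (lead_byte & 0x80) == 0:
        return 0  # most significant bit is zero: primary code set (ASCII)
    if lead_byte == SS2:
        return 2
    if lead_byte == SS3:
        return 3
    if 0xA1 <= lead_byte <= 0xFE:
        return 1  # high bit set and no single shift: code set 1
    raise ValueError(f"0x{lead_byte:02X} does not start an EUC character")


for byte in (0x41, 0xA1, SS2, SS3):
    print(hex(byte), "-> code set", euc_code_set(byte))
```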
- EUC-CNS
-
The EUC representation of CNS 11643. For code set 1, this
standard is the normal CNS code with a one in the most significant bit of
each byte. In other words, EUC-CNS equals CNS plus 0x8080. For example, the
CNS character 0x212A becomes the EUC-CNS character 0xA1AA; in binary, 00100001
00101010 becomes 10100001 10101010. For code sets 2 and 3, characters are
also prefixed by the single shift bytes SS2 and SS3, respectively. In addition, code set 2 requires
a codeplane byte. The code of a code set 2 character is SS2, followed by the codeplane
byte, followed by the EUC-CNS code. The codeplane byte is the plane number added to 0xA0.
For example, plane 2 has codeplane byte 0xA2.
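
The arithmetic in this entry can be sketched as follows. The function names are hypothetical, and the value 0x8E for SS2 is an assumption based on common EUC practice. The glossary names the byte but does not give its value.

```python
SS2 = 0x8E  # assumed value of the single-shift-2 byte


def cns_to_euc_cs1(cns: int) -> bytes:
    """Convert a two-byte CNS code to code set 1 by setting both high bits."""
    if not 0 <= cns <= 0xFFFF or cns & 0x8080:
        raise ValueError("CNS codes use only the lower seven bits of each byte")
    return (cns | 0x8080).to_bytes(2, "big")  # same as adding 0x8080


def cns_to_euc_cs2(plane: int, cns: int) -> bytes:
    """Build a code set 2 character: SS2, codeplane byte, then the EUC-CNS code."""
    if not 1 <= plane <= 16:
        raise ValueError("codeplane must be between 1 and 16")
    return bytes([SS2, 0xA0 + plane]) + cns_to_euc_cs1(cns)


print(cns_to_euc_cs1(0x212A).hex())     # a1aa, the example from the text
print(cns_to_euc_cs2(2, 0x2121).hex())  # 8ea2a1a1: SS2, plane 2, then the code
```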
- ISO
-
International Organization for Standardization. Composed of national
standards bodies, this organization studies and makes
recommendations on internationalization issues. ISO 2022 describes the code
extension techniques on which the Extended UNIX Codes are modeled. Other ISO proposals include the European 8-bit code
and communication protocols for internationalization.
- locale
-
A locale describes a language or cultural environment. Its
setting affects the display or manipulation of language-dependent features.
Traditional Chinese Solaris software provides the C locale for U.S.A., the zh_TW locale
for Traditional Chinese Extended UNIX Code (EUC), and the zh_TW.BIG5 locale for Traditional Chinese Big5.
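
A program can test whether a locale is available before relying on it. The sketch below uses Python's standard `locale` module with the three locale names listed above. Whether `zh_TW` or `zh_TW.BIG5` is accepted depends entirely on the locales installed on the system, and the helper name `first_available_locale` is hypothetical.

```python
import locale

CANDIDATES = ["zh_TW.BIG5", "zh_TW", "C"]


def first_available_locale(names):
    """Return the first locale name the system accepts, or None if none work."""
    previous = locale.setlocale(locale.LC_ALL)  # remember the current setting
    try:
        for name in names:
            try:
                locale.setlocale(locale.LC_ALL, name)
                return name
            except locale.Error:
                continue  # this locale is not installed; try the next one
        return None
    finally:
        locale.setlocale(locale.LC_ALL, previous)  # leave the process unchanged


print(first_available_locale(CANDIDATES))
```

Because the C locale is always available, the call above always returns some name from the list.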
- POSIX
-
Portable Operating System Interface for Computer Environments. An IEEE
standards group comprising seven committees that create documents for standardizing
and internationalizing UNIX. POSIX document 1003.1 deals with the kernel and
system calls. Document 1003.2 concerns the shell and utilities. The other
five deal with real-time computing, communications and networking, and other
issues.
- Unicode
-
The international character set and encoding developed by
the Unicode Consortium.
- Wide character code (WC)
-
A constant-width four-byte code, called WC in Asian Solaris
documentation, for the internal representation of EUC codes using the new
ANSI-C data type wchar_t. Although EUC does not specify
limits on the size of the supplementary code sets (code set 0 is always one
byte), WC specifies a character as four bytes. Standardizing on four bytes
takes up more memory space than necessary if the environment is primarily
ASCII, but this practice also speeds processing of strings of mixed
characters. Because every character occupies exactly four bytes, the 1000th character
(counting from zero) always begins at byte 4000, and the 0th character starts at byte 0.
This fixed width is useful for any type of indexing in applications.
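
The indexing benefit of a fixed width can be sketched as follows. This is an illustration only and does not reproduce the actual Solaris wchar_t layout: the little-endian byte order, the use of plain integers as character codes, and the function names are all assumptions.

```python
import struct

WIDTH = 4  # every wide character occupies four bytes


def pack_wide(codes) -> bytes:
    """Store each character code as one four-byte unit (byte order assumed)."""
    return b"".join(struct.pack("<I", code) for code in codes)


def wide_char_at(buffer: bytes, index: int) -> int:
    """Fetch character number `index` (counting from zero) in constant time."""
    offset = index * WIDTH  # no scanning: the position follows from the index
    (code,) = struct.unpack_from("<I", buffer, offset)
    return code


buffer = pack_wide(range(2000))
print(1000 * WIDTH)                # byte offset of character 1000
print(wide_char_at(buffer, 1000))  # the code stored at that offset
```

With a variable-width encoding such as EUC, finding the same character would require scanning the string from the beginning.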
- X/Open
-
X/Open started as a consortium of international
UNIX vendors from Europe, the USA, and Asia. It is now one of the major standards
organizations, like POSIX and ANSI, and is the source of the X/Open System Interface
Portability Guide.