<?xml version="1.0" encoding="utf-8"?>
<book fpi="-//Sun::SunSoft//DOCUMENT SOLUNICOSUPPT Version 1.0//en" role="numbered" label="alpha" id="ussoe" lang="en" userlevel="developer"><title lang="en">Unicode Support in the Solaris Operating Environment</title><bookinfo><bookbiblio><title lang="en">Unicode Support in the Solaris Operating Environment</title><authorgroup lang="en"><author lang="en"><firstname lang="en">John</firstname><surname lang="en">White</surname></author></authorgroup><isbn lang="en"/><pubsnumber lang="en"><gentext type="text">Part No: </gentext>806-5584</pubsnumber><releaseinfo lang="en"/><pubdate lang="en">May 2000</pubdate><publisher lang="en"><publishername lang="en">Sun Microsystems, Inc.</publishername><address lang="en"><street lang="en">901 San Antonio Road</street><city lang="en">Palo Alto<gentext type="text">, </gentext></city><state lang="en">CA<gentext type="text"></gentext></state><postcode lang="en">94303-4900</postcode><country lang="en">U.S.A.</country></address></publisher><copyright lang="en"><year lang="en">2000</year><holder lang="en">Sun Microsystems</holder></copyright><abstract lang="en"><para lang="en">Title: Unicode Support in the Solaris Operating Environment</para><para lang="en">Part number: 806-5584</para><para lang="en">Audience: System administrators, software developers</para><para lang="en">Page count: 34</para><para lang="en">Keywords: Solaris 8 operating environment, Internationalization, I18N, Unicode</para><para lang="en">This book provides information and software features for internationalizing software with Unicode</para></abstract></bookbiblio><legalnotice lang="en"><para lang="en">This product or document is protected by copyright and distributed under licenses restricting its use, copying, distribution, and decompilation. No part of this product or document may be reproduced in any form by any means without prior written authorization of Sun and its licensors, if any. 
Third-party software, including font technology, is copyrighted and licensed from Sun suppliers.</para><para lang="en">Parts of the product may be derived from Berkeley BSD systems, licensed from the University of California. UNIX is a registered trademark in the U.S. and other countries, exclusively licensed through X/Open Company, Ltd.</para><para lang="en">Sun, Sun Microsystems, the Sun logo, docs.sun.com, AnswerBook, AnswerBook2, 
 and Solaris are trademarks, registered trademarks, or service marks of Sun Microsystems, Inc. in the U.S. and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the U.S. and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc. </para><para lang="en">The OPEN LOOK and <trademark class="trade" lang="en">Sun</trademark> Graphical User Interface was developed by Sun Microsystems, Inc. for its users and licensees. Sun acknowledges the pioneering efforts of Xerox in researching and developing the concept of visual or graphical user interfaces for the computer industry. Sun holds a non-exclusive license from Xerox to the Xerox Graphical User Interface, which license also covers Sun's licensees who implement OPEN LOOK GUIs and otherwise comply with Sun's written license agreements.</para><para lang="en">Federal Acquisitions: Commercial Software-Government Users Subject to Standard License Terms and Conditions.</para><para lang="en">DOCUMENTATION IS PROVIDED "AS IS" AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID.</para><para lang="fr"/><para lang="en">Ce produit ou document est protégé par un copyright et distribué avec des licences qui en restreignent l'utilisation, la copie, la distribution, et la décompilation. Aucune partie de ce produit ou document ne peut être reproduite sous aucune forme, par quelque moyen que ce soit, sans l'autorisation préalable et écrite de Sun et de ses bailleurs de licence, s'il y en a. 
Le logiciel détenu par des tiers, et qui comprend la technologie relative aux polices de caractères, est protégé par un copyright et licencié par des fournisseurs de Sun.</para><para lang="en">Des parties de ce produit pourront être dérivées du système Berkeley BSD licenciés par l'Université de Californie. UNIX est une marque déposée aux Etats-Unis et dans d'autres pays et licenciée exclusivement par X/Open Company, Ltd. </para><para lang="en">Sun, Sun Microsystems, le logo Sun, docs.sun.com, AnswerBook, AnswerBook2, 
 et Solaris sont des marques de fabrique ou des marques déposées, ou marques de service, de Sun Microsystems, Inc. aux Etats-Unis et dans d'autres pays. Toutes les marques SPARC sont utilisées sous licence et sont des marques de fabrique ou des marques déposées de SPARC International, Inc. aux Etats-Unis et dans d'autres pays. Les produits portant les marques SPARC sont basés sur une architecture développée par Sun Microsystems, Inc.</para><para lang="en">L'interface d'utilisation graphique OPEN LOOK et <trademark class="trade" lang="en">Sun</trademark> a été développée par Sun Microsystems, Inc. pour ses utilisateurs et licenciés. Sun reconnaît les efforts de pionniers de Xerox pour la recherche et le développement du concept des interfaces d'utilisation visuelle ou graphique pour l'industrie de l'informatique. Sun détient une licence non exclusive de Xerox sur l'interface d'utilisation graphique Xerox, cette licence couvrant également les licenciés de Sun qui mettent en place l'interface d'utilisation graphique OPEN LOOK et qui en outre se conforment aux licences écrites de Sun.</para><para lang="en">CETTE PUBLICATION EST FOURNIE "EN L'ETAT" ET AUCUNE GARANTIE, EXPRESSE OU IMPLICITE, N'EST ACCORDEE, Y COMPRIS DES GARANTIES CONCERNANT LA VALEUR MARCHANDE, L'APTITUDE DE LA PUBLICATION A REPONDRE A UNE UTILISATION PARTICULIERE, OU LE FAIT QU'ELLE NE SOIT PAS CONTREFAISANTE DE PRODUIT DE TIERS. 
CE DENI DE GARANTIE NE S'APPLIQUERAIT PAS, DANS LA MESURE OU IL SERAIT TENU JURIDIQUEMENT NUL ET NON AVENU.</para></legalnotice><subjectset lang="en"><subject lang="en"><subjectterm lang="en">Programming &amp; Tools</subjectterm></subject><subject lang="en"><subjectterm lang="en">System Administration</subjectterm></subject></subjectset></bookinfo><preface id="preface-1" lang="en" role="preface"><gentext type="text">Preface</gentext><gentext type="toc">Preface</gentext><title lang="en">Preface</title><highlights lang="en"><para lang="en">The <citetitle lang="en">Unicode Support in the <trademark class="trade" lang="en">Solaris</trademark> Operating Environment</citetitle> white paper presents information and software features for internationalizing software with Unicode. </para></highlights><sect1 id="preface-9" lang="en"><title id="preface-2" lang="en">Who Should Use This Book</title><para lang="en">This white paper is intended for software developers who are interested in developing internationalized software with Unicode in the <trademark class="trade" lang="en">Solaris</trademark> operating environment. This white paper is part of a 4-part series on internationalization for Solaris software developers. 
The four internationalization white papers are:</para><itemizedlist lang="en" mark="bullet"><listitem lang="en"><para lang="en"><citetitle lang="en">Asian-Language Support in the <trademark class="trade" lang="en">Solaris</trademark> Operating Environment</citetitle></para></listitem><listitem lang="en"><para lang="en"><citetitle lang="en">Complex Text Layout Language Support in the <trademark class="trade" lang="en">Solaris</trademark> Operating Environment</citetitle></para></listitem><listitem lang="en"><para lang="en"><citetitle lang="en">Unicode Support in the <trademark class="trade" lang="en">Solaris</trademark> Operating Environment</citetitle></para></listitem><listitem lang="en"><para lang="en"><citetitle lang="en">Euro Currency Support in the <trademark class="trade" lang="en">Solaris</trademark> Operating Environment</citetitle></para></listitem></itemizedlist></sect1><sect1 id="preface-4" lang="en"><title lang="en">How This Book Is Organized</title><para lang="en"><link linkend="chapter1-1" lang="en">Chapter 1</link> describes Unicode, multilingual computing, and software internationalization.</para><para lang="en"><link linkend="chapter2-1" lang="en">Chapter 2</link> provides an overview of the Unicode standard.</para><para lang="en"><link linkend="chapter3-1" lang="en">Chapter 3</link> provides information about Unicode in the Solaris Operating Environment.</para><para lang="en"><link linkend="chapter4-1" lang="en">Chapter 4</link> addresses the technical concerns of Unicode in an internationalized application.</para><para lang="en"><link linkend="appendixa-1" lang="en">Appendix A</link> lists the codeset conversions.</para></sect1><sect1 id="preface-5" lang="en"><title lang="en">Related Books</title><para lang="en">The following books are related to software internationalization: </para><para lang="en"><itemizedlist lang="en" mark="bullet"><listitem lang="en"><para lang="en"><citetitle lang="en">Creating Worldwide Software: Solaris International 
Developer's Guide</citetitle> Bill Tuthill and David Smallberg.</para></listitem><listitem lang="en"><para lang="en"><citetitle lang="en">Internationalization Guide, Version 2: Open Group Guide</citetitle> The Open Group</para></listitem><listitem lang="en"><para lang="en"><citetitle lang="en">International Language Environments Guide</citetitle> Solaris Developer Collection.</para></listitem><listitem lang="en"><para lang="en"><citetitle lang="en">Programming for the World: A Guide to Internationalization</citetitle> Sandra Martin O'Donnell.</para></listitem><listitem lang="en"><para lang="en"><citetitle lang="en">The Unicode Standard, Version 3.0</citetitle> The Unicode Consortium.</para></listitem><listitem lang="en"><para lang="en"><citetitle lang="en">X Windows on the World, Developing Internationalized Software with X, Motif, and CDE</citetitle> Thomas C. McFarland.</para></listitem></itemizedlist></para></sect1><sect1 id="sundocs-1" lang="en"><title lang="en">Ordering Sun Documents</title><para lang="en">Fatbrain.com, an Internet professional bookstore, stocks select product
documentation from Sun Microsystems, Inc.</para><para lang="en">For a list of documents and how to order them, visit the Sun Documentation
Center on Fatbrain.com at <ulink url="http://www1.fatbrain.com/documentation/sun"><literal moreinfo="none" lang="en">http://www1.fatbrain.com/documentation/sun</literal></ulink>.</para></sect1><sect1 id="sundocs-2" lang="en"><title lang="en">Accessing Sun Documentation Online</title><para lang="en">The <trademark class="service" lang="en">docs.sun.com</trademark> Web site enables
you to access Sun technical documentation online. You can browse the docs.sun.com
archive or search for a specific book title or subject. The URL is <ulink url="http://docs.sun.com"><literal moreinfo="none" lang="en">http://docs.sun.com</literal></ulink>.</para></sect1></preface><chapter id="chapter1-1" lang="en"><gentext type="text">Chapter 1</gentext><gentext type="toc">1.  Unicode and Multilingual Computing</gentext><title lang="en">Unicode and Multilingual Computing</title><highlights lang="en"><para lang="en">Today's global economy demands global computing solutions. Instant communications across continents--and computer platforms--characterize a business world at work 24 hours a day, 7 days a week. The widespread use of the Internet and e-commerce continues to create new international challenges. </para><para lang="en">More and more, users are demanding a computing environment to suit their own linguistic and cultural needs. They want applications and file formats they can share around the world, interfaces in their own language, and local time and date displays. Essentially, users want to write and speak at the keyboard the way they write and speak in the office. </para><para lang="en">The Solaris operating environment multilingual framework (including multiple character sets and multiple cultural attributes) uses the standard universal encoding codeset, Unicode (<citetitle lang="en">The Unicode Standard, Version 3.0</citetitle>). 
Unicode is well suited to applications such as multilingual databases, e-commerce, and government research and reference.</para></highlights><sect1 id="chapter1-2" lang="en"><title lang="en"><gentext type="text">1.1 </gentext>Multilingual Computing</title><para lang="en">"Multilingual" computing can mean:</para><itemizedlist lang="en" mark="bullet"><listitem lang="en"><para lang="en">Multilanguage--multiple launches of one locale, one script.</para></listitem><listitem lang="en"><para lang="en">Multiscript--single launch of one locale, multiple scripts.</para></listitem><listitem lang="en"><para lang="en">Multilingual--single launch of multiple locales, multiple scripts.</para></listitem></itemizedlist><para lang="en">The movement from multilanguage to multiscript to multilingual implies an increased level of complexity in the underlying operating environment.</para><sect2 id="chapter1-3" lang="en"><title lang="en"><gentext type="text">1.1.1 </gentext>Multilanguage Environment</title><para lang="en">In a <emphasis lang="en">multilanguage</emphasis> environment, a locale supports one script and one set of cultural attributes. An application inherits all the language and cultural attributes of the current locale. Document text is written in one script and is manipulated according to the language rules of the locale. A separate application launch in another locale is required to use different language and cultural attributes. </para><para lang="en">For example, to write a document in Chinese, a user first sets the Chinese locale before launching the application. To write a Russian document, the Russian locale must be separately set and the application launched again. 
Chinese and Russian text cannot be mixed in the same document.</para></sect2><sect2 id="chapter1-4" lang="en"><title lang="en"><gentext type="text">1.1.2 </gentext>Multiscript Environment</title><para lang="en">In a <emphasis lang="en">multiscript</emphasis> environment, a locale can support more than one script, but only one locale can be set as current. An application creates a document in different scripts by tagging each separate script run (text in the same script). However, the current locale environment settings apply--for example, text is sorted according to the sorting rules of the current locale.</para><para lang="en">In the Chinese/Russian example above, rather than create two separate documents, the user creates one multiscript document containing both Chinese and Russian text. The cultural attributes of the active locale still apply--in the Chinese locale, the Chinese sorting rules apply to the mixed-script text. </para><note lang="en" role="note"><gentext type="text">Note - </gentext><para lang="en">In a Unicode locale, tagging script runs is not necessary because all language attributes are inherent in the Unicode codeset. </para></note></sect2><sect2 id="chapter1-5" lang="en"><title lang="en"><gentext type="text">1.1.3 </gentext>Multilingual Environment</title><para lang="en">In a <emphasis lang="en">multilingual</emphasis> environment, a locale can support multiple scripts and multiple cultural attributes, giving an application greater control over text manipulation. For example, a document containing text in multiple scripts can sort text according to the sort order of each script rather than the current locale. </para><para lang="en">In the Chinese/Russian example above, the Chinese locale sorting rules apply to the Chinese text and the Russian sorting rules apply to the Russian text. </para><para lang="en">The multilingual environment is closest to the ideal of multilingual computing. 
An application uses locale data from numerous locales, while at the same time allowing easy text manipulation in a variety of scripts. All users can easily work in their own language and be understood by others around the world. </para></sect2></sect1><sect1 id="chapter1-6" lang="en"><title lang="en"><gentext type="text">1.2 </gentext>Software Internationalization</title><para lang="en">Sun Microsystems defines the following levels at which an application can support a customer's international needs:</para><itemizedlist lang="en" mark="bullet"><listitem lang="en"><para lang="en">Internationalization</para></listitem><listitem lang="en"><para lang="en">Localization</para></listitem></itemizedlist><para lang="en">Software <emphasis lang="en">internationalization</emphasis> is the process of designing and implementing software to transparently manage different linguistic and cultural conventions without additional modification. The same binary copy of an application should run on any localized version of the Solaris operating environment, without requiring source code changes or recompilation. </para><para lang="en">Software <emphasis lang="en">localization</emphasis> is the process of adding language translation (including text messages, icons, buttons, and so on), cultural data, and components (such as input methods and spell checkers) to a product to meet regional market requirements. </para><para lang="en">The Solaris operating environment is an example of a product that supports both internationalization and localization. The Solaris operating environment is a single internationalized binary that is localized into various languages (for example, French, Japanese, and Chinese) to support the language and cultural conventions of each language. </para><para lang="en">Properly designed applications can easily accommodate a localized interface without extensive modification. 
One suggestion for creating easy-to-localize software is to first internationalize the software and then encapsulate the language- and cultural-specific elements in a locale-specific database. This greatly simplifies the localization process, should a developer choose to localize in the future. </para><para lang="en">At a minimum, Sun Microsystems strongly encourages developers to internationalize their software. Internationalized applications can run on any localized version of the Solaris operating environment and easily manage the language and cultural preferences. </para></sect1><sect1 id="chapter1-7" lang="en"><title lang="en"><gentext type="text">1.3 </gentext>Internationalization Framework</title><para lang="en">A properly internationalized application <emphasis lang="en">separates language- and cultural-specific information from the application code</emphasis>. The Solaris operating environment internationalization framework achieves this with:</para><itemizedlist lang="en" mark="bullet"><listitem lang="en"><para lang="en">Locales.</para></listitem><listitem lang="en"><para lang="en">Localizable interface.</para></listitem><listitem lang="en"><para lang="en">Codeset independence.</para></listitem></itemizedlist><para lang="en">A <emphasis lang="en">locale</emphasis> is the language and cultural data set by the user and dynamically loaded into memory at run time. The locale settings are applied to the operating system and to subsequent application launches. </para><para lang="en">The Solaris operating environment includes APIs for developers to directly access language and cultural data in the current locale. Applications can run in any locale without prior input of language or cultural data. For example, an application does not need to encode a particular currency symbol. 
Instead, it calls the appropriate system API, which returns the currency symbol of the current locale.</para><para lang="en">A <emphasis lang="en">localizable interface</emphasis> accommodates the variations that arise when an interface is translated into another language. The Solaris operating environment provides messaging APIs and utilities to collect, generate, and process messages. </para><para lang="en">An application designed for <emphasis lang="en">codeset independence</emphasis> does not assume a particular codeset. For example, text-handling routines should not define in advance the size of the character codeset.</para><para lang="en">For more information about designing applications with Unicode, see <link linkend="chapter4-1" lang="en">Chapter 4, <emphasis lang="en">Technical Considerations</emphasis></link>. For more information about the internationalization framework, see the white paper <citetitle lang="en">Asian-Language Support in the Solaris Operating Environment</citetitle>. </para></sect1><sect1 id="chapter1-8" lang="en"><title lang="en"><gentext type="text">1.4 </gentext>Supporting the Unicode Standard</title><para lang="en">Unicode (Universal Codeset) is a universal character encoding scheme developed and promoted by the Unicode Consortium, a non-profit organization which includes Sun Microsystems. The Unicode standard encompasses most alphabetic, ideographic, and symbolic characters.</para><para lang="en">Using one universal codeset enables applications to support text from multiple scripts in the same document without elaborate tagging. However, applications must treat Unicode like any other codeset, applying codeset independence to Unicode as well. </para><para lang="en">Unicode locales are called the same way and function the same way as all other locales in the Solaris operating environment. These locales provide the extra benefits that the Unicode codeset brings to the work environment, including the ability to create text in multiple scripts without having to switch locales. 
Sun Microsystems provides the same level of Unicode locale support for both 32-bit and 64-bit Solaris environments.</para></sect1><sect1 id="chapter1-9" lang="en"><title lang="en"><gentext type="text">1.5 </gentext>Benefits of Unicode</title><para lang="en">Support for Unicode provides many benefits to application developers, including:</para><itemizedlist lang="en" mark="bullet"><listitem lang="en"><para lang="en">Global source and binary.</para></listitem><listitem lang="en"><para lang="en">Support for mixed-script computing environments.</para></listitem><listitem lang="en"><para lang="en">Improved cross-platform data interoperability through a common codeset.</para></listitem><listitem lang="en"><para lang="en">Space-efficient encoding scheme for data storage.</para></listitem><listitem lang="en"><para lang="en">Reduced time-to-market for localized products.</para></listitem><listitem lang="en"><para lang="en">Expanded market access.</para></listitem></itemizedlist><para lang="en">Developers can use Unicode to create global applications. Users can exchange data more freely using one flat codeset without elaborate code conversions to comprehend characters.</para><para lang="en">In the Solaris operating environment internationalization framework, Unicode is "just another codeset." Applications designed for codeset independence can handle different codesets without extensive code rework to support specific languages.</para></sect1></chapter><chapter id="chapter2-1" lang="en"><gentext type="text">Chapter 2</gentext><gentext type="toc">2.  Unicode</gentext><title lang="en">Unicode</title><highlights lang="en"><para lang="en">In most writing systems, keyboard input is converted into character codes, stored in memory, and converted to glyphs in a particular font for display and printing. The collection of characters and character codes forms a codeset. 
Different codesets are used to represent the characters of different languages.</para><para lang="en">A character code in one codeset, however, does not necessarily represent the same character in another codeset. For example, the character code <computeroutput moreinfo="none" lang="en">0xB1</computeroutput> is the plus-minus sign (±) in Latin-1 (ISO 8859-1 codeset), capital BE in Cyrillic (ISO 8859-5 codeset), and does not represent anything in Arabic (ISO 8859-6 codeset) or Traditional Chinese (CJK unified ideographs).</para><para lang="en">In Unicode, every character, ideograph, and symbol has a unique character code, eliminating any confusion between character codes of different codesets. In Unicode, multiple codesets need not be defined. Unicode represents characters from most of the world's languages as well as publishing characters, mathematical and technical symbols, and punctuation characters. This universal representation for text data has been further enhanced and extended in the latest release of Unicode: <citetitle lang="en">The Unicode Standard, Version 3.0</citetitle>.</para></highlights><sect1 id="chapter2-2" lang="en"><title lang="en"><gentext type="text">2.1 </gentext>Unicode Coded Representations</title><para lang="en">In recent years, the Unicode Consortium and other related organizations have developed different formats to represent and store a Unicode codeset. To represent characters from all major languages in multibyte format, the ISO/IEC International Standard 10646-1 (commonly referred to as 10646) has defined the Universal Multiple-Octet Coded Character Set (UCS) format. 
Character forms contained in the 10646 specifications are:</para><itemizedlist lang="en" mark="bullet"><listitem lang="en"><para lang="en">Universal Coded Character Set-2 (UCS-2), also known as the Basic Multilingual Plane (BMP)--characters are encoded in two bytes on a single plane.</para></listitem><listitem lang="en"><para lang="en">Universal Coded Character Set-4 (UCS-4)--characters are encoded in four bytes on multiple planes and multiple groups.</para></listitem><listitem lang="en"><para lang="en">UCS Transformation Format 16-bit form (UTF-16)--extended variant of UCS-2 with characters encoded in 2 or 4 bytes.</para></listitem><listitem lang="en"><para lang="en">UCS Transformation Format 8-bit form (UTF-8)--a transformation format using characters encoded in 1-6 bytes.</para></listitem></itemizedlist><para lang="en">UCS-2 defines a 64K coding space, or BMP, to represent character codes in a two-octet row and cell format. The row and cell octets designate the cell location of a particular character code within a 256 by 256 (00-FF) plane.</para><para lang="en">UCS-4 defines a four-octet coding space divided into four units: group, plane, row, and cell. The row and cell octets designate the cell location of a particular character code within a plane. The plane octet designates the plane number (00-FF), and the group octet the group number (00-7F) to which the plane belongs. In total, there are 128 groups of 256 planes each.</para><figure float="0" id="chapter2-fig-3" lang="en"><gentext type="text">Figure 2-1 </gentext><title lang="en">UCS-2 and UCS-4 coding schemes</title><graphic filename="figures/Fig2-1.eps.gif" width="759" depth="121" lang="en"/></figure><para lang="en"/><para lang="en">In addition to the 10646 UCS forms, Unicode defines another form called UTF (UCS Transformation Format). One version of UTF is an extended UCS-2 encoding form designed to include characters from outside the BMP 64K coding space. 
This form was first called UCS-2E (extended UCS-2), but is now known as UTF-16 (UCS Transformation Format 16-bit form).</para><para lang="en">The UTF-16 form translates a range of UCS-4 codes into a two-octet encoded string. It does this by reserving an area of codes in the BMP coding space for mapping to and from 16 planes of group 00 of UCS-4. Each plane is assigned a certain set of code positions in the two-octet UCS-2 scheme. Specifically, Planes 01 to 0E (14 planes, or 14 x 65,536 = 917,504 characters) are reserved for standard encodings and Planes 0F and 10 (2 planes, or 2 x 65,536 = 131,072 characters) are reserved for private use.</para><para lang="en">Although UCS-4 and UTF-16 provide comprehensive ways to represent several character sets, they do not preserve the byte values for ASCII characters. Because all UNIX systems are based on an ASCII kernel, they reserve certain character codes for I/O operations, such as the null character as a string terminator, the slash (/) character as a path name separator, and the DEL and SPACE control characters. To circumvent this problem, another version of UTF was devised, called FSS-UTF (File System Safe-UTF), now commonly known as UTF-8.</para><para lang="en">UTF-8 is an encoding scheme which maps the entire UCS-4 character set to a series of single-octet and multi-octet strings. In this scheme, the most significant bit of every octet is 0 for ASCII characters and 1 for all other characters. The ASCII character range is contained in a single-byte encoding, and all other characters in 2- to 6-byte encodings. 
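The mapping can be sketched in a few lines of Python, shown here for illustration only. The function name <computeroutput moreinfo="none" lang="en">utf8_encode</computeroutput> is hypothetical and is not a Solaris interface, and the sketch assumes the 1- to 6-byte layout of Table 2-1. Later revisions of the UTF-8 definition limit sequences to four bytes, so the 5- and 6-byte cases reflect the scheme as described here rather than current practice.
<programlisting lang="en"><![CDATA[
```python
def utf8_encode(code_point):
    """Encode one UCS-4 code point with the 1- to 6-octet layout of Table 2-1.

    Hypothetical helper for illustration; not part of any Solaris API.
    """
    if code_point < 0 or code_point > 0x7FFFFFFF:
        raise ValueError("code point outside the 31-bit UCS-4 range")
    if code_point <= 0x7F:
        # ASCII range: a single octet whose most significant bit is 0.
        return bytes([code_point])

    # (largest code point, sequence length, lead-octet marker bits)
    layouts = [
        (0x7FF, 2, 0xC0),
        (0xFFFF, 3, 0xE0),
        (0x1FFFFF, 4, 0xF0),
        (0x3FFFFFF, 5, 0xF8),
        (0x7FFFFFFF, 6, 0xFC),
    ]
    for limit, length, marker in layouts:
        if code_point <= limit:
            break

    # Peel off six payload bits at a time, lowest bits first.
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (code_point & 0x3F))  # continuation octet 10xxxxxx
        code_point >>= 6

    lead = marker | code_point  # remaining high bits go into the lead octet
    return bytes([lead] + tail[::-1])


if __name__ == "__main__":
    for cp in (0x41, 0xB1, 0x4E2D, 0x10000):
        print(f"U+{cp:04X} ->", utf8_encode(cp).hex(" "))
```
]]></programlisting>
Every octet of a multi-octet sequence has its most significant bit set, so none of them can be mistaken for an ASCII byte such as the null character or the slash (/).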
</para><table frame="topbot" id="chapter2-tbl-1" lang="en"><gentext type="text">Table 2-1 </gentext><title lang="en">UTF-8 encoding scheme</title><tgroup cols="4" colsep="0" rowsep="0" lang="en"><colspec colwidth="7.68*"/><colspec colwidth="13.25*"/><colspec colwidth="14.76*"/><colspec colwidth="64.32*"/><thead lang="en"><row rowsep="1" lang="en"><entry lang="en"><para lang="en">Bits</para></entry><entry lang="en"><para lang="en">Hex Min</para></entry><entry lang="en"><para lang="en">Hex Max</para></entry><entry lang="en"><para lang="en">UTF-8 Binary Encoding</para></entry></row></thead><tbody lang="en"><row lang="en"><entry lang="en"><para lang="en">7</para></entry><entry lang="en"><para lang="en">00000000</para></entry><entry lang="en"><para lang="en">0000007F</para></entry><entry lang="en"><para lang="en">0xxxxxxx</para></entry></row><row lang="en"><entry lang="en"><para lang="en">11</para></entry><entry lang="en"><para lang="en">00000080</para></entry><entry lang="en"><para lang="en">000007FF</para></entry><entry lang="en"><para lang="en">110xxxxx 10xxxxxx</para></entry></row><row lang="en"><entry lang="en"><para lang="en">16</para></entry><entry lang="en"><para lang="en">00000800</para></entry><entry lang="en"><para lang="en">0000FFFF</para></entry><entry lang="en"><para lang="en">1110xxxx 10xxxxxx 10xxxxxx</para></entry></row><row lang="en"><entry lang="en"><para lang="en">21</para></entry><entry lang="en"><para lang="en">00010000</para></entry><entry lang="en"><para lang="en">001FFFFF</para></entry><entry lang="en"><para lang="en">11110xxx 10xxxxxx 10xxxxxx 10xxxxxx</para></entry></row><row lang="en"><entry lang="en"><para lang="en">26</para></entry><entry lang="en"><para lang="en">00200000</para></entry><entry lang="en"><para lang="en">03FFFFFF</para></entry><entry lang="en"><para lang="en">111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx</para></entry></row><row lang="en"><entry lang="en"><para lang="en">31</para></entry><entry lang="en"><para 
lang="en">04000000</para></entry><entry lang="en"><para lang="en">7FFFFFFF</para></entry><entry lang="en"><para lang="en">1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx</para></entry></row></tbody></tgroup></table><note lang="en" role="note"><gentext type="text">Note - </gentext><para lang="en">The UTF-8 scheme does not use any ASCII byte values in its 2- to 6-byte sequences, and each ASCII character remains a single byte with its original value. Thus, UTF-8 is compatible with all legacy file systems and other systems that parse for ASCII bytes, while UCS-2/UTF-16 and UCS-4 are not compatible with ASCII.</para><para lang="en">Furthermore, applications supporting Unicode can use existing data in ASCII format without applying a conversion utility. In addition, there is support within the Internet community for adopting UTF-8 as the Internet encoding standard.</para></note><para lang="en">In addition to its backward compatibility with 7-bit ASCII, UTF-8 is a space-efficient encoding scheme when the encoded data needs only one byte per character (as for English and other Roman character-based writing systems). Because UTF-8 stores one-byte data as one byte, rather than, for example, the two bytes required by UTF-16, this can significantly decrease the storage space required to hold large blocks of international data.</para><para lang="en">Because of its flexibility and compatibility with ASCII and UNIX, the UTF-8 format is the basis of Unicode support in the Solaris operating environment. UTF-8 provides developers with a format compatible with existing internationalized environments and an easy path for Internet and legacy data interoperability. As a file system safe format, UTF-8 supports one-byte unit I/O operations and can represent the Unicode formats UCS-2 and UCS-4. Furthermore, UTF-8 fits well within the XPG internationalization framework.</para></sect1></chapter><chapter id="chapter3-1" lang="en"><gentext type="text">Chapter 3</gentext><gentext type="toc">3.  
Unicode in the Solaris 8 Operating Environment</gentext><title lang="en">Unicode in the Solaris 8 Operating Environment</title><highlights lang="en"><para lang="en">The support of Unicode, Version 3.0 in the Solaris 8 Operating Environment's Unicode locales has provided an enhanced framework for developing multiscript applications. Properly internationalized applications require no changes to support the Unicode locales. All internationalized CUI and GUI utilities and commands in the Solaris operating environment are available in Unicode locales without modification. </para><para lang="en">All Unicode locales in the Solaris operating environment are based on the UTF-8 format. Each locale includes a base language in the UTF-8 codeset and regional data related to the base language and its cultural conventions (such as local formatting rules, text messages, help messages, and other related files). Each locale also supports several other scripts for input, display, code conversion, and printing. </para></highlights><sect1 id="chapter3-2" lang="en"><title lang="en"><gentext type="text">3.1 </gentext>Unicode UTF-8 <computeroutput moreinfo="none" lang="en">en_US.UTF-8</computeroutput> Locale</title><para lang="en"><computeroutput moreinfo="none" lang="en">en_US.UTF-8</computeroutput> is the flagship Unicode locale in the Solaris operating environment. The <computeroutput moreinfo="none" lang="en">en_US.UTF-8</computeroutput> locale is an American English-based locale with multiscript processing support for characters in many different languages. 
New and enhanced features of all Unicode locales include support of the Unicode 3.0 character set, correct rendering of complex text layout scripts, native Asian input methods, more MIME character sets in <computeroutput moreinfo="none" lang="en">dtmail</computeroutput>, various new <computeroutput moreinfo="none" lang="en">iconv</computeroutput> code conversions, and an enhanced PostScript print filter.</para><para lang="en">All Unicode locales in the Solaris operating environment support multiple scripts. Thirteen input modes are available: English/European, Cyrillic, Greek, Arabic, Hebrew, Thai, Unicode Hex, Unicode Octal, Table lookup, Japanese, Korean, Simplified Chinese, and Traditional Chinese.</para><para lang="en">Users can input characters from any combination of scripts and the entire Unicode coding space.</para><note lang="en" role="note"><gentext type="text">Note - </gentext><para lang="en">To choose an input mode, press the <computeroutput moreinfo="none" lang="en">Compose</computeroutput> key and a two-letter code. For example, to input text in Thai, press <computeroutput moreinfo="none" lang="en">Compose+tt</computeroutput>. Alternatively, click the status area and select an input mode as shown in <link linkend="chapter3-fig-1" lang="en">Figure 3-1</link>. (To select the default English/European mode, press <computeroutput moreinfo="none" lang="en">Control</computeroutput>+<computeroutput moreinfo="none" lang="en">Space</computeroutput>.) 
</para></note><table frame="topbot" id="chapter1-tbl-1" lang="en"><gentext type="text">Table 3-1 </gentext><title lang="en">UTF-8 Input Mode two-letter codes</title><tgroup cols="2" colsep="0" rowsep="0" lang="en"><colspec colname="colspec0" colwidth="50*"/><colspec colname="colspec1" colwidth="50*"/><thead lang="en"><row rowsep="1" lang="en"><entry lang="en"><para lang="en">Language</para></entry><entry lang="en"><para lang="en">Code</para></entry></row></thead><tbody lang="en"><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">Cyrillic</para></entry><entry colname="colspec1" lang="en"><para lang="en"><computeroutput moreinfo="none" lang="en">cc</computeroutput></para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">Greek</para></entry><entry colname="colspec1" lang="en"><para lang="en"><computeroutput moreinfo="none" lang="en">gg</computeroutput></para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">Thai</para></entry><entry colname="colspec1" lang="en"><para lang="en"><computeroutput moreinfo="none" lang="en">tt</computeroutput></para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">Arabic</para></entry><entry colname="colspec1" lang="en"><para lang="en"><computeroutput moreinfo="none" lang="en">ar</computeroutput></para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">Hebrew</para></entry><entry colname="colspec1" lang="en"><para lang="en"><computeroutput moreinfo="none" lang="en">hh</computeroutput></para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">Unicode Hex</para></entry><entry colname="colspec1" lang="en"><para lang="en"><computeroutput moreinfo="none" lang="en">uh</computeroutput></para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en"> Unicode Octal</para></entry><entry colname="colspec1" lang="en"><para lang="en"><computeroutput moreinfo="none" 
lang="en">uo</computeroutput></para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">Lookup</para></entry><entry colname="colspec1" lang="en"><para lang="en"><computeroutput moreinfo="none" lang="en">ll</computeroutput></para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">Japanese</para></entry><entry colname="colspec1" lang="en"><para lang="en"><computeroutput moreinfo="none" lang="en">ja</computeroutput></para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">Korean</para></entry><entry colname="colspec1" lang="en"><para lang="en"><computeroutput moreinfo="none" lang="en">ko</computeroutput></para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">Simplified Chinese</para></entry><entry colname="colspec1" lang="en"><para lang="en"><computeroutput moreinfo="none" lang="en">sc</computeroutput></para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">Traditional Chinese</para></entry><entry colname="colspec1" lang="en"><para lang="en"><computeroutput moreinfo="none" lang="en">tc</computeroutput></para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">English/European</para></entry><entry colname="colspec1" lang="en"><para lang="en"><computeroutput moreinfo="none" lang="en">Control+Space</computeroutput></para></entry></row></tbody></tgroup></table><figure float="0" id="chapter3-fig-1" lang="en"><gentext type="text">Figure 3-1 </gentext><title lang="en">UTF-8 Input Mode selection</title><graphic filename="figures/Fig3-2.tiff.gif" width="370" depth="278" scale="60" lang="en"/></figure><para lang="en">To input text from a Lookup table, select the Lookup input mode. 
A lookup table with all input modes and various symbol and technical codesets appears, as shown in <link linkend="chapter3-fig-2" lang="en">Figure 3-2</link>.</para><para lang="en">The Table lookup input mode is the easiest way for non-native speakers to input characters in a foreign language: a lookup window displays characters from a selected script, as shown for the Asian input mode in <link linkend="chapter3-fig-3" lang="en">Figure 3-3</link>.</para><para lang="en">The Arabic, Hebrew, and Thai input modes provide full complex text layout features, including right-to-left display and context-sensitive character rendering. The Unicode octal and hexadecimal code input modes generate Unicode characters from their octal and hexadecimal equivalents, respectively.</para><para lang="en">The Japanese, Korean, Simplified Chinese, and Traditional Chinese input modes provide full native Asian input.</para><figure float="0" id="chapter3-fig-2" lang="en"><gentext type="text">Figure 3-2 </gentext><title lang="en">UTF-8 Table Lookup</title><graphic filename="figures/Fig3-1.tiff.gif" width="373" depth="471" scale="60" lang="en"/></figure><figure float="0" id="chapter3-fig-3" lang="en"><gentext type="text">Figure 3-3 </gentext><title lang="en">Asian input mode</title><graphic filename="figures/Fig3-3.tiff.gif" width="363" depth="360" scale="50" lang="en"/></figure><para lang="en">For more information on each input method, refer to the chapter <emphasis lang="en">Overview of en_US.UTF-8 Locale Support</emphasis> in the latest Solaris <citetitle lang="en">International Language Environments Guide</citetitle>, <citetitle lang="en">ATOK12 User's Guide</citetitle>, <citetitle lang="en">Wnn6 User's Guide</citetitle>, <citetitle lang="en">cs00 User's Guide</citetitle>, <citetitle lang="en">Korean Solaris User's Guide</citetitle>, <citetitle lang="en">Simplified Chinese Solaris User's Guide</citetitle>, and <citetitle lang="en">Traditional Chinese Solaris User's Guide</citetitle>.
</para><para lang="en">The Unicode locales can use the enhanced <computeroutput moreinfo="none" lang="en">mp(1)</computeroutput> printing filter to print text files. <computeroutput moreinfo="none" lang="en">mp(1)</computeroutput> prints flat text files written in UTF-8 using various Solaris system and printer resident fonts (such as bitmap, Type1, TrueType) depending on the script. The output is standard PostScript. For more information, refer to the <computeroutput moreinfo="none" lang="en">mp(1)</computeroutput> man page.</para><para lang="en">The Unicode locales support various MIME character sets in <computeroutput moreinfo="none" lang="en">dtmail</computeroutput>, including various Latin, Greek, Cyrillic, Thai, and Asian character sets. Examples include ISO-8859-1 through ISO-8859-10, ISO-8859-13, ISO-8859-14, ISO-8859-15, UTF-8, UTF-7, UTF-16, UTF-16BE, UTF-16LE, Shift_JIS, ISO-2022-JP, EUC-KR, ISO-2022-KR, TIS-620, Big5, GB2312, KOI8-R, KOI8-U, and ISO-2022-CN. With this support, users can send and receive email messages encoded in MIME character sets from almost any region in the world. <computeroutput moreinfo="none" lang="en">dtmail</computeroutput> automatically decodes email by recognizing the MIME character set and content transfer encoding in the message. The sender specifies the MIME character set for the recipient mail user agent.</para><figure float="0" id="chapter3-fig-4" lang="en"><gentext type="text">Figure 3-4 </gentext><title lang="en">Multiple character sets in <computeroutput moreinfo="none" lang="en">dtmail</computeroutput></title><graphic filename="figures/Fig3-4.tiff.gif" width="341" depth="315" scale="55" lang="en"/></figure></sect1><sect1 id="chapter3-3" lang="en"><title lang="en"><gentext type="text">3.2 </gentext>Codeset Conversion</title><para lang="en">The Solaris operating environment locale supports enhanced code conversion among the major codesets of several countries. 
<link linkend="chapter3-fig-5" lang="en">Figure 3-5</link> shows the codeset conversions between UTF-8 and many other codesets.  </para><figure float="0" id="chapter3-fig-5" lang="en"><gentext type="text">Figure 3-5 </gentext><title lang="en">Unicode codeset conversions</title><graphic filename="figures/Fig3-5.tiff.gif" width="356" depth="258" scale="40" lang="en"/></figure><para lang="en">Codesets can be converted using the <computeroutput moreinfo="none" lang="en">sdtconvtool</computeroutput> utility or the <computeroutput moreinfo="none" lang="en">iconv(1)</computeroutput> command. <computeroutput moreinfo="none" lang="en">sdtconvtool</computeroutput> detects available <computeroutput moreinfo="none" lang="en">iconv</computeroutput> code conversions and presents them in an easy-to-use format.  </para><figure float="0" id="chapter3-fig-6" lang="en"><gentext type="text">Figure 3-6 </gentext><title lang="en"><computeroutput moreinfo="none" lang="en">sdtconvtool</computeroutput> for converting between codesets</title><graphic filename="figures/Fig3-6.tiff.gif" width="271" depth="269" scale="55" lang="en"/></figure><para lang="en">Users can also add their own code conversions and use them in <computeroutput moreinfo="none" lang="en">iconv(3)</computeroutput> functions, <computeroutput moreinfo="none" lang="en">iconv(1)</computeroutput> command line utilities, and <computeroutput moreinfo="none" lang="en">sdtconvtool(1)</computeroutput>. For more information on user-extensible, user-defined code conversions, refer to the <computeroutput moreinfo="none" lang="en">geniconvtbl(1)</computeroutput> and <computeroutput moreinfo="none" lang="en">geniconvtbl(4)</computeroutput> <computeroutput moreinfo="none" lang="en">man</computeroutput> pages.</para><para lang="en">Developers can use <computeroutput moreinfo="none" lang="en">iconv(3)</computeroutput> to access the same functionality. 
This includes conversions between UTF-8 and many other standard codesets, including UCS-2, UCS-4, UTF-7, UTF-16, KOI8-R, Japanese EUC, Korean EUC, Simplified Chinese EUC, Traditional Chinese EUC, GBK, PCK (Shift JIS), BIG5, Johab, ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN. </para><para lang="en">For a detailed listing of the supported code conversions, see <link linkend="appendixa-1" lang="en">Appendix A, <citetitle lang="en">Codeset Conversions</citetitle>. </link></para></sect1><sect1 id="chapter3-4" lang="en"><title lang="en"><gentext type="text">3.3 </gentext>European Unicode Locales</title><para lang="en">In the Solaris 8 operating environment, five European Unicode locales offer the same level of support as <computeroutput moreinfo="none" lang="en">en_US.UTF-8</computeroutput> with modifications for language and cultural data. </para><para lang="en">The five European Unicode locales are:</para><itemizedlist lang="en" mark="bullet"><listitem lang="en"><para lang="en"><computeroutput moreinfo="none" lang="en">fr_FR.UTF-8</computeroutput> (French)</para></listitem><listitem lang="en"><para lang="en"><computeroutput moreinfo="none" lang="en">de_DE.UTF-8</computeroutput> (German)</para></listitem><listitem lang="en"><para lang="en"><computeroutput moreinfo="none" lang="en">it_IT.UTF-8</computeroutput> (Italian)</para></listitem><listitem lang="en"><para lang="en"><computeroutput moreinfo="none" lang="en">es_ES.UTF-8</computeroutput> (Spanish)</para></listitem><listitem lang="en"><para lang="en"><computeroutput moreinfo="none" lang="en">sv_SE.UTF-8</computeroutput> (Swedish)</para></listitem></itemizedlist><para lang="en">Each locale contains the same feature set as <computeroutput moreinfo="none" lang="en">en_US.UTF-8</computeroutput> and regional definitions for numeric notation, date and time, currency, and translated text messages.</para><para lang="en">The following additional five European locales support the euro currency symbol and monetary formatting 
conventions: </para><itemizedlist lang="en" mark="bullet"><listitem lang="en"><para lang="en"><computeroutput moreinfo="none" lang="en">fr_FR.UTF-8@euro</computeroutput> (French with euro monetary convention)</para></listitem><listitem lang="en"><para lang="en"><computeroutput moreinfo="none" lang="en">de_DE.UTF-8@euro</computeroutput> (German with euro monetary convention)</para></listitem><listitem lang="en"><para lang="en"><computeroutput moreinfo="none" lang="en">it_IT.UTF-8@euro</computeroutput> (Italian with euro monetary convention)</para></listitem><listitem lang="en"><para lang="en"><computeroutput moreinfo="none" lang="en">es_ES.UTF-8@euro</computeroutput> (Spanish with euro monetary convention)</para></listitem><listitem lang="en"><para lang="en"><computeroutput moreinfo="none" lang="en">sv_SE.UTF-8@euro</computeroutput> (Swedish with euro monetary convention)</para></listitem></itemizedlist><note lang="en" role="note"><gentext type="text">Note - </gentext><para lang="en">All Unicode locales (including en_US.UTF-8 and Asian Unicode locales) support input and output of the new euro currency <emphasis lang="en">symbol</emphasis>.</para></note><figure float="0" id="chapter3-fig-7" lang="en"><gentext type="text">Figure 3-7 </gentext><title lang="en">Euro currency symbol</title><graphic filename="figures/Fig3-7.tiff.gif" width="83" depth="75" lang="en"/></figure></sect1><sect1 id="chapter3-5" lang="en"><title lang="en"><gentext type="text">3.4 </gentext>Asian Unicode Locales</title><para lang="en">The Solaris 8 operating environment also supports four Unicode locales with the same scope as <computeroutput moreinfo="none" lang="en">en_US.UTF-8</computeroutput> and the European Unicode locales, with the necessary language and cultural modifications:</para><itemizedlist lang="en" mark="bullet"><listitem lang="en"><para lang="en"><computeroutput moreinfo="none" lang="en">ja_JP.UTF-8</computeroutput> (Japanese)</para></listitem><listitem lang="en"><para 
lang="en"><computeroutput moreinfo="none" lang="en">ko_KR.UTF-8</computeroutput> (Korean)</para></listitem><listitem lang="en"><para lang="en"><computeroutput moreinfo="none" lang="en">zh_CN.UTF-8</computeroutput> (Simplified Chinese)</para></listitem><listitem lang="en"><para lang="en"><computeroutput moreinfo="none" lang="en">zh_TW.UTF-8</computeroutput> (Traditional Chinese)</para></listitem></itemizedlist><para lang="en">Each Asian Unicode locale is tailored to the Asian customer's needs. For example, the Japanese Unicode locale supports additional characters from JIS X0212-1990 at the presentation layer. All existing native Asian input methods and systems are also transparently supported.</para></sect1><sect1 id="chapter3-6" lang="en"><title lang="en"><gentext type="text">3.5 </gentext>Unicode Font Resources</title><para lang="en"><citetitle lang="en">The Unicode Standard, Version 3.0</citetitle> contains 49,194 characters from the world's scripts, with over 25,000 ideographic characters for Chinese, Japanese, and Korean. The mapping between these characters and the font resources that represent them, however, is not always one to one: some Unicode code points are associated with multiple glyphs, so that a code point can be rendered correctly based on its context. For example, in Asian languages, the Unified Han glyphs are written and displayed differently in Simplified Chinese, Traditional Chinese, Japanese kanji, and Korean hanja ideographs. </para><para lang="en">To manage these difficulties, the Solaris operating environment contains an output method combining existing fonts to form a Unicode font set, instead of providing a single Unicode font. 
The Solaris 8 operating environment supports the following range of scripts: </para><itemizedlist lang="en" mark="bullet"><listitem lang="en"><para lang="en">English/European</para></listitem><listitem lang="en"><para lang="en">Greek, Turkish, Cyrillic</para></listitem><listitem lang="en"><para lang="en">Arabic, Hebrew, Thai</para></listitem><listitem lang="en"><para lang="en">Simplified Chinese, Traditional Chinese, Japanese, Korean</para></listitem></itemizedlist><para lang="en">For European scripts, there is a one-to-one mapping between Unicode characters and corresponding glyphs. For Complex Text Layout language text (Arabic, Hebrew, Thai), the Solaris Universal Multiscript Layout Engine pre-processes the text (right-to-left swapping, contextual analysis, and so on) before rendering the associated glyphs.</para><para lang="en">For Asian characters, the Solaris operating environment output methods provide dynamic remapping of the font and glyph index according to the locale definition. Each locale contains a font table with mapping mechanisms specifying which font and glyph to use for each character code. The mechanism remaps the Unicode code point values to existing Chinese, Japanese, and Korean fonts and glyph index pairs. A locale administrator can define the sort priority among fonts. For example, the mechanism may search the Simplified Chinese fonts for the appropriate glyph and then search the Traditional Chinese fonts, and so on.</para></sect1></chapter><chapter id="chapter4-1" lang="en"><gentext type="text">Chapter 4</gentext><gentext type="toc">4.  Technical Considerations</gentext><title lang="en">Technical Considerations</title><sect1 id="chapter4-2" lang="en"><title lang="en"><gentext type="text">4.1 </gentext>Internationalized Applications with Unicode</title><para lang="en">The Unicode codeset enables developers to write applications that support multiple scripts simultaneously. 
The base language script and one or more additional scripts, depending on the Unicode locale, can be input, displayed, and printed. Distributed applications within network environments can also provide individual users access to different language environments simultaneously. </para><para lang="en">By itself, using Unicode does not make an application fully internationalized. For example, if an application customizes data handling for Unicode directly, it needs to provide codeset converters as wrappers to support a codeset other than Unicode. This approach is <emphasis lang="en">direct Unicode localization</emphasis>, not internationalization. With direct localization, developers may build localization into an application that duplicates or conflicts with the localization provided by the operating system. In addition, an application may assume that all characters are represented in two-octet cells, which conflicts with UTF-8. </para><para lang="en">To properly internationalize an application, use the following guidelines:</para><itemizedlist lang="en" mark="bullet"><listitem lang="en"><para lang="en">Avoid handling Unicode data directly. (This is a task of the platform's internationalization framework.) </para></listitem><listitem lang="en"><para lang="en">Use the POSIX model for multibyte and wide-character interfaces. See Section 4.2, <citetitle lang="en">Unicode Application Interfaces</citetitle>.</para></listitem><listitem lang="en"><para lang="en">Only call APIs that the internationalization framework provides for language- and culture-specific operations. 
All POSIX, X11, Motif, and CDE interfaces are available to Unicode locales.</para></listitem><listitem lang="en"><para lang="en">Remain codeset independent.</para></listitem></itemizedlist></sect1><sect1 id="chapter4-3" lang="en"><title lang="en"><gentext type="text">4.2 </gentext>Unicode Application Interfaces</title><para lang="en">When internationalizing applications for Unicode, developers should use the POSIX or X Window model. These models define two sets of interfaces (multibyte and wide character) without specifying the encoding methods. </para><para lang="en">Standard multibyte codesets contain characters of varying widths, from one to several bytes. Characters are represented in minimal storage space, with the fewest bytes possible. Because multibyte codesets contain characters of varying widths, they are not conveniently processed by standard functions that assume fixed-width characters; the wide-character interfaces provide a fixed-width representation for such processing. </para><para lang="en">The Unicode codeset provides the necessary format for both multibyte and wide-character representation. In the Solaris operating environment Unicode locales, multibyte interfaces use UTF-8 character set representation and wide-character interfaces use UCS-4 representation. </para></sect1><sect1 id="chapter4-4" lang="en"><title lang="en"><gentext type="text">4.3 </gentext>Font Resources</title><para lang="en">Properly internationalized applications require only a few changes to run properly in the Solaris operating environment Unicode locales. One required change is to set the proper resource definitions for font sets (<computeroutput moreinfo="none" lang="en">FontSet</computeroutput>) or font list (<computeroutput moreinfo="none" lang="en">XmFontList</computeroutput>) in the application's resource file. 
</para><para lang="en">The <computeroutput moreinfo="none" lang="en">en_US.UTF-8</computeroutput> locale supports the following set of font character sets as the <computeroutput moreinfo="none" lang="en">FontSet</computeroutput>: </para><itemizedlist lang="en" mark="bullet"><listitem lang="en"><para lang="en">ISO 8859-1 (Latin-1)</para></listitem><listitem lang="en"><para lang="en">ISO 8859-2 (Latin-2)</para></listitem><listitem lang="en"><para lang="en">ISO 8859-4 (Latin-4) </para></listitem><listitem lang="en"><para lang="en">ISO 8859-5 (Latin/Cyrillic)</para></listitem><listitem lang="en"><para lang="en">ISO 8859-7 (Latin/Greek)</para></listitem><listitem lang="en"><para lang="en">ISO 8859-9 (Latin-5)</para></listitem><listitem lang="en"><para lang="en">ISO 8859-15 (Latin-9)</para></listitem><listitem lang="en"><para lang="en">ISO 8859-6 based one (Arabic)</para></listitem><listitem lang="en"><para lang="en">ISO 8859-8 (Hebrew)</para></listitem><listitem lang="en"><para lang="en">TIS 620-2533 based one (Thai)</para></listitem><listitem lang="en"><para lang="en">BIG5 (Traditional Chinese)</para></listitem><listitem lang="en"><para lang="en">GB 2312-1980 (Simplified Chinese)</para></listitem><listitem lang="en"><para lang="en">JIS X0201-1976, JIS X0208-1983 (Japanese)</para></listitem><listitem lang="en"><para lang="en">KS C 5601-1992 Annex 3 (Korean)</para></listitem></itemizedlist></sect1><sect1 id="chapter4-5" lang="en"><title lang="en"><gentext type="text">4.4 </gentext>Setting Resource Definitions</title><para lang="en">To create a font set for an application, the resource definition should contain the complete set of fonts supported by the Unicode locale. For example:</para><programlisting format="linespecific" lang="en" role="fragment">fs = XCreateFontSet(display,
 /* Adjacent string literals are concatenated into one comma-separated list. */
 "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-1,"
 "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-2,"
 "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-4,"
 "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-5,"
 "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-6,"
 "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-7,"
 "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-8,"
 "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-9,"
 "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-15,"
 "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-big5-1,"
 "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-gb2312.1980-0,"
 "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-jisx0201.1976-0,"
 "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-jisx0208.1983-0,"
 "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-ksc5601.1992-3,"
 "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-tis620.2533-0,"
 "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-unicode-fontspecific",
  &amp;missing_ptr, &amp;missing_count, &amp;def_string);	</programlisting><para lang="en">Or,  more simply:</para><programlisting format="linespecific" lang="en" role="fragment">fs = XCreateFontSet(display, "-dt-interface system-medium-r-normal-s*utf*",
&amp;missing_ptr, &amp;missing_count, &amp;def_string);</programlisting><para lang="en">The <computeroutput moreinfo="none" lang="en">XmFontList</computeroutput> resource definition of an application should also include all fonts for every character set supported by the locale. For example: </para><programlisting format="linespecific" lang="en" role="fragment">!
! This is an example XmNFontList definition for en_US.UTF-8 locale:
*fontList:\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-1;\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-2;\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-4;\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-5;\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-6;\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-7;\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-8;\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-9;\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-15;\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-big5-1;\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-gb2312.1980-0;\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-jisx0201.1976-0;\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-jisx0208.1983-0;\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-ksc5601.1992-3;\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-tis620.2533-0;\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-unicode-fontspecific:</programlisting><para lang="en">Or, more simply:</para><programlisting format="linespecific" lang="en" role="fragment">!
! This is an example XmNFontList definition for en_US.UTF-8 locale:
*fontList: -dt-interface system-medium-r-normal-s*utf*:</programlisting></sect1></chapter><appendix id="appendixa-1" lang="en"><gentext type="text">Appendix A</gentext><gentext type="toc">A.  Codeset Conversions</gentext><title lang="en">Codeset Conversions</title><sect1 id="appendixa-2" lang="en"><title lang="en"><gentext type="text">A.1 </gentext>Codeset Conversions</title><para lang="en">The following table provides a detailed listing of the supported code
conversions.</para><note lang="en" role="note"><gentext type="text">Note - </gentext><para lang="en">Unicode* includes all of the following codesets: UTF-8, UCS-2, UCS-2BE,
UCS-2LE, UCS-4, UCS-4BE, UCS-4LE, UTF-16, UTF-16BE, UTF-16LE.</para><para lang="en">ISO 8859 codesets can also be referenced without the ISO prefix; for
example, ISO 8859-1 = 8859-1.</para></note><table frame="topbot" id="appendixa-tbl-7" lang="en"><gentext type="text">Table A-1 </gentext><title lang="en">Supported code conversions</title><tgroup cols="3" colsep="0" rowsep="0" lang="en"><colspec colname="colspec0" colwidth="18.86*"/><colspec colname="colspec1" colwidth="26.93*"/><colspec colname="colspec2" colwidth="53.20*"/><thead lang="en"><row rowsep="1" lang="en"><entry lang="en"><para lang="en">Code</para></entry><entry lang="en"><para lang="en">Code</para></entry><entry lang="en"><para lang="en">Description</para></entry></row></thead><tbody lang="en"><row lang="en"><entry lang="en"><para lang="en">Unicode*</para></entry><entry lang="en"><para lang="en">ISO 646</para></entry><entry lang="en"><para lang="en">Unicode* &lt;--&gt; ISO 646 (ASCII) </para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">Unicode*</para></entry><entry colname="colspec1" lang="en"><para lang="en">ISO 8859-1</para></entry><entry colname="colspec2" lang="en"><para lang="en">Unicode* &lt;--&gt;
ISO 8859-1 (Latin-1)</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">Unicode*</para></entry><entry colname="colspec1" lang="en"><para lang="en">ISO 8859-2</para></entry><entry colname="colspec2" lang="en"><para lang="en">Unicode* &lt;--&gt;
ISO 8859-2 (Latin-2)</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">Unicode*</para></entry><entry colname="colspec1" lang="en"><para lang="en">ISO 8859-3</para></entry><entry colname="colspec2" lang="en"><para lang="en">Unicode* &lt;--&gt;
ISO 8859-3 (Latin-3)</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">Unicode*</para></entry><entry colname="colspec1" lang="en"><para lang="en">ISO 8859-4</para></entry><entry colname="colspec2" lang="en"><para lang="en">Unicode* &lt;--&gt;
ISO 8859-4 (Latin-4)</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">Unicode*</para></entry><entry colname="colspec1" lang="en"><para lang="en">ISO 8859-5 </para></entry><entry colname="colspec2" lang="en"><para lang="en">Unicode* &lt;--&gt;
ISO 8859-5 (Cyrillic)</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">Unicode*</para></entry><entry colname="colspec1" lang="en"><para lang="en">ISO 8859-6 </para></entry><entry colname="colspec2" lang="en"><para lang="en">Unicode* &lt;--&gt;
ISO 8859-6 (Arabic)</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">Unicode*</para></entry><entry colname="colspec1" lang="en"><para lang="en">ISO 8859-7 </para></entry><entry colname="colspec2" lang="en"><para lang="en">Unicode* &lt;--&gt;
ISO 8859-7 (Greek)</para></entry></row><row lang="en"><entry lang="en"><para lang="en">Unicode*</para></entry><entry lang="en"><para lang="en">ISO 8859-8 </para></entry><entry lang="en"><para lang="en">Unicode* &lt;--&gt; ISO 8859-8 (Hebrew)</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">Unicode*</para></entry><entry colname="colspec1" lang="en"><para lang="en">ISO 8859-9</para></entry><entry colname="colspec2" lang="en"><para lang="en">Unicode* &lt;--&gt;
ISO 8859-9 (Latin-5)</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">Unicode*</para></entry><entry colname="colspec1" lang="en"><para lang="en">ISO 8859-10</para></entry><entry colname="colspec2" lang="en"><para lang="en">Unicode* &lt;--&gt;
ISO 8859-10 (Latin-6)</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">Unicode*</para></entry><entry colname="colspec1" lang="en"><para lang="en">ISO 8859-13</para></entry><entry colname="colspec2" lang="en"><para lang="en">Unicode* &lt;--&gt;
ISO 8859-13</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">Unicode*</para></entry><entry colname="colspec1" lang="en"><para lang="en">ISO 8859-14</para></entry><entry colname="colspec2" lang="en"><para lang="en">Unicode* &lt;--&gt;
ISO 8859-14</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">Unicode*</para></entry><entry colname="colspec1" lang="en"><para lang="en">ISO 8859-15</para></entry><entry colname="colspec2" lang="en"><para lang="en">Unicode* &lt;--&gt;
ISO 8859-15</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">Unicode*</para></entry><entry colname="colspec1" lang="en"><para lang="en">KOI8-R, KOI8-U, koi8-r, koi8-u</para></entry><entry colname="colspec2" lang="en"><para lang="en">Unicode* &lt;--&gt; KOI8-R, KOI8-U, koi8-r, koi8-u (Cyrillic)</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">UTF-7</para></entry><entry colname="colspec1" lang="en"><para lang="en">UCS-2, UCS-4, UTF-8</para></entry><entry colname="colspec2" lang="en"><para lang="en">UTF-7 &lt;--&gt;
UCS-2, UCS-4, UTF-8</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">UTF-8</para></entry><entry colname="colspec1" lang="en"><para lang="en">UCS-2, UCS-4, UTF-16</para></entry><entry colname="colspec2" lang="en"><para lang="en">UTF-8 &lt;--&gt;
UCS-2, UCS-4, UTF-16</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">UTF-8</para></entry><entry colname="colspec1" lang="en"><para lang="en">UCS-2BE, UCS-2LE, UCS-4BE, UCS-4LE, UTF-16BE, UTF-16LE</para></entry><entry colname="colspec2" lang="en"><para lang="en">UTF-8 &lt;--&gt; UCS-2BE, UCS-2LE, UCS-4BE, UCS-4LE,
UTF-16BE, UTF-16LE</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">UCS-4, UCS-4BE, UCS-4LE</para></entry><entry colname="colspec1" lang="en"><para lang="en">UCS-2, UCS-2BE, UCS-2LE, UTF-16, UTF-16BE,
UTF-16LE</para></entry><entry colname="colspec2" lang="en"><para lang="en">UCS-4, UCS-4BE, UCS-4LE
&lt;--&gt; UCS-2, UCS-2BE, UCS-2LE, UTF-16, UTF-16BE, UTF-16LE</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">UTF-8</para></entry><entry colname="colspec1" lang="en"><para lang="en">UTF-EBCDIC</para></entry><entry colname="colspec2" lang="en"><para lang="en">UTF-8 &lt;--&gt;
UTF-EBCDIC</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">UTF-8</para></entry><entry colname="colspec1" lang="en"><para lang="en">IBM-037, -273, -277, -278, -280, -284, -285, -297, -420, -424, -500, -850, -852,
-855, -856, -857, -862, -864, -866, -869, -870, -875, -880, -921, -922, -1025,
-1026, -1046, -1112, -1122</para></entry><entry colname="colspec2" lang="en"><para lang="en">UTF-8 &lt;--&gt; various IBM code pages (PC and EBCDIC)</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">UTF-8</para></entry><entry colname="colspec1" lang="en"><para lang="en">CP850, CP852, CP855, CP857, CP862, CP864, CP866, CP869, CP874, CP1250, CP1251,
CP1252, CP1253, CP1254, CP1255, CP1256, CP1257, CP1258</para></entry><entry colname="colspec2" lang="en"><para lang="en">UTF-8 &lt;--&gt; various Microsoft code
pages </para></entry></row><row lang="en"><entry lang="en"><para lang="en"> UTF-8</para></entry><entry lang="en"><para lang="en">eucJP</para></entry><entry lang="en"><para lang="en">UTF-8 &lt;--&gt; Japanese EUC (JIS X0201-1976,   JIS X0208-1983 and JIS
X0212-1990)</para></entry></row><row lang="en"><entry lang="en"><para lang="en">UTF-8</para></entry><entry lang="en"><para lang="en">PCK</para></entry><entry lang="en"><para lang="en">UTF-8 &lt;--&gt; Japanese PC Kanji (a.k.a. SJIS)</para></entry></row><row lang="en"><entry lang="en"><para lang="en">UTF-8</para></entry><entry lang="en"><para lang="en">ISO-2022-JP</para></entry><entry lang="en"><para lang="en">UTF-8 &lt;--&gt; Japanese MIME charset</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en"> UTF-8-Java</para></entry><entry colname="colspec1" lang="en"><para lang="en">eucJP</para></entry><entry colname="colspec2" lang="en"><para lang="en">UTF-8-Java to Japanese EUC (JIS X0201-1976,   JIS X0208-1983 and JIS
X0212-1990)</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">UTF-8-Java</para></entry><entry colname="colspec1" lang="en"><para lang="en">PCK</para></entry><entry colname="colspec2" lang="en"><para lang="en">UTF-8-Java to Japanese PC Kanji (a.k.a. SJIS)</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">UTF-8-Java</para></entry><entry colname="colspec1" lang="en"><para lang="en">ISO-2022-JP.RFC1468</para></entry><entry colname="colspec2" lang="en"><para lang="en">UTF-8-Java to Japanese MIME charset (one-way conversion)</para></entry></row><row lang="en"><entry lang="en"><para lang="en">UTF-8</para></entry><entry lang="en"><para lang="en"> ko_KR-euc</para></entry><entry lang="en"><para lang="en">UTF-8 &lt;--&gt; Korean EUC (KS C 5636 and   KS C 5601-1987)</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">UTF-8</para></entry><entry colname="colspec1" lang="en"><para lang="en">ko_KR-johap</para></entry><entry colname="colspec2" lang="en"><para lang="en">UTF-8 &lt;--&gt;
Korean Johap (of KS C 5601-1987)</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">UTF-8</para></entry><entry colname="colspec1" lang="en"><para lang="en">ko_KR-johap92</para></entry><entry colname="colspec2" lang="en"><para lang="en">UTF-8 &lt;--&gt;
Korean Johap (of KS C 5601-1992)</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">UTF-8</para></entry><entry colname="colspec1" lang="en"><para lang="en">ko_KR-iso2022-7</para></entry><entry colname="colspec2" lang="en"><para lang="en">UTF-8 &lt;--&gt;
Korean MIME charset (ISO-2022-KR)</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">UTF-8</para></entry><entry colname="colspec1" lang="en"><para lang="en">ko_KR-cp933</para></entry><entry colname="colspec2" lang="en"><para lang="en">UTF-8 &lt;--&gt;
IBM MBCS CP933</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">UTF-8</para></entry><entry colname="colspec1" lang="en"><para lang="en">gb2312</para></entry><entry colname="colspec2" lang="en"><para lang="en">UTF-8 &lt;--&gt; Simplified
Chinese EUC (GB 1988-1980  and GB 2312-1980)</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">UTF-8</para></entry><entry colname="colspec1" lang="en"><para lang="en">iso2022</para></entry><entry colname="colspec2" lang="en"><para lang="en">UTF-8 &lt;--&gt; Simplified
Chinese MIME charset   (ISO-2022-CN)</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">UTF-8</para></entry><entry colname="colspec1" lang="en"><para lang="en">GBK</para></entry><entry colname="colspec2" lang="en"><para lang="en">UTF-8 &lt;--&gt; Simplified
Chinese GBK</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">UTF-8</para></entry><entry colname="colspec1" lang="en"><para lang="en">zh_TW-euc</para></entry><entry colname="colspec2" lang="en"><para lang="en">UTF-8 &lt;--&gt;
Traditional Chinese EUC   (CNS 11643-1992)</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">UTF-8</para></entry><entry colname="colspec1" lang="en"><para lang="en">zh_TW-big5</para></entry><entry colname="colspec2" lang="en"><para lang="en">UTF-8 &lt;--&gt;
Traditional Chinese Big5</para></entry></row><row lang="en"><entry colname="colspec0" lang="en"><para lang="en">UTF-8</para></entry><entry colname="colspec1" lang="en"><para lang="en">zh_TW-iso2022-7</para></entry><entry colname="colspec2" lang="en"><para lang="en">UTF-8 &lt;--&gt;
Traditional Chinese MIME charset   (ISO-2022-TW)</para></entry></row><row lang="en"><entry lang="en"><para lang="en">UTF-8</para></entry><entry lang="en"><para lang="en">zh_TW-cp937</para></entry><entry lang="en"><para lang="en">UTF-8 &lt;--&gt; IBM MBCS CP937</para></entry></row></tbody></tgroup></table></sect1></appendix></book>