- CHAPTER 7
Writing Internationalized Code
- This chapter describes some specific steps that you should take to internationalize applications. The material is divided into four main topics: text and codesets, formatting and collation, user messages, and nonglobal locales.
Linking
- Some internationalization components depend on dynamic linking to function correctly. The default when compiling and linking in the Solaris environment is dynamic linking. Take care not to specify static linking.
Text and Codesets
Call setlocale()
- The SunOS system supports the POSIX/ANSI C function setlocale(), which initializes language and cultural conventions. Most applications should set the locale category LC_CTYPE except those not concerned with character interpretation, such as block I/O to disk or network. To control the dynamic handling of different codesets in an application, add these lines to your code:
-
-
#include <locale.h>
int main(void)
{
    (void) setlocale(LC_CTYPE, "");
    return 0;
}
- Among other things, this ensures that European accented characters such as ö are correctly identified with an isalpha() library call. Note that the empty string argument indicates that the application should set its codeset according to the environment variable LC_ALL, LC_CTYPE, or LANG--in that order of precedence. If none of these environment variables is set, the default locale is C, which results in old-style UNIX behavior.
- LC_CTYPE affects the behavior of various ctype(3) library routines. The LC_CTYPE locale category may also affect other functions, including wide-character handling.
- In most cases library packages should rely on the programmer to call setlocale() inside the application. Applications that fail to call setlocale() would simply fail to get international features.
- To set all locale categories at the same time, use the LC_ALL argument to setlocale() instead of just LC_CTYPE. In practice, most applications should set the LC_ALL category once, at startup.
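- The following is a minimal sketch of setting LC_ALL with an error check. The helper name select_locale() is hypothetical and not part of any Solaris API; it assumes that falling back to the C locale is acceptable when the requested locale is unavailable.

```c
#include <locale.h>
#include <stdio.h>

/*
 * Hypothetical helper: select the locale `name` for all categories.
 * Pass "" to follow LC_ALL, the LC_* variables, and LANG.
 * Falls back to the "C" locale if the request cannot be honored.
 * Returns the name of the locale now in effect.
 */
const char *select_locale(const char *name)
{
    const char *current = setlocale(LC_ALL, name);

    if (current == NULL) {
        /* On failure setlocale() returns NULL and changes nothing. */
        fprintf(stderr, "warning: locale \"%s\" unavailable, using C\n", name);
        current = setlocale(LC_ALL, "C");
    }
    return current;
}
```

- An application would typically call select_locale("") once at the top of main(), before any other library call that depends on the locale.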
Make Software 8-bit Clean
- Programs shouldn't alter the most significant bit of a char. The computer industry used this bit for parity many years ago, but it didn't work out well--data got corrupted because software ignored the parity bit. Now standards committees have decided to define 8-bit codesets, which means you have to clean up your code now. Here are some problems to look for.
- Code that explicitly uses the most significant bit for its own purposes is said to be "dirty". There may be valid reasons for altering the most significant bit, but dirty code often involves setting and clearing private flags:
-
-
#define INVERSE 0x80 /* bad practice */
char c;
c |= INVERSE;
- Find another way to encode this information. A trick used several times in the operating system was to extend this data type to be unsigned short or unsigned int, and later set the top bit of the new data type.
- Code that assumes characters are only seven bits long is dirty. Here's an example of masking off the most significant bit on the assumption it's just the parity bit:
c = *(string+i) & 0x7F; /* bad practice */
- A useful exercise is to search your code for constants like "0x80", "0x7f", "0200", "0177", "127", and "128". These constants often highlight problematic code immediately, if such bit patterns are used in conjunction with character handling.
- Code that assumes a particular character range, such as:
-
-
if (c >= 'a' && c <= 'z')/* bad practice */
- must be corrected to:
-
-
if (islower(c))
- Use codeset-independent routines found in <ctype.h> such as isalpha(), isprint(), and so on. Software should have been using these functions all along, as they were always needed for portability to IBM's EBCDIC codeset. The SunOS system also provides wide-character equivalents such as iswalpha() and iswprint().
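- As an illustration of the wide-character routines, here is a small sketch that counts letters in a wide string with iswalpha(). The function name count_wide_letters() is an assumption made for this example. Which characters count as letters depends on the LC_CTYPE category in effect.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: count the letters in a wide-character string. */
size_t count_wide_letters(const wchar_t *ws)
{
    size_t n = 0;

    for (; *ws != L'\0'; ws++) {
        /* iswalpha() consults the current locale, not a fixed range. */
        if (iswalpha((wint_t)*ws))
            n++;
    }
    return n;
}
```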
- Fix code that assumes characters fall in the range 0-127 by extending the range of such tables:
-
-
static int hashtable[127]; /* bad practice */
- For example, the above declaration would be better coded as follows, with one entry for each character value from 0 through UCHAR_MAX:
-
-
#include <limits.h>
static int hashtable[UCHAR_MAX + 1];
-
UCHAR_MAX is defined in <limits.h> on all ANSI C conforming systems.
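- A table indexed by character value needs one slot for every value from 0 through UCHAR_MAX. The sketch below counts how often each byte occurs in a string; the names BYTE_SLOTS and count_bytes() are hypothetical, chosen only for this example.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* One slot per possible byte value: indices 0 through UCHAR_MAX. */
#define BYTE_SLOTS (UCHAR_MAX + 1)

/* Hypothetical helper: tally each byte of s into counts[]. */
void count_bytes(const char *s, size_t counts[BYTE_SLOTS])
{
    memset(counts, 0, BYTE_SLOTS * sizeof counts[0]);
    for (; *s != '\0'; s++) {
        /* Convert to unsigned char so 8-bit characters index correctly. */
        counts[(unsigned char)*s]++;
    }
}
```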
Watch for Sign Extension Problems
- One issue that is sometimes invisible to the programmer is the way the C compilers default to using signed for all fundamental data types. This can sometimes cause substantial problems in both application and library code.
- Code that casts char to other lengths may be dirty. Because the char data type is signed in SunOS, when a char variable holds an 8-bit character that has the most significant bit set, sign extension takes place during assignment. Needless to say, a negative integer might cause problems later on:
-
-
int i;
char c = 0xa0;
i = c; /* i is now negative */
- Do not pass raw characters to functions that require short, int, or long arguments. This is bad practice because of the sign extension problem. For example, the following code is incorrect, as it produces a negative integer index into the C library __ctype table. This is because the functions are actually macros that generate stubs of in-line code, which assume the argument is an integer, and propagate the sign bit accordingly.
-
-
char ch;
isascii(ch);
- The code above could be written like this:
-
-
unsigned char ch;
isascii(ch);
- Watch for the use of unadorned chars. Unfortunately they have probably been used extensively throughout most code. It is therefore a nontrivial task to change all char data to unsigned char, especially as this might garner some lint or compiler warnings.
- So,
-
-
char ch;
ch = 0xA0;
- is better written as:
-
-
unsigned char ch;
ch = 0xA0;
- On the other hand,
-
-
char *cp;
while (isspace(*cp)) {
- is written as:
-
-
char *cp;
while (isspace((unsigned char)*cp)) {
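- The two fragments above can be packaged as small functions. This is only a sketch: char_index() and skip_space() are hypothetical names, not library routines. Both convert to unsigned char before the value is widened to int, so no sign extension occurs.

```c
#include <ctype.h>

/* Hypothetical helper: table index for a character, never negative. */
int char_index(char c)
{
    return (unsigned char)c;    /* convert first, then widen to int */
}

/* Hypothetical helper: return a pointer past any leading white space. */
const char *skip_space(const char *cp)
{
    while (isspace((unsigned char)*cp))
        cp++;
    return cp;
}
```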
- Although all this may sound like a lot of work, in many cases existing code executes correctly in 8-bit mode without any changes to the code. You are primarily looking for lazy coding habits that assume ASCII is the only form of character encoding available. When you fix problems, they are usually easy to test using the Compose key of the Type-4, Type-5, PC-AT101, and PC-AT102 keyboard.
- Note that the C compiler does not support 8-bit or multi-byte characters in object names--that is, names of routines, variables, and so forth--although it does allow you to initialize 8-bit or multi-byte data in strings.
Use ctype Library Routines
- As mentioned previously, text processing software must avoid hard-coded character ranges. Upper- and lower-case letters, punctuation marks, numeric digits, and spaces should be defined using library routines under <ctype.h>, rather than with hard-coded character ranges:
-
TABLE 7-1
| Routine | Character |
| isalpha(c) | Letter |
| isupper(c) | Capital letter |
| islower(c) | Lower case letter |
| isdigit(c) | Digit from 0-9 |
| isxdigit(c) | Hexadecimal digit from 0-f |
| isalnum(c) | Alphanumeric (letter or digit) |
| isspace(c) | White space character |
| ispunct(c) | Punctuation mark |
| isprint(c) | Printable character |
| iscntrl(c) | Control character |
| isascii(c) | 7-bit character |
| isgraph(c) | Visible graphics character |
Formats
- Many different formats are employed throughout the world to represent date, time, currency, numbers, and units. These formats should not be hard-wired into your code. Instead, programs should call setlocale(), then the various locale specific format routines, leaving format design to localization work for each country or language.
Time and Date Formats
- The secret to producing time and date formats valid in many locales is the strftime() library routine. First obtain the current time by calling time(), then populate a tm structure by calling localtime(). Pass this structure to strftime(), along with a format for date and time, plus a holding buffer:
-
#include <locale.h>
#include <stdio.h>
#include <time.h>
int main(void)
{
    time_t clock;
    struct tm *tm;
    char buf[128];
    setlocale(LC_ALL, "");
    clock = time(NULL);         /* current calendar time */
    tm = localtime(&clock);     /* broken-down local time */
    if (strftime(buf, sizeof(buf), "%c", tm) > 0)
        printf("%s\n", buf);
    return 0;
}
|
- Recommended formats are %c for the local short form of date and time, or %C for the local long form. Also, %x produces the local date form (numeric), and %X yields the local time form. If you try out the program above, your results will look something like this:
-
% setenv LC_TIME de
% a.out
Mo, 16. Mär 1992, 19:19:19 Uhr PST
% setenv LC_TIME fr
% a.out
lun, 16 mar 1992, 19:19:20 PST
|
- Unfortunately many often-used combinations of date and time are missing from the standard. Neither short nor long form of the local date is available, and there is no abbreviation for time without seconds or time zone.
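- One way to work within these limits is to combine the %x and %X items yourself. The sketch below does so and also checks the return value of strftime(), which is 0 when the buffer is too small. The function name format_date_time() is an assumption made for this example.

```c
#include <stddef.h>
#include <time.h>

/*
 * Hypothetical helper: write the locale's date form followed by its
 * time form into buf. Returns 0 on success, -1 if buf is too small.
 */
int format_date_time(char *buf, size_t size, const struct tm *tm)
{
    if (size == 0 || strftime(buf, size, "%x %X", tm) == 0)
        return -1;
    return 0;
}
```

- The caller is expected to have called setlocale() first, and to obtain the tm structure from localtime() as in the program above.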
Currency and Number Formats
- Use the localeconv(3) function to obtain currency formats. It reads formatting conventions of the current locale to populate an lconv structure, then returns a pointer to the filled-in object.
- The only way to properly represent monetary amounts using the facilities of Standard C is to laboriously build a string using information extracted from an lconv structure returned by localeconv(). Fortunately, XPG4 standardizes a function analogous to strftime(), named strfmon(), whose behavior depends on the LC_MONETARY category. This program uses strfmon() to format monetary amounts.
-
#include <locale.h>
#include <monetary.h>
#include <stdio.h>
int main(void)
{
    double cost;
    char buffer[100];
    setlocale(LC_ALL, "");
    if (scanf("%lf", &cost) != 1)
        return 1;
    if (strfmon(buffer, sizeof(buffer), "%n\t%i", cost, cost) >= 0)
        printf("%s\n", buffer);
    return 0;
}
|
- As with strftime(), the formatted string is placed in a buffer. The %n format item formats the amount in the locale's national format, and %i uses the international currency code specified in ISO 4217.
-
% echo 12345.678 | env LANG=en_US a.out
$12,345.68 USD12,345.68
% echo 12345.678 | env LANG=sv a.out
12.346 kr 12.346 SEK
|
- The behavior of the %f format item for scanf() and printf() is affected by the LC_NUMERIC category. Swedish uses a comma (,) as the radix character and a period (.) as the thousands separator, so scanf() expects a comma where an English speaker would use a period. Be careful here: scanf() in the Swedish locale (or any similar locale) will stop reading upon encountering a period, just as it would stop at a comma in the C locale.
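- To find out which radix character the current locale expects, a program can ask localeconv(). This is a minimal sketch; radix_string() is a hypothetical helper name. The returned pointer refers to storage owned by the library, so use it before the next call to localeconv() or setlocale().

```c
#include <locale.h>

/*
 * Hypothetical helper: return the radix (decimal point) string of the
 * current LC_NUMERIC locale, for example "." in the C locale.
 */
const char *radix_string(void)
{
    const struct lconv *lc = localeconv();

    return lc->decimal_point;
}
```

- A program might show this string in a prompt so that users know whether to type a period or a comma.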
-
Note - The material in this section is used with permission from Creating Worldwide Software: Solaris International Developer's Guide, 2nd edition by Bill Tuthill and David A. Smallberg, published by Sun Microsystems Press/Prentice Hall. 1997.
Collation
- For string collation, sort orders may vary for different languages. Programs should use the strcoll() or strxfrm() library routine to perform string comparisons, which use locale-specific collation order.
Replace strcmp() with strcoll()
- Alphabetic ordering varies from one language to another. For example, in Spanish ñ immediately follows n, and digraphs ch and ll immediately follow c and l, respectively. In German the ligature ß is collated as if it were ss. Swedish has additional unique characters following z. Danish and Norwegian have additional characters æ, ø following z.
- The traditional library routine for comparing strings, strcmp(), remains unchanged. Because it uses ASCII order, strcmp() places "a" after "Z" even in English. This ordering is often unacceptable.
- By contrast, the new library routines strcoll() and strxfrm() can produce any sort order you want. Use strcoll() to compare strings, or strxfrm() to transform strings to ones that collate correctly.
- Fortunately strcoll() takes the same parameters and returns the same values as strcmp(). Unfortunately strcoll() does a lot more work, and is consequently slower. To speed up applications that compare strings frequently, use strxfrm() to store transformed strings into arrays that collate more efficiently.
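- The strxfrm() approach can be sketched as follows. The helper make_sort_key() is hypothetical; it assumes the caller stores the returned key next to the original string and releases it with free(). Keys produced in the same locale can then be compared with plain strcmp().

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return a newly allocated collation key for s,
 * or NULL if memory is exhausted. Comparing two keys with strcmp()
 * orders them as strcoll() would order the original strings.
 */
char *make_sort_key(const char *s)
{
    /* With a size of 0, strxfrm() only reports the length needed. */
    size_t need = strxfrm(NULL, s, 0) + 1;
    char *key = malloc(need);

    if (key != NULL)
        strxfrm(key, s, need);
    return key;
}
```

- The transformation is paid for once per string, which helps when the same strings are compared many times, as in a sort.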
- This program reads standard input, builds a binary tree in the correct order using strcoll() to compare strings, then prints out the binary tree. This code may be used for tasks such as listing files in a subwindow.
-
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct tnode {                  /* node of binary tree */
    char *line;
    int count;
    struct tnode *left, *right;
};

static struct tnode *tree(struct tnode *p, const char *line);
static void treeprint(const struct tnode *p);

int main(void)      /* collate: sort a list of lines using strcoll() */
{
    struct tnode *root = NULL;
    char line[BUFSIZ];

    (void) setlocale(LC_ALL, "");
    while (fgets(line, BUFSIZ, stdin))
        root = tree(root, line);
    treeprint(root);
    return 0;
}

/* install line at or below tree pointer */
static struct tnode *tree(struct tnode *p, const char *line)
{
    int cond;

    if (p == NULL) {
        p = malloc(sizeof(struct tnode));
        if (p == NULL || (p->line = malloc(strlen(line) + 1)) == NULL) {
            perror("malloc");
            exit(1);
        }
        strcpy(p->line, line);
        p->count = 1;
        p->left = p->right = NULL;
    }
    else if ((cond = strcoll(line, p->line)) == 0)
        p->count++;
    else if (cond < 0)
        p->left = tree(p->left, line);
    else            /* cond > 0 */
        p->right = tree(p->right, line);
    return p;
}

/* print tree recursively starting at p */
static void treeprint(const struct tnode *p)
{
    int i;

    if (p != NULL) {
        treeprint(p->left);
        for (i = 0; i < p->count; i++)
            printf("%s", p->line);
        treeprint(p->right);
    }
}
|
Messaging for Program Translation
- One of the most critical tasks in software internationalization is providing messages that can be translated easily. Messages are what users see first: help text, button labels, menu items, usage summaries, error diagnostics, and so forth.
- This chapter shows you how to write an application that produces internationalized messages. Your program consults an external catalog of messages to determine what strings to present to the user. You provide one message catalog for each locale you support, but you have only one version of the program.
- The ease of message localization can vary greatly. In a well-designed application, nontechnical people can translate message files into their native languages. In a noninternationalized application, engineers fluent in a language must translate every explicit string that will be seen by a user, then recompile the code. In an internationalized application, a lookup function retrieves any such string from a message catalog: a database of text strings that is easy to compose, translate, and access. Because the contents of a message catalog are separate from application code, text can be selected by locale at runtime without altering the code itself.
- Two similar (but incompatible) methods for international messaging in Solaris are catgets() from the XPG4 standard and gettext() from the POSIX.1b and UniForum proposals. The primary difference between them is the way that messages in the catalog are indexed: in essence, you pass catgets() a message number, but you pass gettext() a string.
- If there are two messaging schemes to choose between, which should you use? Each has its strengths and weaknesses, and adherents to argue for it. There's a lot to be said for standardization, though. X/Open considered both and chose catgets(). For maximal portability of your application to other platforms, then, we recommend that you use that scheme.
- This section presents the issues involved with messaging.
Messaging Using catgets()
- When creating internationalized applications, developers usually write text strings (error messages, text for buttons and menus, and so forth) in their native language, for later translation into other languages. Solaris lets you use any language as native.
- Here are the steps to internationalize and localize text handling:
-
- Change source code to #include <nl_types.h>, then call catopen() to open a message catalog and call catgets() to retrieve strings from the catalog.
- Extract native language text strings from the catgets() calls and store them in a source message catalog. You must assign each message a unique number that will appear in both the source catalog and any catgets() call that refers to that message.
- Translate the strings in the source message catalog into a target language.
- Transform the translated source message catalog into a binary message catalog, using the gencat(1) utility. Install the binary catalog.
Locating Message Catalogs
- After you have established the locale, you will want to open the appropriate message catalog immediately, so that any startup problems that produce error messages will do so in the proper language. Use catopen() for this:
-
#include <locale.h>
#include <nl_types.h>
nl_catd catd;
int main()
{
(void) setlocale(LC_ALL,"");
catd = catopen("demo", NL_CAT_LOCALE);
...
}
|
- The catopen() function looks for the message catalog according to these rules:
-
- The locale used is the value of LC_MESSAGES as established by setlocale(). (The only other choice for catopen()'s second argument is 0, meaning that the locale used is the value of the LANG environment variable.)
- The first argument and the NLSPATH environment variable are used to locate the catalog. (If the first argument contains /, then LC_MESSAGES and NLSPATH are ignored; instead, the first argument is the absolute path name of the catalog. You almost never want to do this.)
- The NLSPATH variable is a colon-separated list of filename patterns, for instance:
-
/usr/lib/locale/%L/LC_MESSAGES/%N.cat:/tmp/%N.%L.cat
|
- In these patterns, catopen() replaces %N with its first argument, and %L with the prevailing locale. If the locale is set to French, for example, then catopen() uses the file named /usr/lib/locale/fr/LC_MESSAGES/demo.cat if it exists. Failing that, it will try /tmp/demo.fr.cat. The first pattern in this example is the same one that catopen() uses if NLSPATH is not set. The second pattern is one a developer might use while testing an application's messaging ability.
- Although you need not name a message file after its application, this convention is recommended. It simplifies maintenance to have catopen()'s first argument be the same as the application name.
- The header <nl_types.h> defines the (integral) type nl_catd. The return value of catopen(), a catalog descriptor, should be stored in a variable of this type, since it will be passed to every catgets() call that looks up messages in the selected catalog. Because you use this variable throughout a program, declare catd globally.
- If catopen() fails, it returns (nl_catd)-1. Of course, a good application should test for this and note the error. However, you can safely pass this failure value in calls to catgets(), which will simply return the default strings you provide instead of the localized strings.
- An open catalog consumes system resources: a file descriptor and some memory for indexes into the catalog. When your program exits, these resources are automatically released. If you want to release them explicitly, pass the catalog descriptor to catclose().
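- The open, look up, and close sequence might be sketched like this. The function show_message() is hypothetical and opens the catalog on every call only to keep the example short; a real application would normally open the catalog once at startup. The sketch uses the message before calling catclose(), and it skips catgets() entirely when the catalog could not be opened, rather than relying on catgets() to accept the failure value.

```c
#include <nl_types.h>
#include <stdio.h>

/*
 * Hypothetical helper: print message msg_id of set set_id from the
 * named catalog, or the fallback text if the catalog is unavailable.
 * Returns 1 if the catalog was opened, 0 otherwise.
 */
int show_message(const char *catalog, int set_id, int msg_id,
                 const char *fallback)
{
    nl_catd catd = catopen(catalog, NL_CAT_LOCALE);
    int opened = (catd != (nl_catd)-1);
    const char *text = fallback;

    if (opened)
        text = catgets(catd, set_id, msg_id, fallback);
    fputs(text, stdout);        /* use the text before closing */

    if (opened)
        catclose(catd);         /* release descriptor and memory */
    return opened;
}
```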
Using catgets()
- To retrieve strings from a message catalog, you call catgets(), passing it the catalog descriptor returned by catopen(), an index into the catalog to select the message string, and a default string to use instead if there's a problem. The index is the most troublesome part of the catgets() interface.
- In essence, to use catgets(), you must assign a number to each message your program will produce. This requirement alone accounts for the most noticeable change in appearance between a noninternationalized and an internationalized version of a program. It can also lead to a maintenance headache if these numbers are not well managed. The only support the XPG4 messaging scheme gives you is the ability to partition your messages into sets. You may, for example, decide that the button label "Edit" is message number 37 of set number 4. How many sets you use, and what you use them for, is up to you. On some projects, each developer uses a different set number; on others, each subsystem of an application is given its own set number.
- Here is an example of how to use catgets():
-
/* Assume catd is the return value of catopen() */
printf(catgets(catd, 3, 27, "Invoice\n"));
|
- If all is well, catgets() will retrieve message number 27 of set number 3 from the message catalog referred to by catd, returning a char * value pointing to the message. If there is no message 27 in set 3, or if there is no set 3, or if catd is -1, then catgets() returns its last argument, the default string. The intent is that message 27 of set 3 in the catalog is a translation of "Invoice\n"; if the translation is unavailable, the program will use the English "Invoice\n", since that's better than nothing.
- Although not true for Solaris, on some platforms catgets() returns a pointer to storage that may be overwritten on each call. This implies that for maximal portability, use or copy the value returned by one call of catgets() before you call it again:
-
char buffer[100];
char *p, *q;
/*
* This is not portable:
*/
printf("%s %s", catgets(catd, 1, 1, "Name"),
catgets(catd, 1, 2, "Age"));
/*
* This is not portable either:
*/
p = catgets(catd, 1, 1, "Name");
q = catgets(catd, 1, 2, "Age");
printf("%s %s", p, q);
/*
* This is portable, provided buffer is big enough:
*/
strcpy(buffer, catgets(catd, 1, 1, "Name"));
printf("%s %s", buffer, catgets(catd, 1, 2, "Age"));
|
Create the Source Message Catalog
- Once you know what your messages are, create a source message catalog for your native language. Suppose the following program fragment shows all the messages some program will produce:
-
printf(catgets(catd, 1, 1, "Hello"));
printf(catgets(catd, 3, 4, "Age: %d\n"), age);
makeButton(catgets(catd, 1, 4, "Quit"));
|
- XPG4 specifies a format for source message catalogs. For this program, here is a possible English source message catalog:
-
$ This line starts with "$ ", so it is a comment
$ We will use " as a delimiter for strings
$quote "
$ Notice that message numbers need not be in a contiguous range
$set 1
1 "Hello"
4 "Quit"
$ Notice that set numbers need not be in a contiguous range
$set 3
4 "Age: %d\n"
|
- After each $set line, list the messages in that set in increasing order of message number. The set groups themselves must also be in ascending order of set number. The header <limits.h> defines NL_SETMAX, the maximum set number allowed; NL_MSGMAX, the maximum message number; and NL_TEXTMAX, the maximum number of bytes in a message text. The gencat(1) manual page specifies the syntax of a source message catalog.
- Notice that the English message texts in the source catalog are the same as the default strings in the catgets() calls in the program. This is almost always the case, of course: if the English message catalog could not be located, then the default messages would be the same as if the catalog had been successfully opened.
- Whoever will be translating the messages in your catalog will probably not know the context in which those messages will appear. Usually, the translators will not be programmers, although you can expect that they will have some training in recognizing some common characteristics of message strings. For example, you can assume that in the following, the translators know that %s represents some string:
"%s cannot be opened."
- However, you cannot assume the translator will know that the %s above will be replaced by a file name. In some languages, this may be significant, since the word for "opened" may be translated differently, depending on whether the element that can't be opened is a file, a window, or a network connection. To enable good translations, you should include comments in your message catalogs for any strings that might cause difficulty:
-
1 "%s cannot be opened."
$ %s is a file name
2 "Read"
$ This is a past participle, not a present tense verb
|
- The genmsg(1) utility for creating source message catalogs became available in Solaris 2.6. This utility examines a source program file for calls to catgets() and builds a source message catalog from the information it finds. Here is an example:
-
% cat example.c
...
/* NOTE: %s is a file name */
printf(catgets(catd, 5, 1, "%s cannot be opened."));
/* NOTE: "Read" is a past participle, not a
present tense verb */
printf(catgets(catd, 5, 2, "Read"));
...
% genmsg -c NOTE example.c
The following file(s) have been created.
new msg file = "example.c.msg"
% cat example.c.msg
$quote "
$set 5
1 "%s cannot be opened."
/* NOTE: %s is a file name */
2 "Read"
/* NOTE: "Read" is a past participle, not a
present tense verb */
|
- Running genmsg on the program source file named example.c produced a source message catalog named example.c.msg. By specifying the -c option with an argument of our choosing (we chose the string NOTE), we caused genmsg to include comments in the catalog. If a comment in the source program contains the string we specified, that comment will appear in the message catalog after the next string extracted from a call to catgets().
- You can use genmsg to automatically number the messages within a message set. Refer to the genmsg(1) manual page for more information.
Translate the Source Message Catalog
- For each language your application will support, you must have strings in the source message catalog translated to that language. For test purposes, you could change the message texts to a made-up language. Here's an example:
-
$quote "
$set 1
1 "XxxHelloyyY"
4 "XxxQuityyY"
$set 3
4 "XxxAge: %dyyY\n"
|
- These "translations" are readable by a tester who knows only English. The translated strings are longer than the English strings to simulate translation to a language where strings may be of a different length than in English. This lets you test to be sure that tables align, that button labels won't exceed the size of the button, and so forth. Another test file could be English with all the vowels deleted, to see if layouts are affected by shorter strings.
- The genmsg(1) utility has options that cause it to automatically transform message strings as it produces a message catalog.
Generate the Binary Message Catalogs
- For each translated source catalog, generate a binary message catalog. The binary catalog is the one your application will consult at runtime. Use the XPG4 gencat utility to generate the binary catalog. If your Korean source message catalog is named demo.ko.msg, you would say:
-
% gencat demo.ko.cat demo.ko.msg
|
- The second argument is the source catalog, and the first is the binary catalog that will be created. Having successfully produced the binary catalog, you can install it in its final destination (/usr/lib/locale/ko/LC_MESSAGES/demo.cat).
- While testing your application, you may not want to install the catalog in its production location; indeed, you may not have the permissions to do so. You can leave the binary catalog wherever you like, since you can set your NLSPATH so that your application can find the catalog. Someone who knows only English and wants to test the demo application in Italian might first "translate" the English source message catalog as in the previous section, and then do the following:
-
% gencat demo.it.cat demo.it.msg
% env LANG=it NLSPATH=/tmp/%N.%L.cat demo
|
- Italian locale rules will be used for date formats, collation, and so forth. However, the messages will still be readable by the Italian-illiterate tester, since they will be in English surrounded with "Xxx" and "yyY," rather than in Italian.
- If your application does not seem to be correctly finding the translated messages, as evidenced by your seeing the default strings or the wrong translated strings, consider the following common oversights:
-
- Did you establish the locale before you called catopen()?
- Are your NLSPATH environment variable and the arguments to catopen() correct? (For example, if the first argument to catopen() is "demo.cat" and NLSPATH is ./locale/%L/LC_MESSAGES/%N.cat, then catopen() will look for demo.cat.cat.)
- Are your catgets() calls referring to the right set and message numbers? If you added, deleted, or changed message numbers in your catgets() calls but failed to revise, regenerate, and reinstall your message catalog, the numbers may be out of sync.
Messaging Using gettext()
- Where catgets() uses numbers to index message catalogs, gettext() uses strings; that is the main difference in their approaches to the messaging problem.
- The steps for text handling using gettext() are similar to those for catgets():
-
- Change source code to #include <libintl.h>, then call textdomain() to open the message catalog and call gettext() to retrieve strings from the catalog. In releases of Solaris prior to 2.6, the object program must be linked with the -lintl flag.
- Use the xgettext(1) utility to extract native language text strings from the gettext() calls and store them in a source message catalog.
- Translate the strings in the source message catalog into a target language.
- Transform the translated source message catalog into a binary message catalog, using the msgfmt(1) utility. Install the binary catalog.
Locating Message Catalogs
- Use textdomain() to open a message catalog. The pathname of gettext() message catalogs must end with locale/LC_MESSAGES/domain.mo, where locale is the current locale--the value of LC_MESSAGES as established by setlocale()--and domain is the argument you pass to textdomain().
- Unless you call bindtextdomain() to bind the domain to a different directory, the complete path is /usr/lib/locale/locale/LC_MESSAGES/domain.mo. In fact, this is where Solaris system messages for libraries and utilities that use gettext() reside.
- This program fragment opens a message catalog named /usr/lib/locale/locale/LC_MESSAGES/demo.mo:
-
#include <locale.h>
#include <libintl.h>
int main()
{
setlocale(LC_ALL,"");
textdomain("demo");
...
}
|
- Many applications do not require root permission for installation and thus cannot place their messages in /usr/lib/locale. Moreover, most applications need messages in their own directory hierarchy to simplify export across a network. So, most applications should use the Solaris routine bindtextdomain() to associate a path name with a message domain. Here's a sample invocation:
-
char *path;
#ifdef TEST
path = "/tmp";
#else
path = getenv("APPLICATIONHOME");
#endif
bindtextdomain("demo", path);
textdomain("demo");
|
- If you compile the program with TEST defined, then the catalog will be found in /tmp/locale/LC_MESSAGES/demo.mo; if TEST is undefined, the catalog will be found in $APPLICATIONHOME/locale/LC_MESSAGES/demo.mo.
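- Because getenv() returns a null pointer when a variable is not set, a slightly more defensive version of the non-TEST branch may be useful. This is a sketch only: bind_app_catalog() and its fallback_dir parameter are hypothetical, and APPLICATIONHOME is the same illustrative variable used above.

```c
#include <libintl.h>
#include <stdlib.h>

/*
 * Hypothetical helper: bind `domain` to $APPLICATIONHOME, or to
 * fallback_dir if that variable is unset or empty, then make it the
 * current text domain. Returns the directory used, or NULL on failure.
 */
const char *bind_app_catalog(const char *domain, const char *fallback_dir)
{
    const char *dir = getenv("APPLICATIONHOME");

    if (dir == NULL || dir[0] == '\0')
        dir = fallback_dir;
    if (bindtextdomain(domain, dir) == NULL)
        return NULL;
    if (textdomain(domain) == NULL)
        return NULL;
    return dir;
}
```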
Surround Strings with gettext()
- Although it is not portable, gettext() is much easier to use than catgets(). All you really have to do is go through your programs, enclosing literal strings inside gettext() calls. Here is demo.c, a short example:
-
#include <stdio.h>
#include <locale.h>
#include <libintl.h>
int main() /* demo.c */
{
(void) setlocale(LC_ALL, "");
bindtextdomain("demo", "/tmp");
textdomain("demo");
printf(gettext("Hello\n"));
printf(gettext("Goodbye\n"));
return 0;
}
|
- The first gettext() looks in the catalog /tmp/locale/LC_MESSAGES/demo.mo for the translated string corresponding to the English string "Hello\n". It returns a pointer to the translated string if it finds it; otherwise, it returns the index string "Hello\n". You compile the program with
-
% cc demo.c -o demo
% cc demo.c -o demo -lintl
|
- The first command is for Solaris 2.6 or later. The second, which adds -lintl, is for versions of Solaris prior to 2.6.
- You can partition your messages among different domains. When you call textdomain(), you establish the domain used by all calls to gettext() until you next call textdomain(). If you want to change domain for just the next call of gettext(), use dgettext() instead. This would be appropriate for a library product, as it is the best way to ensure a known domain. (A library cannot predict the order in which its routines are called, and calls that use different domains may be interleaved.) The library developer chooses the domain name.
- The following two examples retrieve the same strings but have different effects on the text domain. The first example does not change the current text domain. The second example changes the current text domain to library_error_strings, then retrieves the alternate language string of wrongbutton.
-
message = dgettext("library_error_strings", "wrongbutton");
or
textdomain("library_error_strings");
message = gettext("wrongbutton");
|
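- As a minimal sketch of the library case described above, a library might wrap dgettext() in a helper so that every lookup names its own domain. The helper name lib_message and the macro LIB_DOMAIN are hypothetical; only dgettext() comes from libintl.

```c
#include <libintl.h>

/* Hypothetical domain name chosen by the library developer. */
#define LIB_DOMAIN "library_error_strings"

/*
 * Hypothetical library routine: look up a library message without
 * changing the application's current text domain.
 */
const char *lib_message(const char *msgid)
{
    return dgettext(LIB_DOMAIN, msgid);
}
```

- A call such as lib_message("wrongbutton") leaves whatever domain the application selected with textdomain() untouched, and returns the index string itself when no catalog for the library domain can be found.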
Create the Source Message Catalog
- After writing an application, create a text domain by extracting the gettext() strings and placing them in a file together with their alternate-language equivalents.
- Once you have enclosed all user-visible strings inside gettext() wrappers, you can run the xgettext command on your C source files to create a message file. This produces a readable .po file (the portable object) for editing by translators. For test purposes, you can use xgettext's -m option to simulate a translation by adding a prefix string to each message.
-
% xgettext -m TRNSLT: demo.c
% cat messages.po
domain "demo"
msgid "Hello\n"
msgstr "TRNSLT:Hello\n"
msgid "Goodbye\n"
msgstr "TRNSLT:Goodbye\n"
|
- The domain "domainname" line states that all following target strings until another domain directive belong to the domainname domain. Each msgid line contains the index string passed to gettext() and is followed by a msgstr line containing the translated string. The manual page for msgfmt(1) specifies the syntax of the .po file.
- If you anticipate translators having difficulty translating a message, comment it, using lines starting with #. An effective way to do this is to place comments for the translator into your application source code, then use the -c tag option of xgettext(1) to place these comments into the .po file.
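- A source comment intended for translators might look like the following sketch. The tag TRANSLATORS: is an assumption; any tag works as long as the same tag is passed to the -c option of xgettext. The function name format_open_error is hypothetical.

```c
#include <stdio.h>
#include <libintl.h>

/* Hypothetical helper: format an error message about one file. */
int format_open_error(char *buf, size_t size, const char *path)
{
    /* TRANSLATORS: %s is a file name; do not translate it. */
    return snprintf(buf, size, gettext("Cannot open %s for writing\n"), path);
}
```

- The comment sits directly above the gettext() call so that it is associated with that message when the .po file is generated.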
Create the Binary Message Catalog
- Run msgfmt on the .po source file to produce a binary .mo file (the message object), which should be installed under the LC_MESSAGES directory. Here's a sample interaction on demo.c:
-
% msgfmt demo.po
% su
Password:
# mv demo.mo /usr/lib/locale/test/LC_MESSAGES
|
Problem Areas
Don't Overdo Messaging
- You should not blindly wrap every string literal in your program in a call to catgets() or gettext(). In general, you only need to message those strings that users see. Do not message strings containing system commands or file names, such as "sort" or "/dev/tty". Be careful when messaging strings inside sprintf(), which is often used to build up path names or command lines. You probably don't need to message strings used only for debugging. Because integers and decimal numbers are not strings, they don't need messaging, either.
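- The distinction can be sketched as follows: a command line built with snprintf() is left alone, while the status text shown to the user is messaged. The function name build_sort_strings and the message text are hypothetical.

```c
#include <stdio.h>
#include <libintl.h>

/*
 * Hypothetical helper: build a sort command line and a user-visible
 * status message for the same file.
 */
void build_sort_strings(char *cmd, size_t cmdsize,
                        char *status, size_t statussize,
                        const char *path)
{
    /* Command line: read by the shell, never by the user. Not messaged. */
    snprintf(cmd, cmdsize, "sort -o %s %s", path, path);

    /* Status line: read by the user. Messaged. */
    snprintf(status, statussize, gettext("Sorting %s...\n"), path);
}
```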
Be Aware of Programming Language Restrictions
- Not every context allows you to replace a string literal with a call to a function. Converting the noninternationalized declaration
-
static char *greeting = "Hello";
|
- to
-
static char *greeting = catgets(catd,1,1,"Hello");
|
- produces an illegal C declaration. One way to fix it is:
-
static char *greeting;
int main()
{
    /* establish locale and open catalog, and then: */
    greeting = catgets(catd,1,1,"Hello");
    /* ... rest of program ... */
    return 0;
}
|
- If this were a C++ program instead of a C program, the declaration with initialization would be legal. However, you must control the order of initialization of static objects so that greeting is not initialized until after the locale has been established and the message catalog opened.
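- Another possible approach in C, sketched here under assumptions, is to keep only constants in the file-scope declaration and to call catgets() at the point of use. The struct msg_ref type and the msg_text function are hypothetical names, not part of any Solaris interface.

```c
#include <nl_types.h>

/* Hypothetical table entry: catalog coordinates plus the default text. */
struct msg_ref {
    int set_id;
    int msg_id;
    const char *fallback;
};

/* Legal C: the initializers are constants, not function calls. */
static const struct msg_ref greeting = { 1, 1, "Hello" };

/* Resolve an entry at the point of use, after the catalog is open. */
const char *msg_text(nl_catd catd, const struct msg_ref *ref)
{
    return catgets(catd, ref->set_id, ref->msg_id, ref->fallback);
}
```

- With this layout the lookup happens on each call to msg_text(catd, &greeting), so nothing depends on the order in which static objects are initialized.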
Prepare for Variations in Text Length and Height
- If strings must be stored in an array, be sure to declare arrays large enough to hold any possible translation. Messages in German are often longer than in English; messages in Chinese may be shorter, even accounting for multibyte encoding. A good rule of thumb is that a string might double in length, although very short strings might be even longer in translation (for example, English "Edit" is German "Bearbeiten"). Use strncpy() to avoid overrunning an array:
-
strncpy(msg, catgets(catd,1,1,"Hello"), sizeof(msg) - 1);
msg[sizeof(msg) - 1] = '\0'; /* strncpy() does not terminate a truncated copy */
|
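- When the caller needs to know whether a translation did not fit, a small helper built on snprintf() can report truncation. This is a sketch only; the name copy_message is hypothetical.

```c
#include <stdio.h>

/*
 * Hypothetical helper: copy a message into a fixed-size buffer.
 * The result is always NUL-terminated when size is nonzero.
 * Returns 1 if the text was truncated (or on error), else 0.
 */
int copy_message(char *dst, size_t size, const char *src)
{
    int needed = snprintf(dst, size, "%s", src);
    return needed < 0 || (size_t)needed >= size;
}
```

- Note that truncation happens at a byte boundary, so in a multibyte locale the last character of a truncated message may be cut in half. Sizing the buffer generously remains the better defense.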
- Displayed characters in translated messages may be of different length and height than the original messages. East Asian language ideographs are usually taller and wider than Roman characters.
- Window system resource files specify height and width of elements such as panel buttons. The AppBuilder and DevGuide tools employ these facilities. In some cases, it's best to use implicit object positioning, letting the window system decide where to place things. See Chapter 9 for more details.
Avoid Compound Messages
- Creating easily translated messages is an art form that involves more than just inserting catgets() calls around strings. Remember that word order varies from language to language, so complex messages can be very difficult to translate properly. A common-sense guideline is to avoid compound messages with more than two %s parts whenever possible.
- There are two approaches to messaging: static and dynamic. Static messaging involves looking up strings in a message catalog, with no reordering taking place. Dynamic messaging also involves looking up strings in a message catalog, but those strings are reordered and assembled at runtime. International standards provide an ordering extension to printf() for implementing dynamic messaging.
- The advantage of static messaging is simplicity. Use it whenever possible. However, avoid splitting strings across two printf() statements, which makes messages difficult to translate. Remember that an ANSI/ISO C compiler concatenates adjacent string literals into one long literal:
-
/* bad */
printf(catgets(catd,1,1,"This is a very, very, very, very "));
printf(catgets(catd,1,2,"long string that I want to display"));
/* good */
printf(catgets(catd,1,1,"This is a very, very, very, very "
    "long string that I want to display"));
|
- Translation problems can arise with compound messages, especially when more than one sentence could be produced at runtime. Here is some code that would be difficult to translate:
-
/* poor practice: multipart compound message */
printf("%s: Unable to %s %d data %s%s - %s",
    func, (alloc_flg ? "allocate" : "free"),
    count, (file_flg ? "file" : "structure"),
    (count == 1 ? "" : "s"), strerror(errno));
|
- Quite apart from being poor programming practice, this fragment of code would be much clearer to the reader and much easier to translate if it were split into separate print statements inside an if-else block that would select the correct message at runtime:
-
if (alloc_flg) {
    if (file_flg)
        printf("Unable to allocate %d file\n", count);
    else
        printf("Unable to allocate %d structure\n", count);
} else {
    if (file_flg)
        printf("Unable to free %d file\n", count);
    else
        printf("Unable to free %d structure\n", count);
}
|
- The issue of making the objects plural is not addressed in this example because, in many languages, pluralization involves more than adding "s" to the end of a word.
Use Dynamic Messaging With Care
- Dynamic messaging is used when the exact content or order of a message is not known until runtime. Unless done carefully, dynamic messaging causes translation problems. If the positional dependence of keywords is hardcoded into a program, code needs to be changed before messages can be successfully translated. Obviously, this defeats the purpose of internationalization.
- XPG4 defines an extension to the printf() family that permits changing the order of parameter insertion. Solaris also supports this extension. For example, the conversion format %1$s inserts parameter one as a string, and %2$s inserts parameter two. The entire format string is parameter zero.
- Here's a small example of how these extensions can be used. This printf statement has position-dependent keywords because the verb must come before the object.
-
/* poor practice: position-dependent keywords */
printf("Unable to %s the %s.\n",
(lock_flg ? "lock" : "find"),
(type_flg ? "page" : "record"));
|
- This could produce any of four messages in English:
-
Unable to lock the page.
Unable to find the page.
Unable to lock the record.
Unable to find the record.
|
- Here are those four messages translated into German. Note that the verb ("sperren" or "finden") must follow, not precede, the object ("Seite" or "Rekord").
-
Das Programm kann die Seite nicht sperren.
Das Programm kann die Seite nicht finden.
Das Programm kann den Rekord nicht sperren.
Das Programm kann den Rekord nicht finden.
|
- German syntax requires different word order, so the program's keywords must be reversed. Here is that printf statement written for dynamic messaging:
-
printf(catgets(catd,1,1,"Unable to %s the %s.\n"),
    (lock_flg ? catgets(catd,1,2,"lock") :
        catgets(catd,1,3,"find")),
    (type_flg ? catgets(catd,1,4,"page") :
        catgets(catd,1,5,"record")));
|
- The German message catalog would then appear as follows:
-
1 "Das Programm kann %2$s nicht %1$s.\n"
2 "sperren"
3 "finden"
4 "die Seite"
5 "den Rekord"
|
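- This example might not work on other vendors' systems because of multiple catgets() calls within one expression: an implementation may return a pointer to an internal buffer that a later catgets() call overwrites.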
- Consider carefully the effects of dynamic messaging. You might have to reposition parameters during translation. Often this fact isn't recognized until translation actually begins, by which time it's already too late--the software would have to be laboriously rereleased.
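- The effect of the positional conversions can be tried without a message catalog by passing the format in directly. In this sketch the function name format_unable is hypothetical; the format strings are the English and German ones discussed above, with the English format written in numbered form.

```c
#include <stdio.h>

/*
 * Format a verb/object pair with a caller-supplied format.
 * The argument order is fixed in the code; the format decides
 * the order in which the arguments appear in the output.
 */
int format_unable(char *buf, size_t size, const char *fmt,
                  const char *verb, const char *object)
{
    return snprintf(buf, size, fmt, verb, object);
}
```

- Called with "Unable to %1$s the %2$s.\n", "lock", and "page", the function yields the English word order. Called with "Das Programm kann %2$s nicht %1$s.\n", "sperren", and "die Seite", it places the object before the verb without any change to the code. A single format string should use either numbered or unnumbered conversions throughout; mixing the two has undefined results.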
Manage Message Indices
- When you use the catgets() messaging scheme, you must ensure that you don't assign the same set number/message number combination to different messages. This can be a problem in a multiperson project. Here are some guidelines for managing the message numbers.
-
- Use a different message set number for each subsystem or for each developer. This localizes potential conflicts, making them easier to find and fix.
- Do not change a message number after it has been assigned to a message. If a message is deleted, do not reuse its number. This makes successive versions of a message catalog more consistent. Suppose that a localizer has already translated a source message catalog. If a new version of that catalog arrives for translation, much less work needs to be done if unchanged messages can be quickly identified.
- Use a tool to assign message numbers. An automated process is less likely to assign duplicate numbers than a manual one. The genmsg(1) tool that became available in Solaris 2.6 has an option that automatically numbers those messages in each set that have not already been assigned numbers.
- Appoint a central numbering authority. Making one entity responsible for managing message numbers helps ensure that consistent procedures are followed.
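- Whatever numbering policy is chosen, a simple automated check can catch collisions early. The following sketch assumes the project keeps a table of every assigned set number/message number pair; the struct msg_id type and the find_duplicate_id function are hypothetical.

```c
#include <stddef.h>

/* Hypothetical registry entry: one row per message in the project. */
struct msg_id {
    int set_id;
    int msg_id;
};

/*
 * Return the index of the first entry that repeats an earlier
 * set number/message number pair, or -1 if every pair is unique.
 */
int find_duplicate_id(const struct msg_id *ids, size_t count)
{
    size_t i, j;

    for (i = 1; i < count; i++)
        for (j = 0; j < i; j++)
            if (ids[i].set_id == ids[j].set_id &&
                ids[i].msg_id == ids[j].msg_id)
                return (int)i;
    return -1;
}
```

- The quadratic scan is adequate for a catalog of a few thousand messages; a larger project might sort the table first.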
Other Programming Languages
- The Desktop Korn Shell, dtksh, in CDE has built-in catopen, catgets, and catclose commands. Here is an example:
-
catopen CATD demo
catgets msg1 $CATD 3 7 'Hello there'
catgets - $CATD 3 7 'Hello there'
catclose $CATD
|
- Using the LANG and NLSPATH environment variables and the name demo (which will be substituted for %N in NLSPATH), catopen opens the message catalog and sets CATD to the catalog ID. The calls to catgets look for message 7 of set 3, returning Hello there if it can't find it. The message is stored in the variable msg1 in the first call and written to standard output in the second. The catclose command releases the resources acquired by catopen.
- Solaris provides a gettext(1) command to retrieve translated messages from a catalog for use in shell programming. This command reads the TEXTDOMAIN environment variable for the domain name and the TEXTDOMAINDIR environment variable for the path name to the message database.
Summary
- To internationalize and localize text handling in an application, follow these steps:
-
- Decide whether you will use the standard catgets() scheme or the nonstandard gettext() scheme.
- Open the message catalog after establishing the locale.
- Call catgets() or gettext() to retrieve strings from the catalog.
- Extract native language text strings to form the source message catalog. Comment those strings that may cause translation difficulty.
- Translate the strings in the source message catalog into a target language.
- Transform each translated source message catalog into a binary message catalog.
- Install the binary message catalogs when you install the application.
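- The runtime part of these steps (establish the locale, open the catalog, retrieve a string) can be sketched for the catgets() scheme as follows. The function name fetch_greeting, the catalog name demo, and the set and message numbers are assumptions for illustration.

```c
#include <locale.h>
#include <nl_types.h>
#include <stdio.h>

/*
 * Hypothetical helper: copy message 1 of set 1 from the "demo"
 * catalog into buf, falling back to the default text.
 */
void fetch_greeting(char *buf, size_t size)
{
    nl_catd catd;

    (void) setlocale(LC_ALL, "");           /* establish the locale first */
    catd = catopen("demo", NL_CAT_LOCALE);  /* then open the catalog */

    /* catgets() returns the default string if the lookup fails. */
    snprintf(buf, size, "%s", catgets(catd, 1, 1, "Hello\n"));

    if (catd != (nl_catd) -1)
        (void) catclose(catd);
}
```

- The message is copied into the caller's buffer before the catalog is closed, so the caller never holds a pointer into a closed catalog.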
-
Note - The material in this section is used with permission from Creating Worldwide Software: Solaris International Developer's Guide, 2nd edition by Bill Tuthill and David A. Smallberg, published by Sun Microsystems Press/Prentice Hall. 1997.
|
|