Developing NLMs for an International Market


KELLY BRADFORD
Software Consultant
Novell Consulting Services

01 May 1995

This DevNote provides some additional background material to the Novell BrainShare 1995 presentation "Internationalizing your NLM." It will help you to start designing your applications to work with the Novell Internationalization strategy.

A subsequent DevNote will provide an example NLM to illustrate the concepts discussed here.

Introduction
Language Enabling
Making the User Interface Localizable
Locale Format Enabling
Character Encoding Scheme Enabling
Hardware Enabling

Introduction

On February 1, 1995, Novell announced the availability of NetWare 4.1 in French, Italian, German and Spanish. Besides the languages announced, in the coming months Novell will also provide NetWare 4 in Portuguese, Japanese, Korean and Chinese (traditional and simplified). Customers worldwide can now access and manage NetWare 4.1 using their preferred language, improving productivity for both users and administrators.

With the localized versions of NetWare 4.1, both users and administrators can quickly and easily select the language in which they use NetWare utilities, help screens, messages and online documentation. For small businesses, where the function of network administration is often an additional job responsibility, the ability for administrators to work in their native language can significantly improve office productivity. For large companies with offices and personnel spanning different countries and languages, NetWare 4.1 provides a single network that can be used and managed equally well worldwide.

The Case for Software Internationalization

Many customers have a multilingual working environment where the employees prefer to use software in their various native languages. For example, a company in Montreal is likely to have both French and English-speaking network administrators. Even though the administrators may understand both languages, they would prefer to work in their native language. These linguistic preferences have important cultural and political implications. In some areas, such as Quebec, a multilingual environment is even a legal requirement.

Novell currently receives about 50% of its revenue from markets outside the U.S. The fastest-growing market continues to be international. About 70% of the international revenue comes from Europe, but Japan is the fastest growing market. Novell's goal is to have 70% of its total revenue come from the international market by 1998. Third parties can join in expanding this market by making sure their NetWare applications are compatible with Novell's internationalization strategy.

Simply stated, the international market is big and it's growing. The days are gone (or soon will be) when the international market would simply accept an English version of a product. Customers will prefer products developed for their own native language and culture.

Internationalization has become a requirement for those developing software applications for the global market. This article describes tools and procedures that Novell has developed to help developers internationalize existing NetWare applications and create new ones.

Language Enabling

Language enabling deals with issues surrounding human language expression in software. An important part of language enabling is the ability to retrieve your program's messages in any one of a number of languages, and to do so at run time.

The first rule of language enabling is this: isolate all translatable data from your software. Isolating translatable data from the software allows you to follow the Novell model of having language components that can be sold separately. It also ensures that localization of a software product is a translation effort, not an engineering effort. The easier you make things for the translator, the easier you make your own job as an engineer, because you get fewer questions and problems during localization.

Figure 1: Translatable data should be isolated in a user interface module.

Many User Interface Management Systems (UIMSes), such as C-Worthy, Microsoft Windows, OS/2 PM, and Macintosh, provide for isolation of translatable data. Novell has also developed a string resource manager called MsgTools that makes it easy for NLM and text-based utility developers to isolate their translatable data. MsgTools are available as part of the NetWare SDK.

The data in a program that should be isolated and translated is often called a resource. Various localizable resources are listed in the table below.

Resource	Description
Text strings	Informational messages, ABENDs, error messages, help text, etc.
Panels	Some UIMS have support for defining interaction panels (also sometimes called dialogs, windows, and screens). An example of a panel in SYSCON would be the login restrictions window. C-Worthy programmers must define this panel programmatically by composing a number of screens. A similar panel in Windows would be defined as a single resource, called a dialog template.
Program control	Most UIMS support shortcut commands (called accelerators in Windows and P.M.) for experienced users. For example, a word processor may use CTRL+B to specify the beginning of bold text. These accelerators are often mnemonically based, and should therefore be translated.
Graphics	Icons and bitmaps are also frequently culture-specific and must be "translated."

Text Strings

In the graphical user interface (GUI) environments, future versions of NetWare will take advantage of the architecture of the host environment, and put strings in resource-only DLLs. For all other environments, MsgTools will extract the strings from the source code and put them into a separate language module.

In any environment, the developer hands over the extracted strings to a translator. Annotating the strings is important before you give them to a translator. The more information you give the translator up front, the fewer headaches you will have later. Here is the sort of information that a translator needs:

Length Limits. Almost all messages will be longer in their translated form than they were in English. You should generally enable your software to handle variable-length messages, but if some messages must not exceed a certain length in translation, you must tell the translator.

Translatability. Messages often contain keywords or product names that should remain the same in translation. You should point these out to the translator. Sometimes quoted strings that do not contain user-viewable data get extracted from your source code along with the genuine message strings. These strings should also be marked for the translator, or, better yet, not sent to translation at all.

Format. Some messages are padded to begin or end in a certain column on the screen. If format is important for a message, you need to tell the translator how to format the translation.

Variables. Variables, such as %d and %s, must be annotated so that the translator knows if they are proper names, etc. You must also number each variable so that the translator can indicate how they are to be reordered - see "Allow Variables to be Reordered," below.

Context. When you extract messages from the source code, you also extract them from their context. It may be obvious when you are running a program, that "Restore Menu," for example, is the header on a menu, but out of context, the translator could just as easily interpret it as a command ("Restore the Menu"), and the resulting translation could be quite different. Do not be stingy about supplying general context notes for the translator; it will make the translator's job much easier, and the translation will be much more accurate.

Panels

A panel is a window that can contain many different interactive controls. Many UIMSes, especially those with graphical user interfaces, have panel support. All panels should obviously be isolated from the code. The translator will be required to use the panel editor that comes with the UIMS.

When there is no panel editor, the translator must have some contextual information to understand what strings compose the panels used in the application. Someone on the development team must create that contextual information.

Graphics

One might assume that icons and other graphic images are universal; however, this is definitely not the case. The following icons are a few that have varying meanings in different locales:

The folder icon, depicted as a file folder in the U.S., has little meaning in Europe, where documents are not stored in file folders.
Check marks are used in the U.S. to mark desired items; in Japan, the check mark indicates something that is bad or should be discarded.
To most middle-class, suburban Americans a policeman or a person in uniform implies protection and security; to users in many developing countries, the same icon could elicit feelings of corruption and brutality.
In the U.S., the owl is a symbol of wisdom; in Vietnam, owls are a symbol of stubborn stupidity.

Localization requires "translating" at least some of the icons in a program. This means that icons must be isolated and coupled like text strings. It also means that you will have to find translators who understand the cultural implications of iconic images. Good software translators should also be able to use the available resource editors for a particular GUI environment.

Coupling Code and Resources

Isolating your resources allows you to enable your software to run in many languages. To localize your software for a particular language, you must couple your localized resources with the executable code, a process called coupling or binding. Possible coupling points are as follows:

Compile/Assemble. Because compilation and assembly require engineering resources, compilers and assemblers should not be used to couple resources to executable code. Doing so makes the logistics of the localization process much more difficult (and many additional headaches will come back to the engineers).

Link. For the same reasons, linkers should not be used to couple resources to code. Therefore, do not use compilers or linkers to couple resources and code.

Static coupling. A utility either appends the resources to the executable image, or inserts the resources into a reserved area of the program. It is used in the following environments:

The Windows and PM resource compilers both append resources to an EXE or DLL.
The OS/2 Message Tool can either store strings in a separate message file or append them to the EXE.
On the Macintosh, there is no explicit coupling, but every program file has a resource fork.

Static coupling has many of the disadvantages of compiling or linking the resources and is therefore not recommended as a standalone coupling method.

Novell will use static coupling in conjunction with load-time coupling, however (see below). In this case, statically coupled (or linked-in) resources become the default in case external resource files cannot be found at load time.

Load. The program finds its resources and loads them into memory during program initialization. Load-time coupling facilitates multilingual support. There can be a different resource subdirectory for each language. The loading uses an algorithm to find the resources by checking a profile or an environment variable, and gets the resources from the appropriate subdirectory.

Load-time coupling also allows for flexible packaging. When customers want to use an additional language, they can simply order a language update, which contains only resources and no code.

Load-time coupling is used at Novell because it is the most flexible approach.
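The load-time lookup described above can be sketched roughly as follows. This is only an illustration: the NLM_LANGUAGE environment variable, the ENGLISH default, and the base/language/file directory layout are hypothetical names chosen for the example, not part of any Novell interface.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical names: the NLM_LANGUAGE variable and the
   <base>/<language>/<file> layout are assumptions for illustration. */
static const char *choose_language(void)
{
    const char *lang = getenv("NLM_LANGUAGE");
    return (lang != NULL && lang[0] != '\0') ? lang : "ENGLISH";
}

/* Builds the path of a resource file for the given language.
   Returns 0 on success, -1 if the path does not fit in out. */
int resource_path(char *out, size_t size, const char *base,
                  const char *lang, const char *file)
{
    int n = snprintf(out, size, "%s/%s/%s", base, lang, file);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Opens the external resource file. A NULL result tells the caller to
   fall back on the resources statically coupled to the executable. */
FILE *open_resources(const char *base, const char *file)
{
    char path[260];

    if (resource_path(path, sizeof path, base, choose_language(), file) != 0)
        return NULL;
    return fopen(path, "rb");
}
```

Because the language is chosen only when the resources are opened, adding a language means adding a subdirectory of resource files; the executable itself does not change.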

Run. Every time the program needs a resource, it loads it in from a resource file. Conceivably, a user could change the language at any moment during program execution. This has the disadvantage of causing disk operations during inconvenient (or impossible) moments. There does not seem to be a great demand for this approach.

Making the User Interface Localizable

Although isolating message text is a major part of the job, you must also write your software so that it works properly in other locales. Most likely you will not be available to help those who translate your program into various languages; therefore you must ensure that your program's user interface does not have characteristics that will hinder the localization process. We will examine some of the major considerations below.

Allow for Expansion of Strings

Not only is it imperative that the source code not contain hard-coded strings, but the length of those strings should not be hard-coded, either. When strings are translated, they will likely be longer than the original. Here are some possible translations:

Note: C = Number of characters

English	C	Spanish	C	German	C	French	C
authorized user list	20	lista de usuarios autorizados	29	Liste der berechtigten Benutzer	31	liste d'utilisateurs autorisés	30
backup	6	salvaguardia	12	Sicherung	9	sauvegarde	10
file	4	fichero	7	Datei	5	fichier	7
path	4	camino	6	Pfad	4	chemin	6
user	4	usuario	7	Benutzer	8	utilisateur	11

Your program must handle text string expansion due to translation. Every time your program displays a text string, you must ensure there is enough space for the string in memory and on the screen. You can allocate the needed space at run time based on the translated string length, or you can build in enough extra space from the start. You should also make sure that menus, tables, lists, forms, and so on, aren't too big for the screen when their text is expanded due to translation.

To estimate the space a given text string will need when translated into the various languages, multiply its English length by the expansion factor specified in the following table. These expansion factors have been derived from analysis of previous NetWare translations, and account for 95% of the cases. The remaining 5% of the cases required more expansion space.

Characters in English string	Expansion factor
1 - 5	Always allow for 15 characters
6 - 25	2.2
26 - 40	1.9
41 - 70	1.7
71 +	1.5
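As a rough illustration, the table above can be turned into a small helper for sizing buffers and screen fields. The function name is hypothetical; the factors are the ones listed in the table, held in tenths so that the arithmetic stays in integers and rounds up.

```c
#include <stddef.h>

/* Returns the number of characters to reserve for the translation of an
   English string of english_len characters, using the factors in the
   table above. Factors are held in tenths and the result is rounded up. */
size_t translated_length(size_t english_len)
{
    size_t tenths;

    if (english_len == 0)
        return 0;
    if (english_len <= 5)
        return 15;              /* short strings: fixed allowance */
    if (english_len <= 25)
        tenths = 22;
    else if (english_len <= 40)
        tenths = 19;
    else if (english_len <= 70)
        tenths = 17;
    else
        tenths = 15;
    return (english_len * tenths + 9) / 10;
}
```

The result does not include a string terminator, and since the factors cover about 95% of cases, code that uses it should still cope with a translation that turns out longer than the estimate.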

Locale Format Enabling

Locale enabling is building into the software the ability to adapt on the fly to use the various locale-dependent formats for date, time, number and currency representation, as well as monocasing and collation sequencing.

Besides translation, your program must be able to format text in a locale-sensitive way, meeting the requirements of the user's locale. Locale format enabling must account for the following:

Date format
Time format
Number format (thousands separators, decimal indicators, negative signs)
Currency format
Monocasing
Collation sequence

Existing Methods

Many systems give users and administrators a way to set locale preferences. They also provide a programming interface through which applications can read those preferences. The following table presents an overview.

OS	Programming Interface	User Interface	Comments
DOS	DOS function calls 0x38 (2.1 and later) and 0x65 (3.1 and later).	Add COUNTRY= statement to CONFIG.SYS file.	You cannot set the date, time, and number formats separately. You generally have to reboot to change the preferences.
Windows	lstrcmp, lstrupr, GetProfileString, GetProfileInt	"International" dialog from control panel	User initially states country, and all preferences (date, time, etc.) are set to that country's default. The user can also change individual preferences.
OS/2 PM		Control Panel	Get locale preferences from SYSTEM.INI. The values stored here are set by the control panel.
OS/2	DOSGetCtryInfo	COUNTRY= statement in CONFIG.SYS.	Not only can locale preferences be set from the control panel, but it is also possible to set them in a manner similar to DOS by modifying CONFIG.SYS. Even if your program runs in text mode, you should get the locale info from SYSTEM.INI.

The NetWare operating system provides locale-specific text-formatting information in the LCONFIG.SYS file for use by NLMs. The NetWare internationalization services for character collation, comparison, date/time and numeric formatting make use of this information to achieve locale sensitivity. This is also true of the NWSNUT default character-compare and date/time services. Operations that you do by means of these routines should behave properly with respect to locale sensitivity.

Note: The LCONFIG.SYS information for Chinese, Japanese, and Korean locales currently doesn't provide a sort order for double-byte characters. However, unless it is critical that your program sort double-byte characters, you should still use the service routines mentioned above. This way, if it is decided in the future that NetWare will support double-byte character sorting, it can be accomplished by extending LCONFIG.SYS and modifying the above-mentioned service routines.

Monocasing

Monocasing is a term used to describe the conversion of a string with lower case letters to uppercase. Most engineers learn in college that the way to convert to uppercase is a simple arithmetic operation:

if ('a' <= c && c <= 'z')
      c -= 0x20;

This works for a-z, but many languages use alphabetic letters with diacritic marks (such as é) and many of these letters have uppercase equivalents. In addition, this method is totally inadequate for double-byte characters.

Another potential pitfall is the use of 0x7F to mask the high-order bit when evaluating characters. Programmers often base this practice on the assumption that the evaluated characters are either lower ASCII or NetWare augmented wildcards (wildcards with the high bit set). The problem is that localized and user-entered strings can contain numerous characters whose high bit is set, including single-byte accented Roman letters, single-byte Japanese phonetic characters, and a vast number of double-byte (Asian) characters. Make sure you use high-bit masking only in contexts where the high bit is truly insignificant to your task.

The correct solution is to use monocasing tables for each locale. Some environments such as NetWare allow for this. For example, the NetWare function NWLstrupr does conversions using the locale's monocasing table.
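To make the idea of a monocasing table concrete, here is a minimal sketch of a table-driven conversion for single-byte characters. It does not use NWLstrupr, whose signature is not reproduced in this article; the function names are hypothetical, and only three accented pairs from code page 850 are filled in to keep the example short.

```c
#include <stddef.h>

static unsigned char upper_table[256];

/* Fills the table for one code page. Only a few code page 850 pairs are
   listed to keep the sketch short; a real table would come from the
   locale data and cover every cased letter. */
void init_upper_table(void)
{
    int i;

    for (i = 0; i < 256; i++)
        upper_table[i] = (unsigned char)i;      /* default: unchanged */
    for (i = 'a'; i <= 'z'; i++)
        upper_table[i] = (unsigned char)(i - 'a' + 'A');
    upper_table[0x82] = 0x90;   /* e-acute  -> E-acute  */
    upper_table[0x81] = 0x9A;   /* u-umlaut -> U-umlaut */
    upper_table[0xA4] = 0xA5;   /* n-tilde  -> N-tilde  */
}

/* Uppercases a single-byte string in place by table lookup. */
void table_strupr(unsigned char *s)
{
    for (; *s != '\0'; s++)
        *s = upper_table[*s];
}
```

Supporting another locale then means loading a different table rather than changing code. This sketch handles single-byte strings only; it makes no attempt to skip double-byte characters.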

These factors also directly affect case insensitive string compares.

Note: By NetWare design decision, double-byte characters should never be targeted for case conversion, even if the characters represent Roman (English) letters. Historically, most case-conversion routines don't check whether a given byte is part of a double-byte character. You must ensure, however, that your case-conversion routines skip over double-byte characters. For how to detect double-byte characters, see "Handling Double-byte Characters," later in this document.

Collation

Collation is the process of arranging lists of words according to some predetermined sequence. This process is commonly known as alphabetization in languages with alphabetic writing systems, such as English or Spanish. Like monocasing, collation is locale dependent. For example, in Danish the ligature Æ comes after Z. For languages with nonalphabetic writing systems, such as Japanese, collation is extremely complicated. Collation tables are needed for each locale, as well. The NetWare function for collation is NWLStrColl.
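A collation table can be pictured as a weight assigned to every character, with strings compared by weight rather than by byte value. The sketch below is only an illustration of that idea and is not how the NetWare collation service is implemented: the function names are hypothetical, the weights are invented for the example, and the Danish ligature is assumed to sit at 0x91 (lowercase) and 0x92 (uppercase) as in code page 850.

```c
static unsigned short weight[256];

/* Builds an illustrative weight table: letters compare without regard
   to case, and the AE ligature is placed immediately after Z. */
void init_danish_weights(void)
{
    int i;

    for (i = 0; i < 256; i++)
        weight[i] = (unsigned short)i;          /* non-letters: byte order */
    for (i = 0; i < 26; i++)
        weight['A' + i] = weight['a' + i] = (unsigned short)(256 + i);
    weight[0x92] = weight[0x91] = 256 + 26;     /* AE ligature after Z */
}

/* Compares two single-byte strings by weight. Returns a negative value,
   zero, or a positive value, in the manner of strcmp. */
int table_strcoll(const unsigned char *a, const unsigned char *b)
{
    while (*a != '\0' && weight[*a] == weight[*b]) {
        a++;
        b++;
    }
    return (int)weight[*a] - (int)weight[*b];
}
```

With this table, "apple" sorts before "Zebra", which a plain byte comparison would get wrong because every uppercase letter has a lower code point than every lowercase letter.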

Date and Time Formats

There are a number of details concerning date format that should be customizable for a locale. The date format differs from locale to locale in date order and in separators used. You must not make assumptions about these details when implementing your software. In fact, the level of support can be quite detailed.

The following table shows some of the date considerations:

	Short format	Long format	Order
U.S.	05/21/95	May 21, 1995	MDY
Spain	21/05/95	21 Mayo 1995	DMY
Japan	1995-05-21	5 05 21	YMD
Germany	21.05.95	21 May 1995	DMY

There are also some options for time formatting. In the U.S., the 12-hour clock is most common; many other countries typically use the 24-hour clock. Some locales use colons to separate hours and minutes and seconds; others use periods or commas.

Country	Time Format
U.S.	3:05:45 pm
France	15:05:45
Sweden	15.05.45
Switzerland	150545

An example of clever, but unacceptable programming is the following sprintf statement to display time (in this case "8:00 am"):

sprintf(buf, "%d:%02d %cm", 8, 0, 'a');

There are two invalid assumptions here. First, the statement assumes the time separator is a colon. Second, it assumes a twelve-hour clock. In NetWare the correct solution is to use NWLstrftime. This function formats the time and date according to the locale format.
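The exact NWLstrftime prototype is not reproduced here, so the following sketch uses the standard C strftime function to show the same idea: the program supplies the time values and lets the locale decide the separators and the 12- or 24-hour convention. The wrapper name is hypothetical.

```c
#include <stddef.h>
#include <time.h>

/* Formats a time of day using the current C locale's time
   representation (%X). Returns the number of characters written, or 0
   if the buffer is too small. */
size_t format_time(char *out, size_t size, int hour, int min, int sec)
{
    struct tm t = {0};

    t.tm_hour = hour;
    t.tm_min = min;
    t.tm_sec = sec;
    return strftime(out, size, "%X", &t);
}
```

A program picks up the user's conventions by calling setlocale(LC_TIME, "") once at startup. Until it does, the default "C" locale applies and %X produces the hh:mm:ss form.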

Locale's Numeric Formatting

The display of numeric data also changes according to locale. The characters used for separating thousands and the decimal separator character are locale dependent. The format for displaying negative numbers is also locale dependent.

The following table shows a few different preferences:

Country	Positive	Negative
Finland	100 000,00	100 000,00-
France	100 000,00	-100 000,00
Italy	100.000,00	-100.000,00
Japan	100,000.00	-100,000.00
U.S.	100,000.00	-100,000.00

In NetWare, use NWstrnum. This function formats a number for a specific locale.
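To show what such a formatting routine has to account for, here is a minimal sketch that formats a value from a small locale record. It is not the NetWare routine: the structure, its field names, and the function name are hypothetical, and a real program would obtain the separators from the system's locale services rather than hard-coding them.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical locale record for illustration only. */
struct num_format {
    char thousands_sep;   /* for example ',' '.' or ' ' */
    char decimal_sep;     /* for example '.' or ',' */
    int  trailing_minus;  /* nonzero: minus sign follows the digits */
};

/* Formats a value given in hundredths (12345678 means 123456.78).
   Returns 0 on success, -1 if out is too small. */
int format_number(char *out, size_t size, long hundredths,
                  const struct num_format *f)
{
    char digits[32], body[64];
    int negative = hundredths < 0;
    unsigned long v = negative ? 0UL - (unsigned long)hundredths
                               : (unsigned long)hundredths;
    int n, i, j = 0;

    n = sprintf(digits, "%lu", v / 100);
    for (i = 0; i < n; i++) {
        body[j++] = digits[i];
        /* insert a separator when a multiple of three digits remains */
        if ((n - 1 - i) % 3 == 0 && i != n - 1)
            body[j++] = f->thousands_sep;
    }
    body[j] = '\0';

    n = snprintf(out, size, "%s%s%c%02lu%s",
                 (negative && !f->trailing_minus) ? "-" : "",
                 body, f->decimal_sep, v % 100,
                 (negative && f->trailing_minus) ? "-" : "");
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

With a record of { '.', ',', 0 } the value 10000000 is rendered in the Italian style from the table, and with { ' ', ',', 1 } a negative value takes the Finnish trailing minus sign.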

Locale's Monetary Formatting

As with numeric formatting, you need to show monetary values according to the locale preference. The characters used for currency, thousands and decimal separation are locale dependent, as is the format for displaying negative amounts. The following table shows a few different preferences:

Country	Currency	Positive	Negative	International Format
Finland	Markka	100 000,00Mk	Mk 100 000,00- Mk-100 000,00	FIM 100 000,00
France	French Franc	100 000,00F	-100 000,00F	FRF 100 000,00
Italy	Lira	L. 100.000,00	-L. 100.000,00	ITL. 100.000,00
Japan	Yen	¥ 100,000	-¥100,000	JPY 100,000
U.S.	Dollar	$100,000.00	-$100,000.00	USD 100,000.00

NetWare has two monetary APIs: NWstrmoney and NWstrImoney. NWstrmoney returns the locale-sensitive money format for a numerical value. NWstrImoney obtains the country prefix and money format (international format) for a numerical value.

Character Encoding Scheme Enabling

Character encoding scheme enabling is where written human language meets machine. Language and locale symbols and ideographs are quantified and represented as numbers (code) that the computer can process. Character encoding falls into two areas:

Single byte character set (SBCS)
Double byte character set (DBCS)

DBCS could be better described as "multi-byte enabling," because these languages can have both single and double byte character sets.

Gone are the days when you could use 7-bit ASCII to represent all the characters you would ever have to use in a computer. Even the more recent 8-bit (single-byte) character sets are insufficient to handle the scripts used by many of the world's languages. If you want to handle Asian scripts, you must handle double-byte character sets (DBCS), as well.

There are a number of different ways of representing language in written form. These methods can be categorized as follows:

Alphabetic Scripts

Alphabetic scripts use phonetic symbols called letters to represent vowel and consonant sounds. Alphabets have relatively few symbols, thus allowing them to be represented by single-byte character encoding schemes. For example, the English alphabet only has 26 different symbols.

Many alphabets also have special marks called diacritics, which are placed above or below certain letters of an alphabet. Diacritics generally change the sound of the letter they modify or indicate intonation patterns. Common diacritics are the acute accent (´), the grave accent (`) and cedilla (¸) in French; the tilde (~) in Spanish; and the dieresis or umlaut (¨) in German.

Examples of common alphabets include the following:

Latin. Used by most European languages such as English, German, French, and Swedish. Also used by many other languages, such as Polish, Vietnamese, Tagalog, Turkish, and Swahili.

Cyrillic. Used by Russian and other Slavic languages such as Ukrainian and Byelorussian. Also used by non-Slavic languages spoken in the area of the former U.S.S.R., such as Azerbaijani and Kazakh.

Greek. Used by Greek.

Arabic. Used by Arabic and other languages in Moslem areas, such as Persian and Urdu. Vowels are represented by optional diacritics.

Hebrew. Used by Hebrew and related languages, such as Yiddish. Like Arabic, vowels are represented by optional diacritics.

Chinese. Chinese is an ideographic language. Each ideograph, or character, represents an entire concept or word. Consequently, there are many thousands of characters in Chinese. Obviously the 256 slots possible in a single-byte character encoding scheme are not nearly enough to represent Chinese. Chinese requires a double-byte character set (DBCS) to represent all the characters.

There are two major versions of Chinese ideographs: traditional and simplified. The People's Republic of China has simplified many ideographs by reducing the number of strokes in the characters, and also has reduced the total number of official ideographs. The Republic of China still uses traditional ideographs and opposes simplification.

Chinese can be written vertically, but modern Chinese is written horizontally, either left-to-right or right-to-left.

Many Chinese characters are also used in Japanese and Korean. The characters are written the same and have the same meaning, but they are pronounced differently in each of the three languages.

Japanese. In addition to Chinese characters, called Kanji in Japanese, the Japanese language uses three additional writing systems: Hiragana, Katakana and Romaji.

Hiragana and Katakana are syllabaries (each symbol represents a syllable) and were developed by Japanese scholars to represent linguistic features of Japanese that couldn't be represented with ideographic characters. Hiragana is used for particles, inflections, and some Japanese words. Katakana is used for foreign words adopted into Japanese.

Romaji is the Japanese word for our Latin alphabetic characters. Romaji is commonly used for acronyms and company names.

Typical Japanese text employs all four systems: Kanji, Hiragana, Katakana, and Romaji. It is not uncommon to find all four systems on a single page. In machine-readable documents, it is not uncommon to mix single-byte and double-byte characters.

Japanese is preferably written vertically, but computer displays and technical documentation are written horizontally, left-to-right. Computer printers do print characters vertically.

Korean. In addition to Chinese characters, Korean uses a syllabic script called Hangeul or Hangul. These two writing systems are often used together in the same text. Korean can be written vertically or horizontally.

Code Pages

IBM has devised the Graphic Character Global Identification System (GCGIS), which provides a system for referring to the myriad IBM encoding schemes and the different characters used in those schemes. The two most important concepts to understand are Code Point and Code Page. A code point is the assignment of a number to a character. For example, on the PC, the letter "A" has a code point of 0x41. A code page is a collection of code points for a given character set.

IBM initially created hundreds of code pages by delegating the definition of these code pages to their country representatives. IBM is now trying to consolidate by retiring many code pages and keeping those that meet the needs of a group of locales, instead of one code page for each locale.

For more details on all the terms and concepts, as well as details on the most important code pages, refer to the National Language Support Reference Manual, Volume 2, published by the IBM National Language Technical Center.

Following are just a few of the IBM code pages:

437	PC U.S.
850	PC International
852	PC Eastern Europe
500	EBCDIC International

PC Code Pages

Most engineers refer to the characters on the PC as ASCII characters. The truth is that PCs don't use ASCII - they use IBM code pages. Characters 0x20-0x7E happen to be the same as ASCII. None of the other characters have anything to do with ASCII at all.

When the IBM PC was developed, a prevalent character-encoding scheme was the ANSI 7-bit American National Standard Code for Information Interchange (ASCII). (The other was EBCDIC, IBM's usual preference for encoding characters.) The eighth bit of ASCII was reserved for parity. The IBM PC developers decided to use the eighth bit for character encoding, instead of parity. The additional 128 characters made available by using the parity bit were used for characters with diacritic markers and for box composition characters.

In addition, the range 0x01-0x1F, reserved for control characters in ASCII, was used for smiley faces and other dingbats. The result was IBM code page 437, still used by most PCs in the U.S.

Many engineers erroneously refer to the PC characters in the range of 0x80-0xFF as "extended ASCII." These characters are not related at all to ASCII, and they aren't "extended." PCs have always had 8-bit characters!

IBM soon discovered that it had not put in all of the special characters that it needed for its international markets, so it created additional code pages for locales that needed more characters, such as Portugal and Iceland. Later, an additional code page, CP850, was created to handle Western Europe (including Iceland and Portugal).

The only constant in PC code pages is that characters in the range 0x20-0x7E will remain the same. Any characters not in that range can change. This means that if you use line draw characters, you must not assume the characters you need have fixed numeric assignments. You must use loadable tables that have the values for the current code page. If you don't, you will find that your software will draw Katakana characters in Japan, instead of line draw characters. Don't assume characters are assigned to specific code points.

ASCII

In addition to the seven-bit standard, ANSI has also created an eight-bit ASCII standard. Eight-bit ASCII is not used by DOS or OS/2, but it is used by Microsoft Windows.

When programming in Windows, there are cases where it is necessary to translate from ASCII to the PC code page. For example, if you get a file name from DOS, you should translate the name to ASCII before displaying it in Windows. Windows provides two functions to translate back and forth between ASCII and the PC code page: OemToAnsi and AnsiToOem.

Double-byte Considerations

Since there are thousands of ideographs, the CJK (Chinese, Japanese, Korean) countries obviously need more than 256 possible characters, which means that characters cannot be encoded with a single byte. Current industry practice is to assign two bytes to each character, which allows for 64K possibilities. This is known as a Double-byte Character Set (DBCS).

In actual practice, a single string may contain both double- and single-byte characters, and your enabled software must be able to handle it. This is known as Multi-byte Character Set (MBCS). There are at least two ways to accomplish MBCS: shift states and ranges.

Shift States. IBM EBCDIC systems (AS/400, S/370 family, Series 1, etc) use shift states to handle MBCS. Two single-byte values are reserved for signaling these states: 0x0E signals the beginning of a double-byte state, and 0x0F signals the return to a single-byte state.

The only other consideration for this type of string is that you must assume a state for the first character in the string: is it a single-byte character, or the first byte of a double-byte character? This depends on the string type. If the string is defined as "Alphanumeric-Kanji either" (ANK either), the first byte of the string is single-byte. If it is "Ideographic Character either" (IGC either), the first byte is the first byte of a double-byte character.
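The shift-state scheme can be sketched as a small state machine. The function below is a hypothetical illustration that counts the characters in such a string; the start_double parameter stands for the initial state implied by the string type, and the sketch assumes that the two shift codes never occur as part of a double-byte character.

```c
#include <stddef.h>

#define SHIFT_OUT 0x0E   /* switch to the double-byte state */
#define SHIFT_IN  0x0F   /* return to the single-byte state */

/* Counts the characters in a shift-state string of len bytes.
   start_double is nonzero when the string type says that the string
   begins in the double-byte state. The shift codes themselves are not
   counted as characters. */
size_t shift_state_char_count(const unsigned char *s, size_t len,
                              int start_double)
{
    int in_double = start_double;
    size_t i = 0, count = 0;

    while (i < len) {
        if (s[i] == SHIFT_OUT) {
            in_double = 1;
            i++;
        } else if (s[i] == SHIFT_IN) {
            in_double = 0;
            i++;
        } else if (in_double) {
            count++;
            i += 2;             /* both bytes form one character */
        } else {
            count++;
            i++;
        }
    }
    return count;
}
```

Note that the meaning of any byte depends on every shift code that precedes it, so a string of this kind can only be interpreted by reading it from the beginning.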

Ranges. The method used on PCs is to reserve ranges of values for the first byte of a double-byte character. If a byte falls within the range, you are guaranteed that it is the first byte of a double-byte character. This approach is used on DOS, NetWare, Windows, and OS/2. There are a number of different ranges for different countries. These ranges are called DBCS Vectors.

Country	Range
Japan	0x81-0x9F, 0xE0-0xFC
Korea	0x81-0xBF
Simplified Chinese	0x81-0xFC
Traditional Chinese	0x81-0xFC

There are no set rules about what value the second byte of the double-byte character can be, except that it can't be 0. This means that null-terminated strings are still valid.
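A lead-byte test against a DBCS vector might look like the following sketch. The type DBCSRange and the functions IsLeadByteIn and CharWidth are hypothetical names introduced for illustration; a real NLM would obtain the ranges from the operating system rather than hard-code them, and only the Japanese ranges from the table above are shown.

```c
/* One lead-byte range, inclusive at both ends (hypothetical type). */
typedef struct {
    unsigned char low;
    unsigned char high;
} DBCSRange;

/* Lead-byte ranges for Japan, copied from the table above. */
static const DBCSRange japanVector[] = { { 0x81, 0x9F }, { 0xE0, 0xFC } };

/* Return 1 if b falls inside any range of the vector, 0 otherwise. */
int IsLeadByteIn(const DBCSRange *vector, int count, unsigned char b)
{
    int i;

    for (i = 0; i < count; i++) {
        if (b >= vector[i].low && b <= vector[i].high)
            return 1;
    }
    return 0;
}

/*
 * Width in bytes of the character that starts at s:
 * 0 at the end of the string, 2 for a lead byte followed by a
 * non-null byte, 1 otherwise.
 */
int CharWidth(const DBCSRange *vector, int count, const unsigned char *s)
{
    if (*s == 0)
        return 0;
    return (IsLeadByteIn(vector, count, *s) && s[1] != 0) ? 2 : 1;
}
```

The check on s[1] relies on the rule that the second byte can never be 0, so a lead byte immediately before the terminator is treated as a damaged single byte rather than stepping past the end of the string.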

Obviously, there are certain coding practices that must be changed to handle multi-byte strings correctly. The main thing to keep in mind is that you must treat both bytes of a double-byte character as a single character - you can't split double-byte characters. This applies to displaying the characters, as well as to searching double-byte strings.
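Truncation is a typical place where a double-byte character gets split. The sketch below finds the longest prefix that fits in a byte limit without cutting a character in half. The names IsJapaneseLeadByte and TruncateLength are assumptions for the example, and the Japanese lead-byte ranges are hard-coded only to keep it self-contained.

```c
#include <stddef.h>

/* Hypothetical lead-byte test using the Japanese ranges. */
static int IsJapaneseLeadByte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/*
 * Return the largest byte count that is <= maxBytes and ends on
 * a character boundary of the null-terminated string s.
 */
size_t TruncateLength(const unsigned char *s, size_t maxBytes)
{
    size_t i = 0;

    while (s[i] != 0) {
        /* a lead byte followed by a non-null byte is a 2-byte character */
        size_t width = (IsJapaneseLeadByte(s[i]) && s[i + 1] != 0) ? 2 : 1;

        if (i + width > maxBytes)
            break;              /* the whole character does not fit */
        i += width;
    }
    return i;
}
```

For the bytes 'A', 0x83, 0x5C, 'B' and a limit of two bytes, the function returns 1: the double-byte character is dropped whole instead of leaving a dangling lead byte.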

One common programming practice is to search backwards through strings containing a path name to find the path delimiter 0x5C, which is represented by a backslash in European code pages. This won't work if there are double-byte characters in the string. You can't tell by looking at the value of the single byte if you are looking at a single-byte character, or the second byte of a double-byte character. In Japanese, the code point 0x5C (displayed as a yen sign, ¥), which is the path delimiter if used as a single-byte character, can also occur as the second byte of a double-byte character.
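One way to replace the backward search is to scan forward, character by character, and remember the last delimiter seen. The following is a minimal sketch under that assumption; IsJapaneseLeadByte and FindLastDelimiter are hypothetical names, and the lead-byte ranges are hard-coded for Japanese only.

```c
#include <stddef.h>

/* Hypothetical lead-byte test using the Japanese ranges. */
static int IsJapaneseLeadByte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/*
 * Return a pointer to the last single-byte 0x5C in path,
 * or NULL if the path contains no delimiter.
 */
const unsigned char *FindLastDelimiter(const unsigned char *path)
{
    const unsigned char *last = NULL;
    const unsigned char *p = path;

    while (*p) {
        if (IsJapaneseLeadByte(*p) && p[1] != 0) {
            p += 2;             /* skip both bytes of a double-byte character */
        } else {
            if (*p == 0x5C)
                last = p;       /* a genuine single-byte delimiter */
            p++;
        }
    }
    return last;
}
```

Given the bytes 'C', ':', 0x5C, 0x83, 0x5C, the function reports the delimiter at offset 2. A byte-wise backward search would stop at offset 4, which is the second byte of the double-byte character 0x835C.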

Novell's locale APIs, found in NWLOCALE.H, contain functions to handle mixed-byte string manipulation. The APIs include NWCharType, NWLmblen, NWNextChar, NWPrevChar, and others.

Shift-JIS. Japanese Industrial Standards (JIS) has defined a double-byte encoding scheme for Japanese characters with two levels of support: a basic required level, and an optional second level. Together the two levels allow for more than 6000 characters. Most computers in Japan support both levels. These encodings allow for the kanas, Kanji, Latin, Cyrillic, and Greek characters.

JIS was designed with 7-bit ASCII in mind: each byte of the double-byte JIS characters is restricted to a 7-bit encoding, and the characters 0x00-0x20 and 0x7F are not allowed. This means that the lowest valid JIS character is 0x2121; the highest possible character is 0x7E7E.

When Microsoft implemented double-byte support for DOS, it created the range approach, which has already been explained. For Japan, Microsoft mapped the JIS characters to the ranges of valid double-byte characters. Because the JIS characters have been "shifted" in the double-byte table, this approach is called Shift-JIS. The term has nothing to do with the SO/SI approach used on mainframes. Shift-JIS character 0x8140 corresponds to JIS 0x2121, 0x8141 to JIS 0x2122, and so on.
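The "shift" can be expressed arithmetically. The sketch below uses the commonly published JIS-to-Shift-JIS conversion, in which two JIS rows share one Shift-JIS lead byte; it is not taken from Novell or Microsoft source, and the function name JisToShiftJis is an assumption for the example.

```c
/*
 * Convert a two-byte JIS code (0x2121-0x7E7E) to Shift-JIS.
 * Returns 0 if either byte is outside the range 0x21-0x7E.
 */
unsigned short JisToShiftJis(unsigned short jis)
{
    unsigned char j1 = (unsigned char)(jis >> 8);
    unsigned char j2 = (unsigned char)(jis & 0xFF);
    unsigned char s1, s2;

    if (j1 < 0x21 || j1 > 0x7E || j2 < 0x21 || j2 > 0x7E)
        return 0;

    /* Two JIS rows map to one lead byte: 0x81-0x9F, then 0xE0 upward. */
    s1 = (unsigned char)(((j1 + 1) >> 1) + (j1 < 0x5F ? 0x70 : 0xB0));

    if (j1 & 1) {
        /* odd row: trail bytes 0x40-0x9E, stepping over 0x7F */
        s2 = (unsigned char)(j2 + (j2 < 0x60 ? 0x1F : 0x20));
    } else {
        /* even row: trail bytes 0x9F-0xFC */
        s2 = (unsigned char)(j2 + 0x7E);
    }

    return (unsigned short)((s1 << 8) | s2);
}
```

With this arithmetic JIS 0x2121 becomes 0x8140, and every resulting lead byte lands inside the Japanese ranges 0x81-0x9F and 0xE0-0xFC.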

Unicode. Unicode is the work of a consortium of key computer companies: Novell, IBM, Digital, Microsoft, Apple, Lotus, NeXT, GO, Xerox, Sun Microsystems, and others. Unicode's goal is to create a universal, simple way of encoding characters. In Unicode, each character is encoded using sixteen bits, no matter what the character. This makes it possible to have more than 64,000 different characters. In fact, the current 16-bit Unicode standard is a subset of the 32-bit Unicode standard under development.

Among the scripts handled in Unicode are Latin, Greek, Cyrillic, Georgian, Tibetan, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Thai, Lao, Chinese, Japanese, Arabic, and Hebrew.

Unicode has two key advantages: universality and simplicity. It has universality because of the large number of different writing systems which can be represented in Unicode. It has simplicity because each character is represented as a sixteen-bit entity.

Parsing strings in Unicode is much simpler than in multi-byte character strings, because every character has the same width; Unicode strings are as easy to handle as single-byte strings, except they have the additional advantage of allowing for a wide range of characters. There also is no need to determine the active code page.

The ranges of possible character assignments in the 64k space have been partitioned as follows:

0000	Alphabets, syllabaries, and phonetic systems
2000	Punctuation, math symbols, operators, dingbats
3000	Chinese bopomofo, Japanese kana, Korean Hangeul
4000	Unified character codes for Chinese/Japanese/Korean ideographs; future use
F000	User-defined area
FE00	Compatibility

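For compatibility, the first 256 characters (0000-00FF) are the same as ISO 8859-1 (Latin-1), the 8-bit extension of ASCII on which the Windows ANSI character set is based. This means that current Windows programs can translate most text to Unicode arithmetically, by extending each byte to sixteen bits.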
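The arithmetic translation amounts to widening each byte. The sketch below assumes the 8-bit input really uses the character set that coincides with Unicode 0000-00FF; text in a PC code page must first be converted. The type unicode_char and the function WidenToUnicode are hypothetical names used only for this example.

```c
#include <stddef.h>

typedef unsigned short unicode_char;   /* hypothetical 16-bit character type */

/*
 * Widen a null-terminated 8-bit string into dst, which has room
 * for dstCount 16-bit elements including the terminator.
 * Returns the number of characters written, not counting the terminator.
 */
size_t WidenToUnicode(const unsigned char *src, unicode_char *dst, size_t dstCount)
{
    size_t i = 0;

    if (dstCount == 0)
        return 0;

    while (src[i] != 0 && i + 1 < dstCount) {
        dst[i] = (unicode_char)src[i];   /* zero-extend: 0xC0 becomes 0x00C0 */
        i++;
    }
    dst[i] = 0;
    return i;
}
```

The reverse direction is not symmetrical: a Unicode string can only be narrowed this way if every character is below 0x0100.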

One assumption of Unicode is that character encoding and representation are two different concepts. Unicode only deals with encoding; representation is the responsibility of the UIMS that display Unicode strings. Information about the font and character size is usually not dealt with in Unicode. For compatibility with existing encoding schemes, there are some cases where this assumption is violated.

Another feature of Unicode is character composition. Rather than encode each possible character/diacritic combination, it is possible to compose an accented character as a sixteen-bit base character code followed by a sixteen-bit code for a non-spacing diacritic character. For example, the code for À (A with grave accent) would be

A	`
0x0041	0x0300

For compatibility with existing standards, there are also a number of precomposed characters. So À could also be coded with the single code point:

0x00C0
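Because the same character can arrive in either form, code that compares or searches Unicode strings may need to map one form to the other. The following sketch shows the idea with a deliberately tiny lookup table; the names unicode_char, Composition, and ComposePair are assumptions for illustration, and a real table would cover far more pairs.

```c
#include <stddef.h>

typedef unsigned short unicode_char;   /* hypothetical 16-bit character type */

/* One base + non-spacing mark pair and its precomposed equivalent. */
typedef struct {
    unicode_char base;
    unicode_char mark;
    unicode_char precomposed;
} Composition;

/* Illustrative entries only. */
static const Composition compositions[] = {
    { 0x0041, 0x0300, 0x00C0 },   /* A + grave accent */
    { 0x0045, 0x0300, 0x00C8 },   /* E + grave accent */
    { 0x0061, 0x0300, 0x00E0 }    /* a + grave accent */
};

/*
 * Return the precomposed code for base followed by mark,
 * or 0 if the pair is not in the table.
 */
unicode_char ComposePair(unicode_char base, unicode_char mark)
{
    size_t i;

    for (i = 0; i < sizeof(compositions) / sizeof(compositions[0]); i++) {
        if (compositions[i].base == base && compositions[i].mark == mark)
            return compositions[i].precomposed;
    }
    return 0;
}
```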

Unicode has also provided support for ideographic characters by assigning a single code to characters that are shared between languages. For example, the character for "sun" is the same in Japan and China. In most computer implementations, this character would have different code points in the Japanese and Chinese implementations. In Unicode, however, it only has one.

Novell, like many other companies, is in the process of implementing Unicode in its products. Unicode has been incorporated into NetWare Directory Services and will be further implemented into other areas of NetWare as Unicode becomes more widely accepted. Novell provides two functions to translate between Unicode and PC code pages: NWUnicodeToLocal and NWLocalToUnicode.

Handling Double-Byte Characters

As stated above, Chinese, Japanese, and Korean text contains a mixture of single-byte and double-byte characters. Therefore, your program's text handling operations must be able to detect the presence of double-byte characters when those languages are active. Here's the engineering procedure:

Use "Double-Byte Aware" String-Handling Routines. For character-level string operations like parsing, searching, comparing, wrapping, truncating, and so on, you must use routines that are sensitive to the presence of double-byte characters. These routines query the NetWare operating system to determine whether a double-byte character set is currently being used. If it is, they determine the range of character codes reserved for the leading byte of double-byte characters, and use that information to detect double-byte characters in strings.

The NWSNUT input, display, and editing functions meet this requirement, although they require that you specify string lengths in bytes, not characters. The NetWare locale functions defined in NWLOCALE.H are also double-byte aware. Here's an example that illustrates the use of NWLstrcspn and NWCharType. The task is to strip carriage return and line feed characters from a string.

LONG DisplayStringCopy(      /* routine to strip CR LF from string */
    void *sourceAddress,
    void *destinationAddress,
    LONG numberOfBytes)
{
 LONG indexCR, minIndex, len, offset, offset1, remaining, i;
 char copy[255];
 char charset[3] = { 13, 10, '\0' };    /* CR, LF */

 /* the working copy and its terminator must fit in copy[] */
 if ( numberOfBytes >= sizeof(copy) )
   numberOfBytes = sizeof(copy) - 1;

 CMovB( (BYTE *)sourceAddress, copy, numberOfBytes );
 copy[ numberOfBytes ] = 0;

 len = numberOfBytes;    /* bytes left once CR and LF are removed */
 offset = 0;             /* read position in copy */
 offset1 = 0;            /* write position in destination */
 i = 0;

 while ( i < numberOfBytes ) {
   /* find first occurrence of CR or LF */
   indexCR = NWLstrcspn( (char *)copy+offset, charset );
   if ( indexCR == (numberOfBytes - offset) ) {
     /* none left: copy the remainder and finish */
     remaining = numberOfBytes - offset;
     CMovB( (char *)copy+offset,
            (char *)destinationAddress + offset1,
            remaining );
     i += remaining;
     offset += remaining;
     offset1 += remaining;
   }
   else {
     minIndex = indexCR;
     /* copy everything up to CR or LF */
     CMovB( (char *)copy+offset,
            (char *)destinationAddress + offset1,
            minIndex );
     offset1 += minIndex;
     /* skip over CR or LF; indexCR is relative to copy+offset */
     if ( NWCharType( copy[offset + indexCR] ) == NWDOUBLE_BYTE ) {
       minIndex += 2;
       len -= 2;
     }
     else {
       minIndex++;
       len--;
     }
     i += minIndex;
     offset += minIndex;
   }
 }
 return( len );
}

Note: By design, NetWare doesn't support double-byte characters in server names, volume names, Y/N responses, or drive letters. Server names include names of file servers, print servers, mail servers, and any other servers that broadcast a service on the network. In these contexts, you can assume that all characters are single-byte.

Eliminate Single-byte Specific Programming Techniques. Not only should you use the double-byte aware routines mentioned above, but you should also use double-byte aware algorithms when performing character-level string operations "manually" (without the use of a library call).

Note: For the purposes of isolating double-byte bugs, it is better to use the above-mentioned routines as much as possible, and avoid writing your own double-byte aware code. If CLIB doesn't provide all the functions you need, try to get it extended before you resort to writing your own code.

When "manually" processing a string that might contain double-byte characters, always start by checking whether the first byte of the string falls within the lead-byte range for double-byte characters. (Any other byte in the string could be the trailing byte of a double-byte character.) Once you have tested the first byte, you can advance one or two bytes (depending on the result) to get to the next character. In this manner, you can safely proceed from character to character until you find the target character.

For example, the following code illustrates how to find a path delimiter 0x5C in a mixed single- and double-byte character string:

extern int IsDBCSLeadByte(unsigned char); /* 1 = DBCS lead byte
                                             0 = Otherwise      */

unsigned char *FindBackSlash2(unsigned char *szString)
{
  unsigned char *ch;

  ch = szString;
  while (*ch && *ch != 0x5C)
  {
      /* skip the trailing byte too, unless the string
         ends right after a lone lead byte */
      if (IsDBCSLeadByte(*ch) && *(ch+1))
          ch++;
      ch++;
  }
  return ch;
}

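On the other hand, suppose you scan for the path delimiter 0x5C without checking against the lead-byte range and then, only after finding 0x5C, check whether the preceding byte is in the lead-byte range. That algorithm is faulty. The preceding byte may itself be the trailing byte of a double-byte character whose value happens to fall within the lead-byte range, in which case a genuine delimiter is skipped. The check also reads the byte before the start of the string when 0x5C is the first byte. The faulty version is shown here: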

/* Wrong way to do it! */

extern int IsDBCSLeadByte(unsigned char); /* 1 = DBCS lead byte
                                             0 = Otherwise      */

unsigned char *FindBackSlash1(unsigned char *szString)
{
   unsigned char *ch;

   ch = szString;
   while (*ch)
   {
      if (*ch == 0x5C)
      {
          if (!IsDBCSLeadByte(*(ch-1)))
              break;
      }
      ch++;
   }
   return ch;
}

Hardware Enabling

The five different standard hardware architectures used in Japan present special challenges for software internationalization. The document Guidelines for Enabling Software for Japan (to be printed in the next issue of DevNotes) discusses what you need to do to enable your software to run on different hardware architectures.

Isolate the Hardware Interface

If your program will be used on personal computers with different hardware architectures, you should isolate your program's hardware access operations. Here's the engineering procedure:

Use NetWare Calls Whenever Possible

The easiest way to isolate hardware dependencies is to always access the hardware indirectly through a generic software interface, such as NetWare. For example, rather than directly updating video memory, you should make a NetWare call to do it. Hardware access operations done in this way are portable across all the personal-computer architectures supported by NetWare.

Isolate into Separate Modules when Appropriate

If your program is a device driver, make sure you isolate the hardware-specific portion of your code into a separate NLM.

* Originally published in Novell AppNotes
