Localization and Internationalization

Now the whole earth had one language and few words. And as men migrated from the east, they found a plain in the land of Shinar and settled there. And they said to one another, "Come, let us make bricks, and burn them thoroughly." And they had brick for stone, and bitumen for mortar. Then they said, "Come, let us build ourselves a city, and a tower with its top in the heavens, and let us make a name for ourselves, lest we be scattered abroad upon the face of the whole earth."

And the Lord came down to see the city and the tower, which the sons of men had built. And the Lord said, "Behold, they are one people, and they have all one language; and this is only the beginning of what they will do; and nothing that they propose to do will now be impossible for them. Come, let us go down, and there confuse their language, that they may not understand one another's speech"...

-Genesis

Introduction

The software industry started in the US, and software that accepts only English input and produces only English output has traditionally been forced upon users worldwide. This practice rests on a question developers often ask: "Why can't they all speak English?" However, given two software products that offer the same features, most people would choose the one in their native language. Nowadays software development centers are also found in countries such as India, Australia, Mexico and Israel. As software is developed in different parts of the world, successful companies consider it critical that their products interact with users in their native language and follow local conventions. Monolingual and monocultural software products are not competitive.

Let us illustrate the problem of monolingual programming practice with an example. The program fragment below checks whether a user's input is valid.

    char c;
    /* Get user input into c */
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')) {
        /* Accept the input */
    } else {
        /* Error handling */
    }

What is the problem with this code? It is written on the assumption that users will enter English text, whose letters all lie between A and Z. The code does not work for Danish: the Danish alphabet has three more letters after Z (Æ, Ø and Å), so valid input is rejected, which will frustrate Danish users. Now imagine what would happen if a Chinese user tried to enter data.
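The difference can be sketched in Java, the language used for the later examples in this article. The class and method names below are hypothetical and chosen only for illustration; Character.isLetter is the standard library call that consults Unicode character properties instead of a fixed A to Z range.

```java
class LetterCheck {
    // The ASCII-only test from the fragment above, kept for comparison.
    static boolean isAsciiLetter(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    // Unicode-aware test: true for letters of any script.
    static boolean isLetter(int codePoint) {
        return Character.isLetter(codePoint);
    }

    public static void main(String[] args) {
        // q, Danish ae, o-slash, a-ring, a Chinese character, and a digit
        char[] samples = {'q', '\u00E6', '\u00F8', '\u00E5', '\u4E2D', '7'};
        for (char c : samples) {
            System.out.println(c + "  ascii-only=" + isAsciiLetter(c)
                    + "  unicode-aware=" + isLetter(c));
        }
    }
}
```

Only the first sample passes the ASCII-only test, while the Unicode-aware test accepts every sample except the digit.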

Converting an American software package into a multi-lingual product or "Americanizing" foreign software entails more than translation. A properly localized software package allows the local user to exploit the software's power to do exactly the same things that the original software does for the original user, but according to the local user's own rules and conventions. The local user will not be distracted by imprecise or ambiguous features that can result from inadequate attention to cultural and linguistic differences, nor from the nature of the software and engineering constraints on equipment.

Around the world, local conventions exist for number formatting, currency, dates, times, names, addresses, measurements and calendars. Besides local conventions, software developers must consider cultural differences in the meaning of numbers and colors. In many Asian countries white is the color associated with death, whereas in Western countries it is black, and in parts of Latin America it is purple. In the US, 13 is considered unlucky, 69 has sexual connotations, and 666 is the sign of the devil. In Hong Kong the number 7 is considered unlucky, while in India some people consider 7 a lucky number.

Even English is not the same across the world. See the following examples:

    US English                      UK English
    Aluminum                        Aluminium 
    Center                          Centre
    Internationalization            Internationalisation 
    Flavor                          Flavour
    Tire                            Tyre
    Elevator                        Lift
    Hood                            Bonnet
    Mutual Fund                     Unit Trust
    Pavement                        Road
    Sidewalk                        Pavement
    Trunk                           Boot

Mere translation of the strings will not solve the problem. See the following examples.

Here are signs from hotels around the world showing what can happen if you don't have knowledgeable translators:

 Please leave your values at the desk - France
 You are invited to take advantage of the chambermaid - Japan
 Ladies are requested not to have children at the bar - Norway

Customized Localization

Developing a separate version of the same software for each language or culture is known as customized localization. Many companies actually do this, and it is very expensive to maintain. Every bug fix or new feature has to be replicated in all versions, and each version then has to be tested independently. This quickly becomes a maintenance nightmare.

Internationalization

Internationalization is the alternative to customized localization: the software is designed so that it can support the world's languages and conventions, and the same version can be sold anywhere in the world.

Internationalization is commonly known as I18N (pronounced "eye-eighteen-en") in the software industry: the word begins with I, ends with N, and has 18 characters in between.

  "Internationalization is not a feature" - Tom McFarland, HP

Users usually won't ask for internationalized software, but they expect the software to follow their local conventions correctly.

Gettext

Gettext is a tool used for internationalization at run time.
With gettext, internationalization is achieved through the following phases:

1. Preparing source code for internationalization
2. Extraction process
3. Translation Process
4. Compilation of translation
5. Retrieval of translation
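Phases 1 and 5 can be sketched in Java to keep one language across this article. This is not the gettext API: MiniCatalog and tr are hypothetical names, and the in-memory map merely stands in for a compiled translation catalog. In a real gettext workflow the extraction and compilation phases are carried out by the xgettext and msgfmt tools. The one behavior the sketch borrows from gettext is that a message without a translation is returned unchanged.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a compiled translation catalog (not the gettext API).
class MiniCatalog {
    private final Map<String, String> translations = new HashMap<>();

    // Stands in for phases 3 and 4: a translator supplies text for a message id.
    void add(String msgid, String msgstr) {
        translations.put(msgid, msgstr);
    }

    // Phase 5: look the message up; fall back to the original string if untranslated.
    String tr(String msgid) {
        return translations.getOrDefault(msgid, msgid);
    }

    public static void main(String[] args) {
        MiniCatalog malayalam = new MiniCatalog();
        malayalam.add("Hello world", "ലോകമേ നമസ്കാരം");

        // Phase 1: every user-visible string in the source is wrapped in tr(...).
        System.out.println(malayalam.tr("Hello world"));
        System.out.println(malayalam.tr("Goodbye")); // no translation yet, so the original is shown
    }
}
```

Because untranslated messages fall back to the source string, a partially translated program stays usable while translation is in progress.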

Localization

The process of adjusting internationalized software to a particular locale is called localization. A common acronym for this term is L10N (there are 10 characters between the L and the N). You can think of software internationalization as a prerequisite for localization. The product is first localized for the market where the software is developed. Through localization, you create versions of the software without modifying the source code or the binary. This is achieved by externalizing text messages and locale-specific images or bitmaps. People working on localization change these texts and images according to their local language and conventions. A very simple example in Java is given below (technical details are skipped for simplicity).

  public class HelloWorld{
    public static void main(String args[]){
       System.out.println("Hello world");
    }
  }

Output: Hello world

In the above code, "Hello world" is hard-coded. We need to externalize that string in order to localize the application. Now see the internationalized code:

 import java.util.*;
 public class HelloWorld{
    public static void main(String args[]){
       ResourceBundle resources;
       try{
          // Loads messages.properties, or the variant that matches the
          // default locale, such as messages_ml.properties
          resources=ResourceBundle.getBundle("messages");
          System.out.println(resources.getString("Hello world"));
       }catch(MissingResourceException e){
          // error handling code..
       }
    }
 }

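The resource bundle file messages_ml.properties (ml is the ISO 639 language code for Malayalam) contains the following entry. The space in the key is escaped with a backslash, because an unescaped space ends the key in a Java properties file:

Hello\ world=ലോകമേ നമസ്കാരം

When the program runs under a Malayalam default locale, the output changes to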

ലോകമേ നമസ്കാരം
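The lookup can be tried without creating any files. The sketch below parses properties text from a string using PropertyResourceBundle, a standard class; the class name InMemoryBundleDemo and its constants are hypothetical. It also shows why a key that contains a space has to be written with a backslash before the space in a properties file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

class InMemoryBundleDemo {
    static final String TRANSLATION = "ലോകമേ നമസ്കാരം";
    // Text that would normally live in messages_ml.properties (space in the key escaped).
    static final String ESCAPED = "Hello\\ world=" + TRANSLATION + "\n";
    // The same entry without the escape: the key ends at the first space.
    static final String UNESCAPED = "Hello world=" + TRANSLATION + "\n";

    // Parse properties-format text into a resource bundle.
    static ResourceBundle parse(String propertiesText) {
        try {
            return new PropertyResourceBundle(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ResourceBundle good = parse(ESCAPED);
        System.out.println(good.getString("Hello world"));

        ResourceBundle broken = parse(UNESCAPED);
        System.out.println("has key 'Hello world': " + broken.containsKey("Hello world"));
        System.out.println("has key 'Hello': " + broken.containsKey("Hello"));
    }
}
```

Many projects avoid the escaping question altogether by using short symbolic keys, for example greeting.hello, instead of the English text.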

Globalization

The term globalization (G11N) is often used synonymously with internationalization, but it usually encompasses both internationalization and localization. It is a process that involves design, implementation and localization.

The Importance of Localization

Currently, people who want to use computers must first learn English. In a country with low literacy rates, this blocks access to information and communications technologies (ICTs), especially for the rural poor and women who do not have equal access to education. Even after having learnt English, users must pay hundreds of dollars to license foreign software, or resort to widespread illegal copying of software, in order to gain access to ICTs. In short, access to information technology is one of the keys to development, and localized FOSS applications remain a crucial missing link in communications infrastructure.

Localization brings the following benefits:

Significantly reduces the amount of training necessary to empower end-users to use a computer system.
Facilitates the introduction of computer technology in Small and Medium Enterprises (SMEs).
Opens the way for the development of computer systems for a country's national, provincial and district level administration that will allow civil servants to work entirely in the local language and manage databases of local language names and data.
Facilitates the decentralization of data at provincial and district levels. The same applies to utility companies (electricity, water, telephone), who will develop local language databases, thereby reducing costs and giving better service to citizens.
Allows citizens to communicate through e-mail in their own language.
Empowers local software development companies to work for the administration, the public sector and private companies.
Provides the local design industry with good fonts.
Helps universities train more software engineers.

The beneficiaries of this multi-stakeholder project are:

Directly, all local computer users, who will have easier access to the use of computers as they will not have to learn English first.
Indirectly, through improvements in governance using native computer systems, all local citizens in the quality of their dealings with the administration.
The local government who will have the opportunity to develop databases and applications in the local language. Sufficient technology and empowered local development companies will be available. The government will also have the tool to coordinate applications among similar administrations (e.g., provinces), so that IT-based improvements in governance can be made at the lowest possible cost.
The software industry. The government's use of standards-compliant computer technology encourages software companies to start developing compatible computer systems that will be used by the different bodies of the administration, thereby creating a stable software industry in the country. Once this expertise is developed (using FOSS), these companies will be empowered to undertake similar projects for foreign companies at extremely competitive prices, facilitating sales beyond the local market.

Source: http://en.wikibooks.org/wiki/FOSS_Localization/Introduction

Culturally Biased Wrong Assumptions

All letters are between A and Z: This does not hold for most languages other than English.
All scripts contain upper and lower case letters: Chinese, Japanese, Korean and Indic scripts do not have the concept of case.
Words are separated by spaces: Chinese, Japanese and Thai are normally written without spaces between words.
Punctuation is the same everywhere: English ends a question with ?. Spanish uses the same sign, but also opens the question with an inverted one (¿).
Text is written left to right: Arabic and Hebrew are bidirectional. Mongolian is written vertically, in columns that run from left to right. Chinese can be written horizontally from left to right, or vertically in columns that run from right to left.
All calendar systems are Gregorian: Thailand, for example, officially uses the Buddhist calendar.
Characters are eight bits: An 8-bit character representation cannot hold all the characters in the world. Unicode is normally used instead, encoded as UTF-8 or UTF-16.
Words contain consonants and vowels: Arabic and Hebrew are usually written without most vowels.
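The assumption about upper and lower case can be checked with a few lines of Java. The helper name hasCase is hypothetical; the Character methods it calls are standard. The sample characters are written as Unicode escapes so that the source file stays pure ASCII.

```java
class CaseCheck {
    // True if the character is an upper case, lower case or title case letter.
    static boolean hasCase(int codePoint) {
        return Character.isUpperCase(codePoint)
                || Character.isLowerCase(codePoint)
                || Character.isTitleCase(codePoint);
    }

    public static void main(String[] args) {
        int[] samples = {
            'A',        // Latin capital letter A
            '\u0D05',   // Malayalam letter A
            '\u4E2D',   // Chinese ideograph
            '\uD55C',   // Korean Hangul syllable
            '\u3042'    // Japanese hiragana letter A
        };
        for (int cp : samples) {
            System.out.println(new String(Character.toChars(cp))
                    + "  letter=" + Character.isLetter(cp)
                    + "  hasCase=" + hasCase(cp));
        }
    }
}
```

All five samples are letters, but only the Latin one has a case, so logic such as "capitalize the first letter of a name" cannot be applied blindly.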

Rules of Thumb for Software Internationalization

Internationalized software must be easy to port to other locales. A locale defines a language and its specific cultural conventions. The process of adjusting internationalized software to a particular locale is called localization (a common acronym for this term is L10N). You can think of software internationalization as a prerequisite for localization. Localization consists of more than just translating the user interface. Consider North America and Britain, for instance. Seemingly, they use the same language. However, these locales differ not only in spelling (program vs. programme, realize vs. realise, color vs. colour, etc.) but also in certain cultural conventions such as date formatting (in Europe the date format is DDMMYYYY, whereas in North America it is MMDDYYYY), currency, and measurement system. Other locales exhibit additional cultural differences. In Germany and other European countries, the separator of a decimal fraction (also called the radix character) is a comma, e.g., 10,5. In North America, the radix character is called the "decimal point" and, as the name suggests, is written as a period: 10.5.
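The radix and date-order differences described above can be observed with the standard Java formatting classes. This is a minimal sketch; the class name LocaleFormatDemo and its two helper methods are hypothetical, and the exact short date pattern may vary slightly with the locale data shipped in a given Java runtime.

```java
import java.text.NumberFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

class LocaleFormatDemo {
    // Format a number using the conventions of the given locale.
    static String number(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    // Format a date in the locale's short style (day-first or month-first).
    static String shortDate(LocalDate date, Locale locale) {
        return DateTimeFormatter.ofLocalizedDate(FormatStyle.SHORT)
                .withLocale(locale)
                .format(date);
    }

    public static void main(String[] args) {
        LocalDate christmas = LocalDate.of(2024, 12, 25);
        for (Locale locale : new Locale[] {Locale.US, Locale.UK, Locale.GERMANY}) {
            System.out.println(locale + ": " + number(10.5, locale)
                    + "  " + shortDate(christmas, locale));
        }
    }
}
```

The program never builds a date or a number by concatenating pieces itself; it passes the value and the locale to the library and lets the locale data decide the separators and the field order.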

Locales use different character codesets (7-bit ASCII, EBCDIC, Unicode) and fonts (Latin, Hebrew, Cyrillic). There are a few basic guidelines to follow in order to ensure easy software localization:

Avoid hard-coding literal text in your code. Instead, use string tables or environment variables.
Use wide characters instead of narrow characters. C and C++ support the wchar_t datatype.
Compose decimal numbers and dates from dynamic lexical units. Such lexical units are strings that contain a locale-specific representation of a fraction sign, currency symbol, and date format separators.
Don't assume anything about text directionality. Semitic languages such as Arabic and Hebrew are written right-to-left, as opposed to European languages. Consequently, menus, frames and pages are aligned differently in such languages. Some Asian languages can be written vertically.
Be ready to deal with several calendars. Other calendars such as the Muslim, Chinese and Hebrew calendars may be used in addition to the Gregorian calendar in certain locales.
Avoid any assumption about religious matters and holidays. For example, in non-Christian countries, December 25 is usually an ordinary business day and so is Sunday. An internationalized banking system should be ready to process transactions from foreign branches on Sunday, for example.
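As an illustration of the calendar guideline, the java.time.chrono package of the Java standard library can express the same day in several calendar systems. The sketch below uses the Thai Buddhist and Hijrah (Islamic) calendars because they ship with Java; the Chinese and Hebrew calendars mentioned above are not part of that package and would need an additional library. The class and method names are hypothetical.

```java
import java.time.LocalDate;
import java.time.chrono.HijrahDate;
import java.time.chrono.ThaiBuddhistDate;
import java.time.temporal.ChronoField;

class CalendarDemo {
    // Year number of the given ISO (Gregorian) date in the Thai Buddhist calendar.
    static int thaiBuddhistYear(LocalDate isoDate) {
        return ThaiBuddhistDate.from(isoDate).get(ChronoField.YEAR);
    }

    // Year number of the given ISO date in the Hijrah calendar.
    static int hijrahYear(LocalDate isoDate) {
        return HijrahDate.from(isoDate).get(ChronoField.YEAR);
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2024, 1, 1);
        System.out.println("ISO:           " + date);
        System.out.println("Thai Buddhist: " + ThaiBuddhistDate.from(date));
        System.out.println("Hijrah:        " + HijrahDate.from(date));
    }
}
```

Storing dates in one neutral form (here LocalDate) and converting only for display keeps the business logic independent of the calendar a user prefers.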

Locale

A locale denotes a specific language along with its conventions, such as date, currency, calendar and number formats. It also includes the following:

Names of the months
Days of the week
First day of the week
Collation sequencing (Sort order)
Time Zone information
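Several of the items listed above can be queried directly from a java.util.Locale. The sketch below is illustrative only: the class LocaleDataDemo and its helper methods are hypothetical names, while Month.getDisplayName, WeekFields and Collator are standard library classes.

```java
import java.text.Collator;
import java.time.DayOfWeek;
import java.time.Month;
import java.time.format.TextStyle;
import java.time.temporal.WeekFields;
import java.util.Arrays;
import java.util.Locale;

class LocaleDataDemo {
    // Full month name in the language of the locale.
    static String monthName(Month month, Locale locale) {
        return month.getDisplayName(TextStyle.FULL, locale);
    }

    // First day of the week according to the locale's conventions.
    static DayOfWeek firstDayOfWeek(Locale locale) {
        return WeekFields.of(locale).getFirstDayOfWeek();
    }

    // Sort a copy of the words using the locale's collation rules.
    static String[] sorted(Locale locale, String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(monthName(Month.JANUARY, Locale.FRENCH));
        System.out.println(firstDayOfWeek(Locale.US) + " / " + firstDayOfWeek(Locale.FRANCE));
        System.out.println(Arrays.toString(sorted(Locale.ENGLISH, "cherry", "Banana", "apple")));
    }
}
```

Sorting with a Collator differs from plain String comparison, which orders by character code and therefore places every upper case ASCII letter before every lower case one.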

Writing Systems

A writing system, or script, is not a language; it is a means of conveying information through written language. Writing systems can be classified as follows.

Script Type

Alphabetic: Individual units for writing are composed of consonants, and in some cases vowels. When combined they spell out words phonetically. Eg: Indic, Arabic, Latin, Greek etc.

Syllabic: The individual units for writing are composed of syllables. Eg: Japanese kana and Korean Hangul

Ideographic: A writing system which uses pictures or symbols to represent words. Eg: Chinese

Context dependent Glyph Shaping

Positional: The shape of a character changes depending on its position in the word. Eg: Arabic, Greek.

Ligatures: Characters combine to form a different shape when they appear next to one another. In Indic scripts ligatures are mandatory.

Cursive: The letters are joined while writing. Arabic is an example; printed English is not.

Text Direction

Left to right: Text is written left to right horizontally. Eg: Indic, English

Bidirectional: Text is written right to left, while numbers and embedded Latin words are written left to right. Eg: Arabic, Hebrew

Vertical: Chinese and Japanese text can be written vertically.
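Whether a piece of text is left-to-right, right-to-left or mixed can be determined with java.text.Bidi, the Java implementation of the Unicode bidirectional algorithm. The sketch below is a minimal illustration; the class DirectionDemo and the method describe are hypothetical names, and the Hebrew sample is written with Unicode escapes.

```java
import java.text.Bidi;

class DirectionDemo {
    // Classify text by running the Unicode bidirectional algorithm on it.
    static String describe(String text) {
        // The base direction is taken from the first strongly directional character.
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        if (bidi.isLeftToRight()) {
            return "left-to-right";
        }
        if (bidi.isRightToLeft()) {
            return "right-to-left";
        }
        return "mixed";
    }

    public static void main(String[] args) {
        String hebrew = "\u05E9\u05DC\u05D5\u05DD"; // the Hebrew word "shalom"
        System.out.println(describe("Hello"));
        System.out.println(describe(hebrew));
        System.out.println(describe(hebrew + " Java"));
    }
}
```

A user interface toolkit uses this kind of analysis to decide how to order runs of text on a line and on which side to align them.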

Other Characteristics

Diacritics: Special marks used for accents, tones, and vowels, or to uniquely identify a character. In some writing systems such as Indic and Thai, diacritics can span multiple characters.

Word separator: Most languages use a space as the word separator. Exceptions are Chinese, Thai and Japanese.

Punctuation: Punctuation marks are not consistent across writing systems.

A detailed description of the above writing systems can be found on the Wikipedia page on Writing Systems.

Unicode

Unicode is an industry standard designed to allow text and symbols from all of the writing systems of the world to be consistently represented and manipulated by computers. Developed in tandem with the Universal Character Set standard and published in book form as The Unicode Standard, Unicode consists of a character repertoire, an encoding methodology and set of standard character encodings, a set of code charts for visual reference, an enumeration of character properties such as upper and lower case, a set of reference data computer files, and rules for normalization, decomposition, collation and rendering.

The Unicode Consortium, the non-profit organization that coordinates Unicode's development, has the ambitious goal of eventually replacing existing character encoding schemes with Unicode and its standard Unicode Transformation Format (UTF) schemes, as many of the existing schemes are limited in size and scope and are incompatible with multilingual environments. Unicode's success at unifying character sets has led to its widespread and predominant use in the internationalization and localization of computer software. The standard has been implemented in many recent technologies, including XML, the Java programming language and modern operating systems.
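A few of the ideas mentioned above (encoding forms, code points, normalization and decomposition) can be seen in a short Java sketch. The class UnicodeDemo and its helper methods are hypothetical names; String, StandardCharsets and java.text.Normalizer are standard. Characters are written as Unicode escapes so the source stays ASCII.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class UnicodeDemo {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of Unicode code points, as opposed to 16-bit Java chars.
    static int codePoints(String text) {
        return text.codePointCount(0, text.length());
    }

    // Canonical decomposition (NFD): splits precomposed characters into base + marks.
    static String decompose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",             // Latin letter
            "\u00E9",        // e with acute accent, precomposed
            "\u0D32",        // Malayalam letter LA
            "\uD83D\uDE00"   // an emoji outside the 16-bit range (surrogate pair)
        };
        for (String s : samples) {
            System.out.println(s + "  chars=" + s.length()
                    + "  codePoints=" + codePoints(s)
                    + "  utf8Bytes=" + utf8Length(s)
                    + "  nfdChars=" + decompose(s).length());
        }
    }
}
```

The counts differ from one sample to the next, which is why code that equates one character with one byte, or even with one 16-bit unit, breaks on multilingual text.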

More details at

Internationalized Resource Identifiers

Internationalized Resource Identifiers (IRIs) are also known as multilingual Web addresses. Web addresses are typically expressed using Uniform Resource Identifiers, or URIs. This restricts Web addresses to a small number of characters: basically, just upper and lower case letters of the English alphabet, European numerals and a small number of symbols. More recent developments make it possible to use non-ASCII characters in Web addresses.

Detailed information is available from An Introduction to Multilingual Web Addresses
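When a multilingual address has to be passed to software that only understands ASCII, the host name is commonly converted with IDNA (Punycode) and other parts are percent-encoded as UTF-8 bytes. The sketch below shows both steps with standard Java classes. It assumes Java 10 or later for the URLEncoder overload that takes a Charset, and the class and method names are hypothetical. Note that URLEncoder targets HTML form encoding (it turns a space into +), so it is only an approximation for path segments.

```java
import java.net.IDN;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class IriDemo {
    // Convert an internationalized host name to its ASCII-compatible form.
    static String asciiHost(String host) {
        return IDN.toASCII(host);
    }

    // Percent-encode non-ASCII text as UTF-8 bytes.
    static String percentEncode(String text) {
        return URLEncoder.encode(text, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String host = "b\u00FCcher.example";   // "bucher" with u-umlaut
        String segment = "caf\u00E9";          // "cafe" with e-acute
        System.out.println("http://" + asciiHost(host) + "/" + percentEncode(segment));
    }
}
```

The user can keep typing and reading the native-script form, while the conversion happens behind the scenes before the request is sent.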

Input Methods

Input methods are applications or software components that convert a user's keystrokes into symbols, characters or words.

An input method editor (IME) is a program or operating system component that allows computer users to enter characters and symbols not found on their keyboard. This, for instance, allows the user of a Western keyboard to input Chinese, Japanese, Korean and Indic characters.

This is intended as a non-exhaustive list of input methods for UNIX platforms.

Name	Languages supported	Implementations supported
SCIM	Multiple languages, including CJK	GTK+ , Qt and XIM
uim	Multiple languages, including CJK	GTK+, Qt, XIM, Leim, Tty (Unix) and TSM (Mac OS X)
xcin	Mainly for traditional Chinese; adapted for use for simplified Chinese.	XIM
InputKing	Traditional Chinese and simplified Chinese.	Browser based.
im-ja	Japanese	GTK+ and XIM
kinput2	Japanese	XIM, kinput2 protocol
ami	Korean	XIM
imhangul	Korean	GTK+
Nabi	Korean	XIM
qimhangul	Korean	Qt
xvnkb	Vietnamese	XIM
x-unikey	Vietnamese	XIM

Source: Wikipedia page on Input Method Editor

Appendix

ISO codes for languages

Refer to http://www.unicode.org/unicode/onlinedat/languages.html

Unicode Ranges

Refer to the Unicode charts: http://unicode.org/charts/

References

Andrew Deitsch and David Czarnecki, Java Internationalization, First Edition, O'Reilly, 2001, pp. 1-15