Using multiple character sets

Introduction

This document provides an overview of how a web developer can code multi-language websites with different character sets. The examples use ASP as the presentation server development language. A brief description is given of the settings that need to be configured within Tridion per website, together with best practices for presentation-side coding for multiple languages.

1. Character Sets

In this document we will be talking about different character sets. Therefore it is important to understand what exactly a character set is.  

Definition

A character encoding is a code that pairs a set of natural language characters (such as an alphabet or syllabary) with a set of something else, such as numbers or electrical pulses. Common examples include Morse code, which encodes letters of the Latin alphabet as series of long and short depressions of a telegraph key; and ASCII, which encodes letters, numerals, and other symbols as both integers and 7-bit binary versions of those integers.
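As a small illustration of the ASCII pairing described above, the following Python sketch lists each character of a short string with its integer code and 7-bit binary form. The function name `ascii_table` is only an illustrative choice and is not part of any library.

```python
def ascii_table(text):
    """Return (character, integer code, 7-bit binary string) for each character."""
    rows = []
    for ch in text:
        code = ord(ch)
        if code > 127:
            # ASCII only defines codes 0..127.
            raise ValueError(f"{ch!r} is not an ASCII character")
        rows.append((ch, code, format(code, "07b")))
    return rows


for ch, code, bits in ascii_table("Hi!"):
    print(ch, code, bits)
# H 72 1001000
# i 105 1101001
# ! 33 0100001
```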

History

In computers and in data transmission between them, i.e. in digital data processing and transfer, data is internally presented as octets, as a rule. An octet is a small unit of data with a numerical value between 0 and 255, inclusive. The numerical values are presented in the normal (decimal) notation here, but notice that other presentations are used too, especially octal (base 8) or hexadecimal (base 16) notation. Octets are often called bytes, but in principle, octet is a more definite concept than byte. Internally, octets consist of eight bits (hence the name, from Latin octo 'eight'), but we need not go into bit level here. However, you might need to know what the phrase "first bit set" or "sign bit set" means, since it is often used. In terms of numerical values of octets, it means that the value is greater than 127. In various contexts, such octets are sometimes interpreted as negative numbers, and this may cause various problems.
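The "sign bit set" case can be sketched in Python as follows. The helper `as_signed` is a hypothetical name used here only to show how an octet above 127 turns into a negative number when it is read as a signed 8-bit value.

```python
def as_signed(octet):
    """Interpret an octet (0..255) as a signed 8-bit two's-complement value."""
    if not 0 <= octet <= 255:
        raise ValueError("an octet must be between 0 and 255")
    return octet - 256 if octet > 127 else octet


octet = 233
print(octet, oct(octet), hex(octet))  # 233 0o351 0xe9
print(as_signed(octet))               # -23
print(as_signed(65))                  # 65 (sign bit not set, value unchanged)
```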
Different conventions can be established regarding how an octet or a sequence of octets presents some data. For instance, four consecutive octets often form a unit that presents a real number according to a specific standard. We are here interested in the presentation of character data (or string data; a string is a sequence of characters) only.
In the simplest case, which is still widely used, one octet corresponds to one character according to some mapping table (encoding). Naturally, this allows at most 256 different characters to be represented. There are several different encodings, such as the well-known ASCII encoding and the ISO Latin family of encodings. The correct interpretation and processing of character data of course requires knowledge about the encoding used. For HTML documents, such information should be sent by the Web server along with the document itself, using so-called HTTP headers (cf. MIME headers).
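To show why the encoding must be known, here is a minimal Python sketch that decodes the same single octet with three different one-octet-per-character encodings. The helper `decode_with` is an illustrative name; the codec names are those used by Python's standard library.

```python
def decode_with(octets, encodings):
    """Decode the same octets with each named encoding."""
    return {name: octets.decode(name) for name in encodings}


# One octet, three mapping tables, three different characters.
result = decode_with(bytes([0xE9]), ["latin_1", "iso8859_5", "iso8859_7"])
for name, text in result.items():
    print(name, text, hex(ord(text)))
```

If the receiver guesses the wrong mapping table, the octet is still decoded without error, but the reader sees the wrong character.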
Previously the ASCII encoding was usually assumed by default (and it is still very common). Nowadays ISO Latin 1, which can be regarded as an extension of ASCII, is often the default. The current trend is to avoid giving such a special position to ISO Latin 1 among the variety of encodings.
To enable the correct display of characters that are found neither in ASCII (values 0-127) nor in a single-octet encoding such as ISO Latin 1 (values 0-255), the current web trend is to use the UTF-8 encoding, which lets the web browser display characters that would otherwise be unknown (think of Chinese characters, Arabic words etc.).
 
In this document we will be looking at an example of how to display a foreign language correctly within an HTML document using UTF-8 encoding.   
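Before turning to the presentation server, a short Python sketch may help show what UTF-8 does with such characters: each one is stored as one to four octets. The helper name `utf8_octets` and the sample characters are illustrative choices only.

```python
def utf8_octets(text):
    """Return the UTF-8 octets of text as a list of integers."""
    return list(text.encode("utf-8"))


samples = {"Latin": "A", "Latin 1": "é", "Arabic": "م", "Chinese": "中"}
for label, ch in samples.items():
    print(label, [hex(o) for o in utf8_octets(ch)])

# A single-octet encoding simply has no slot for the Chinese character.
try:
    "中".encode("latin_1")
except UnicodeEncodeError:
    print("not representable in ISO Latin 1")
```

Note that every octet of a multi-octet UTF-8 sequence has its first bit set, so plain ASCII text is unchanged when stored as UTF-8.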

2. Presentation Server

For the example shown I am assuming that ASP is the development language being used.
To enable the HTML document to display the content in the correct fashion you need to include the following pieces of code within your ASP code.

  1. You need to specify the correct code page for your ASP execution. Find the code page value that matches the encoding you are using; for UTF-8 this is 65001.
    <%@ Language=VBScript CodePage=65001 %>
    <% Response.CodePage = 65001 %>
  2. You need to add a META tag defining the encoding type for the document:
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  3. If you wish to display a language that reads from right to left (bi-directional text, e.g. Arabic) you can also force this by adding the dir attribute to your HTML tag:
    <html dir="rtl">
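The pieces listed above have to agree with each other: the octets sent, the charset announced in the HTTP header, and the charset in the META tag. The following is a language-neutral sketch in Python, not ASP code; the function `render_page` and its parameters are hypothetical and only illustrate keeping the three in step, with an optional right-to-left direction on the HTML tag.

```python
from html import escape


def render_page(title, body_text, lang="en", rtl=False, charset="UTF-8"):
    """Build a page whose header, META tag and octets all use one charset."""
    direction = ' dir="rtl"' if rtl else ""
    content_type = f"text/html; charset={charset}"
    html = (
        "<!DOCTYPE html>\n"
        f'<html lang="{escape(lang)}"{direction}>\n'
        "<head>\n"
        f'<meta http-equiv="Content-Type" content="{content_type}">\n'
        f"<title>{escape(title)}</title>\n"
        "</head>\n"
        f"<body><p>{escape(body_text)}</p></body>\n"
        "</html>\n"
    )
    headers = {"Content-Type": content_type}
    # Encode with the same charset that the header and META tag announce.
    return headers, html.encode(charset)


headers, payload = render_page("مرحبا", "مرحبا بالعالم", lang="ar", rtl=True)
print(headers["Content-Type"])
print(len(payload), "octets")
```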

3. Tridion Content Manager

We have taken a look at what you need to include within your presentation-side code to enable the correct display of foreign languages. These code fragments will most likely reside within your Page templates. When you publish your pages, the correct code will then be inserted per publication / web site / language.
There are also a few settings that you have to configure within the Tridion Content Manager GUI (graphical user interface) to enable the functionality.
When you publish a page within a publication, it is published to a publication target. Every publication target has Target Language and Default Code Page values that you can configure. The values to use depend on the type of presentation server you are publishing to (IIS, BEA, IBM WebSphere etc.).

Once you have set the correct Target Language, you have to set the correct Default Code Page so that your pages are rendered with the correct Unicode encoding.
When you are publishing your content in a foreign language you will most likely set this value to Unicode (UTF-8).
 
Once you have configured the publication targets correctly and updated your templates to tell the client web browser which Unicode encoding to use, you can go ahead and publish your website contents for all foreign languages.

4. Blueprinting

When you are creating multiple websites in different languages within the Tridion Content Manager, you can utilize Blueprinting to simplify the encoding configuration that is needed to output the correct content.
 
One example is to put the code page and META tag information in a system component. Your page template then reads this information from the system component.
You can then localize the system component per publication/website and configure the correct character set to be used to display that specific language correctly.
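The idea of a shared system component with per-publication localization can be modelled, very loosely, as a set of parent values with local overrides. The Python sketch below is only a conceptual model: it does not use any Tridion API, and the names `parent_component`, `arabic_overrides` and `localize` are hypothetical.

```python
from collections import ChainMap

# Values defined once in the parent publication's system component.
parent_component = {"charset": "UTF-8", "codepage": 65001, "lang": "en", "dir": "ltr"}

# Only the fields that differ are set in the localized copy.
arabic_overrides = {"lang": "ar", "dir": "rtl"}


def localize(parent, overrides):
    """Look up overrides first, then fall back to the parent values."""
    return ChainMap(overrides, parent)


arabic_site = localize(parent_component, arabic_overrides)
print(arabic_site["lang"], arabic_site["dir"], arabic_site["charset"])
# ar rtl UTF-8
```

A page template that reads these values would then emit the right META tag and direction for each website without any per-language template code.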