Glossary of Terms



byte

A byte is an unsigned 8-bit quantity that can hold values in the range 0x00-0xFF. Bytes are usually represented in OpenTop using the Byte typedef.

It is important to draw the distinction between bytes and characters, especially when dealing with a large character set like Unicode. Some encoding systems can represent the code-point for a Unicode character using a single byte, but these encodings are limited to a very small subset of the Unicode range. Generally a Unicode character is encoded into a sequence of one or more bytes.
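As a small illustration (a minimal sketch; the Byte typedef below is merely a stand-in for the OpenTop one), the single Unicode character U+00E9 (LATIN SMALL LETTER E WITH ACUTE) occupies two bytes once encoded in UTF-8:

    #include <iostream>

    // Stand-in for the OpenTop Byte typedef; the real definition may differ.
    typedef unsigned char Byte;

    int main()
    {
        // One Unicode character, U+00E9, encoded in UTF-8:
        // a single character, but two bytes.
        const Byte eAcute[] = { 0xC3, 0xA9 };

        std::cout << "characters: 1, bytes: "
                  << sizeof(eAcute) / sizeof(Byte) << std::endl;
        return 0;
    }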



code-point

A code-point is the integer value assigned to a Unicode character. The terms 'Unicode character' and 'Unicode code-point' are often used interchangeably.



UCS-4

The UCS-4 encoding represents each Unicode character as a 32-bit value. As 32 bits is more than enough to hold the entire Unicode range (U+0000-U+10FFFF), each code-point is simply stored as its own numeric value; no transformation is required.

UCS-4 is the preferred character encoding on platforms that have a 32-bit wchar_t, e.g. Linux. UCS-4 presents an ideal way to represent characters in memory, but it is rarely used as an encoding for serializing characters to a file or across a network connection. For this task UTF-8 or UTF-16 is usually used.
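The following sketch (illustrative only; the variable names are not part of any API) shows the appeal of UCS-4 as an in-memory representation: a code-point beyond the 16-bit range is held directly in a single 32-bit value, with no transformation.

    #include <iostream>

    int main()
    {
        // In UCS-4 every code-point is stored directly as a 32-bit value,
        // so even characters outside the first 64K need no special treatment.
        // U+1D11E is MUSICAL SYMBOL G CLEF.
        unsigned long gClef = 0x1D11EUL;   // unsigned long is at least 32 bits

        // On platforms with a 32-bit wchar_t (e.g. Linux) the same value
        // fits in a single wchar_t.
        std::cout << std::hex << "code-point: U+" << gClef << std::endl;
        return 0;
    }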



Unicode

In the words of the Unicode Consortium:
"Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.

Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use."

The World Wide Web Consortium (W3C) has adopted Unicode as the preferred character encoding scheme, with the result that recent recommendations such as XML are specified in terms of Unicode characters. Over time, Unicode is likely to become the dominant character encoding scheme in use everywhere.

Representing Unicode in C++

Although Unicode provides a uniform representation (code-point) for each character in use worldwide, it does not specify a single uniform way to represent the encoded characters in computer memory. In C++ programs, we generally represent characters using one of the fundamental types: char and wchar_t. A char is usually an 8-bit value with a range of 0 to 255 when unsigned or -128 to 127 when signed. As the Unicode code space runs from U+0000 to U+10FFFF (over one million code-points), we obviously cannot use char to represent every character directly.

According to the ISO/C++ specification:

"type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales".

Ideally wchar_t should be large enough to hold all the Unicode values, but this is unfortunately not always the case. The Unicode range 0-0x10FFFF requires at least 21 bits, but on some platforms (notably Windows) wchar_t is only 16 bits wide. On these platforms, the only way to encode the whole of the Unicode range is to use a multi-byte or multi-character encoding.
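The sketch below shows one way to check, at run time, whether the platform's wchar_t is wide enough to hold any Unicode code-point; it relies only on the standard headers shown and makes no assumption about a particular compiler.

    #include <climits>
    #include <iostream>

    int main()
    {
        // wchar_t is 16 bits on Windows but 32 bits on most Unix platforms;
        // at least 21 bits are needed to hold the full range 0-0x10FFFF.
        const unsigned long bits = sizeof(wchar_t) * CHAR_BIT;

        if (bits >= 21)
            std::cout << "wchar_t (" << bits << " bits) can hold any "
                         "code-point up to U+10FFFF" << std::endl;
        else
            std::cout << "wchar_t (" << bits << " bits) is too narrow; "
                         "a multi-character encoding such as UTF-16 is needed"
                      << std::endl;
        return 0;
    }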

In a typical project, another problem that must often be addressed is the integration of new code with existing libraries and operating system APIs. These are often based on char interfaces, sometimes limited to ASCII but sometimes encoded according to a locale.

Unicode in OpenTop

OpenTop offers full support for the Unicode 3.0 assigned character range, providing a choice of character types and encoding methods. See the API documentation for SystemCodeConverter for further information about how OpenTop deals with Unicode characters and strings.



UTF-16

The UTF-16 encoding represents each Unicode character using one or two 16-bit values. Unicode characters in the range U+0000-U+FFFF are represented using a single 16-bit value, with the exception of the surrogate range (U+D800-U+DFFF); because Unicode reserves that range for use by UTF-16, surrogate values are not legal characters in their own right. Unicode characters in the range U+10000-U+10FFFF are represented using a pair of 16-bit values known as a surrogate pair: a high surrogate in the range 0xD800-0xDBFF followed by a low surrogate in the range 0xDC00-0xDFFF.
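The arithmetic behind a surrogate pair is simple: subtract 0x10000 from the code-point and split the remaining 20 bits between the high and low surrogates. The sketch below (illustrative only, with no validation) encodes one example character:

    #include <iostream>

    int main()
    {
        // Encode a code-point above U+FFFF as a UTF-16 surrogate pair.
        // U+1D11E (MUSICAL SYMBOL G CLEF) is used as the example.
        unsigned long cp = 0x1D11EUL;

        unsigned long v = cp - 0x10000UL;                        // 20-bit value
        unsigned int high = 0xD800U + (unsigned int)(v >> 10);   // high surrogate
        unsigned int low  = 0xDC00U + (unsigned int)(v & 0x3FF); // low surrogate

        std::cout << std::hex << "U+" << cp
                  << " -> 0x" << high << " 0x" << low << std::endl;  // d834 dd1e
        return 0;
    }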

Two variants of UTF-16 exist, a big-endian form (UTF-16BE) and a little-endian form (UTF-16LE). When reading a UTF-16 encoded file, the first 16-bit value is normally expected to be a Byte Order Mark (the Unicode character U+FEFF), which informs the program whether the characters were written on a big-endian or a little-endian machine.

UTF-16 has become very popular because it is the native character encoding scheme used in Java applications.



UTF-8

UTF-8 is a method for encoding Unicode characters into a sequence of one or more bytes. It is defined in ISO 10646-1:2000 Annex D and also described in RFC 2279 as well as section 3.8 of the Unicode 3.0 standard.

UTF-8 has the following properties:

- Characters in the ASCII range (U+0000-U+007F) are encoded as a single byte with the same value, so an ASCII string is already valid UTF-8.
- Characters above U+007F are encoded as a sequence of two or more bytes, each with a value in the range 0x80-0xFF, so no byte of a multi-byte sequence falls within the ASCII range.
- The first byte of a multi-byte sequence indicates how many bytes make up the sequence, so a decoder can resynchronize at the start of the next character.

UTF-8 is a particularly attractive encoding scheme because, as described above, it preserves ASCII values (0x00-0x7F) and multi-byte sequences do not use the ASCII range. This means that libraries and APIs that offer an ASCII interface can normally still be used with UTF-8 encoded character strings.

The main drawback of UTF-8 is that a single char no longer necessarily represents a single character; each char must be inspected to see whether it is part of a multi-byte sequence.
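The encoder sketch below shows how the byte sequences are formed; it follows the standard UTF-8 bit patterns but omits validation (for example, it does not reject surrogate values), so it is illustrative rather than production code. Note how the ASCII character passes through as a single unchanged byte while the accented character needs two.

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Minimal UTF-8 encoder for code-points up to U+10FFFF (no validation).
    std::vector<unsigned char> utf8Encode(unsigned long cp)
    {
        std::vector<unsigned char> out;
        if (cp <= 0x7FUL) {                       // 1 byte: ASCII unchanged
            out.push_back((unsigned char)cp);
        } else if (cp <= 0x7FFUL) {               // 2 bytes
            out.push_back((unsigned char)(0xC0 | (cp >> 6)));
            out.push_back((unsigned char)(0x80 | (cp & 0x3F)));
        } else if (cp <= 0xFFFFUL) {              // 3 bytes
            out.push_back((unsigned char)(0xE0 | (cp >> 12)));
            out.push_back((unsigned char)(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back((unsigned char)(0x80 | (cp & 0x3F)));
        } else {                                  // 4 bytes
            out.push_back((unsigned char)(0xF0 | (cp >> 18)));
            out.push_back((unsigned char)(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back((unsigned char)(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back((unsigned char)(0x80 | (cp & 0x3F)));
        }
        return out;
    }

    int main()
    {
        // 'A' (U+0041) remains a single ASCII byte; U+00E9 becomes 0xC3 0xA9.
        const unsigned long examples[] = { 0x41UL, 0xE9UL };
        for (int i = 0; i != 2; ++i) {
            std::vector<unsigned char> bytes = utf8Encode(examples[i]);
            std::cout << std::hex << "U+" << examples[i] << " ->";
            for (std::size_t j = 0; j != bytes.size(); ++j)
                std::cout << " 0x" << (unsigned int)bytes[j];
            std::cout << std::endl;
        }
        return 0;
    }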



Found a bug or missing feature? Please email us at support@elcel.com

Copyright © 2000-2003 ElCel Technology   Trademark Acknowledgements