Cross-Platform C++

ot
class SystemCodeConverter

#include "ot/base/SystemCodeConverter.h"

Inherits from: ot::CodeConverterBase

Class module for converting Unicode strings to and from the internal character encoding. The OpenTop library is designed to be flexible in the way that it represents Unicode characters: it can be configured to use either char or wchar_t characters.

When configured to use char characters, OpenTop encodes Unicode characters into UTF-8. The size of a wchar_t is not uniformly defined on different platforms, so OpenTop offers a choice of two encoding schemes when configured to use wchar_t characters: UCS-4 for 32-bit implementations and UTF-16 for 16-bit implementations.

character type   size (bits)   encoding(s)
char             8             UTF-8
wchar_t          16            UTF-16
wchar_t          32            UCS-4
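
The mapping in the table above can be illustrated with a short sketch. This is not OpenTop code: encodingNameFor is a hypothetical helper that chooses an encoding name purely from the bit width of the character type, and it assumes that any type wider than 16 bits uses UCS-4.

```cpp
#include <climits>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of OpenTop): maps a character type to the
// encoding name listed in the table, based on its width in bits.
template <typename CharT>
std::string encodingNameFor()
{
    const std::size_t bits = sizeof(CharT) * CHAR_BIT;
    if (bits == 8)  return "UTF-8";
    if (bits == 16) return "UTF-16";
    return "UCS-4"; // assumption: 32-bit (or wider) character types
}
```

For example, encodingNameFor<char>() yields "UTF-8", while encodingNameFor<wchar_t>() yields "UTF-16" or "UCS-4" depending on the platform's wchar_t width.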




Method Summary
static Result FromInternalEncoding(UCS4Char& ch, const CharType* from, const CharType* from_end, const CharType*& from_next)
         Converts a sequence of 1 or more CharType characters, which are encoded according to the internal Unicode encoding scheme, into the code-point for a Unicode character.
static size_t GetCharSequenceLength(UCharType ch)
         Returns the number of CharType characters that make up a multi-character sequence.
static String GetInternalEncodingName()
         Returns the name of the encoding scheme used by OpenTop to encode Unicode characters.
static size_t GetMaximumCharSequenceLength()
         Returns the maximum number of CharType characters that may be used to encode a single Unicode character.
static bool IsSequenceStartChar(UCharType ch)
         Tests whether the passed character ch is a standalone character or the start of a multi-character sequence, as opposed to a trailing character.
static bool IsValidCharSequence(const CharType* from, size_t len)
         Tests whether the len CharType characters starting at from represent a properly encoded Unicode character.
static Result TestEncodedSequence(const CharType* from, const CharType* from_end, const CharType*& from_next)
         Tests a sequence of CharType characters to check that it is encoded according to the chosen OpenTop internal encoding scheme.
static String ToInternalEncoding(UCS4Char ch)
         Returns the Unicode character ch as a String containing a sequence of one or more CharType characters encoded using the OpenTop internal encoding scheme.
static Result ToInternalEncoding(UCS4Char ch, CharType* to, const CharType* to_limit, CharType*& to_next)
         Converts a Unicode character value into a sequence of one or more CharType characters encoded according to the internal Unicode encoding scheme.

Methods inherited from class ot::CodeConverterBase
IsLegalUTF16, IsLegalUTF8, UTF8Decode, UTF8Encode

Method Detail

FromInternalEncoding

static Result FromInternalEncoding(UCS4Char& ch,
                                   const CharType* from,
                                   const CharType* from_end,
                                   const CharType*& from_next)
Converts a sequence of 1 or more CharType characters, which are encoded according to the internal Unicode encoding scheme, into the code-point for a Unicode character.

Parameters:
ch - a return parameter giving the Unicode character's code-point value in the range 0-0x10FFFF
from - a pointer to the beginning of a CharType buffer that holds the encoded sequence
from_end - a pointer to the end of the passed CharType buffer
from_next - a return parameter, points to the beginning of the next multi-character sequence in the passed CharType buffer
Returns:
A CodeConverterBase::Result indicating the result of the conversion.
Exceptions:
NullPointerException - if from or from_end is null
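
As an illustration of the from / from_end / from_next calling pattern, the following sketch decodes a single UTF-8 sequence. It is not the OpenTop implementation: decodeUtf8 and the DecodeResult enumerators are hypothetical names (the actual CodeConverterBase::Result values are not listed on this page), and null-pointer checking is omitted.

```cpp
#include <cstdint>

// Hypothetical stand-in for CodeConverterBase::Result.
enum class DecodeResult { ok, inputExhausted, error };

// Decodes one UTF-8 sequence from [from, from_end).
// On success, ch receives the code point and from_next points just past
// the sequence. On failure, from_next is left equal to from.
DecodeResult decodeUtf8(std::uint32_t& ch,
                        const char* from,
                        const char* from_end,
                        const char*& from_next)
{
    from_next = from;
    if (from == from_end) return DecodeResult::inputExhausted;

    const unsigned char lead = static_cast<unsigned char>(*from);
    int len;
    std::uint32_t value;
    if (lead < 0x80)                { len = 1; value = lead; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; value = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; value = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; value = lead & 0x07; }
    else return DecodeResult::error; // trailing byte or unsupported lead

    if (from_end - from < len) return DecodeResult::inputExhausted;

    for (int i = 1; i < len; ++i) {
        const unsigned char c = static_cast<unsigned char>(from[i]);
        if ((c & 0xC0) != 0x80) return DecodeResult::error;
        value = (value << 6) | (c & 0x3F);
    }

    // Reject over-long forms, surrogates and values beyond U+10FFFF.
    static const std::uint32_t minValue[] = { 0, 0, 0x80, 0x800, 0x10000 };
    if (value < minValue[len] || value > 0x10FFFF ||
        (value >= 0xD800 && value <= 0xDFFF))
        return DecodeResult::error;

    ch = value;
    from_next = from + len;
    return DecodeResult::ok;
}
```

A caller would typically loop, passing the returned from_next as the next from, until from_next reaches from_end.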

GetCharSequenceLength

static size_t GetCharSequenceLength(UCharType ch)
Returns the number of CharType characters that make up a multi-character sequence. Unless the internal encoding used by the library is UCS-4 (where CharType is at least 21 bits wide and can represent all Unicode characters from U+0000 to U+10FFFF), Unicode characters are represented internally using a sequence of one or more CharType characters.

Each multi-character encoding supported (UTF-16 and UTF-8) allows the length of the sequence to be determined from the first character of the sequence.

In the case of UTF-16, the length is 1 unless ch is a surrogate pair start character (0xD800-0xDBFF), in which case the length is 2.

In the case of UTF-8, the sequence length can be established by counting the high-order bits set to '1' in the passed char ch. If the highest-order bit is not set, then the passed character is equivalent to an ASCII character and the sequence has a length of 1. In common with the rest of OpenTop, this method does not recognize UTF-8 sequences longer than 4 bytes. Lead bytes that indicate sequences longer than 4 bytes are treated as indicating a sequence of length 1.

Parameters:
ch - The first character in a sequence of one or more CharType characters
Returns:
the length of the multi-character sequence denoted by the start character ch
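
The rules described above can be sketched as two small functions. These are illustrative stand-ins, not the OpenTop source: the names utf8SequenceLength and utf16SequenceLength are hypothetical, and reporting a stray UTF-8 trailing byte as length 1 is an assumption.

```cpp
#include <cstddef>

// Sketch only: length of a UTF-8 sequence, judged from its lead byte.
// Lead bytes announcing more than 4 bytes are reported as length 1,
// following the description above. Stray trailing bytes are also
// reported as 1 (an assumption).
std::size_t utf8SequenceLength(unsigned char lead)
{
    if (lead < 0x80)           return 1; // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 1;                            // trailing byte or unsupported lead
}

// Sketch only: length of a UTF-16 sequence, judged from its first unit.
std::size_t utf16SequenceLength(char16_t first)
{
    // A surrogate pair start character (0xD800-0xDBFF) announces 2 units.
    return (first >= 0xD800 && first <= 0xDBFF) ? 2 : 1;
}
```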

GetInternalEncodingName

static String GetInternalEncodingName()
Returns the name of the encoding scheme used by OpenTop to encode Unicode characters.

Returns:
a String containing the name of the internal character encoding in use, e.g. "UTF-8"

GetMaximumCharSequenceLength

static size_t GetMaximumCharSequenceLength()
Returns the maximum number of CharType characters that may be used to encode a single Unicode character. The return value depends on whether char or wchar_t has been selected as the OpenTop character type as well as the operating system platform.


IsSequenceStartChar

static bool IsSequenceStartChar(UCharType ch)
Tests whether the passed character ch is a standalone character or the start of a multi-character sequence, as opposed to a trailing character.

Parameters:
ch - character to test
Returns:
true if ch is either a standalone character or marks the start of a multi-character sequence; false otherwise
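
A minimal sketch of such a test for the two multi-character encodings follows. The function names are hypothetical and the code is not taken from OpenTop; it relies only on the standard bit patterns of UTF-8 trailing bytes and UTF-16 low surrogates.

```cpp
#include <cstddef>
#include <string>

// UTF-8: trailing bytes have the bit pattern 10xxxxxx.
bool isUtf8SequenceStart(unsigned char ch)
{
    return (ch & 0xC0) != 0x80;
}

// UTF-16: trailing units are low surrogates, 0xDC00-0xDFFF.
bool isUtf16SequenceStart(char16_t ch)
{
    return ch < 0xDC00 || ch > 0xDFFF;
}

// Example use: count the characters in a UTF-8 string by counting only
// the bytes that start a sequence (assumes the input is well formed).
std::size_t countUtf8Characters(const std::string& s)
{
    std::size_t count = 0;
    for (char c : s) {
        if (isUtf8SequenceStart(static_cast<unsigned char>(c))) ++count;
    }
    return count;
}
```

The same test is useful for stepping backwards through a buffer: skip characters until one is found for which the test returns true.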

IsValidCharSequence

static bool IsValidCharSequence(const CharType* from,
                                size_t len)
Tests whether the len CharType characters starting at from represent a properly encoded Unicode character.

Parameters:
from - pointer to the first CharType character in the sequence
len - the number of CharType characters in the encoded sequence
Returns:
true if the sequence represents a valid Unicode character; false otherwise
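
For the UTF-16 case, a check of this kind might look like the sketch below. isValidUtf16Sequence is a hypothetical name, and the code assumes that the sequence must hold exactly one encoded character; the OpenTop implementation may differ in detail.

```cpp
#include <cstddef>

// Sketch: does [from, from + len) hold exactly one well-formed UTF-16
// encoded character?
bool isValidUtf16Sequence(const char16_t* from, std::size_t len)
{
    if (from == nullptr) return false;

    const auto isHigh = [](char16_t c) { return c >= 0xD800 && c <= 0xDBFF; };
    const auto isLow  = [](char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; };

    // A single unit must not be a surrogate of either kind.
    if (len == 1) return !isHigh(from[0]) && !isLow(from[0]);
    // Two units must form a high/low surrogate pair, in that order.
    if (len == 2) return isHigh(from[0]) && isLow(from[1]);
    return false;
}
```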

TestEncodedSequence

static Result TestEncodedSequence(const CharType* from,
                                  const CharType* from_end,
                                  const CharType*& from_next)
Tests a sequence of CharType characters to check that it is encoded according to the chosen OpenTop internal encoding scheme.

Parameters:
from - a pointer to the beginning of a CharType buffer that holds the encoded sequence
from_end - a pointer to the end of the passed CharType buffer
from_next - a return parameter, points to the beginning of the next multi-character sequence in the passed CharType buffer
Returns:
A CodeConverterBase::Result indicating the result of the test.
Exceptions:
NullPointerException - if from or from_end is null

ToInternalEncoding

static String ToInternalEncoding(UCS4Char ch)
Returns the Unicode character ch as a String containing a sequence of one or more CharType characters encoded using the OpenTop internal encoding scheme.

Parameters:
ch - the Unicode character to encode
Returns:
a String containing a sequence of one or more CharType characters
Exceptions:
IllegalCharacterException - if ch cannot be encoded into the OpenTop internal encoding

ToInternalEncoding

static Result ToInternalEncoding(UCS4Char ch,
                                 CharType* to,
                                 const CharType* to_limit,
                                 CharType*& to_next)
Converts a Unicode character value into a sequence of one or more CharType characters encoded according to the internal Unicode encoding scheme. The caller must provide a buffer of CharType characters that will be used to hold the result of the conversion.

Parameters:
ch - the Unicode character's code-point value in the range 0-0x10FFFF
to - a pointer to the first CharType character of a buffer to hold the result of the conversion
to_limit - a pointer to the next CharType character after the end of the output buffer
to_next - a return parameter, points to the first unused character in the passed CharType buffer
Returns:
A CodeConverterBase::Result indicating the result of the conversion.
Exceptions:
NullPointerException - if to or to_limit is null
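
The to / to_limit / to_next pattern can be sketched for UTF-8 as follows. This is an illustration under stated assumptions rather than the OpenTop code: encodeUtf8 and the EncodeResult enumerators are hypothetical names, surrogate code points are assumed to be rejected, and null-pointer checking is omitted.

```cpp
#include <cstdint>

// Hypothetical stand-in for CodeConverterBase::Result.
enum class EncodeResult { ok, outputExhausted, error };

// Encodes the code point ch as UTF-8 into [to, to_limit).
// On success, to_next points to the first unused character of the buffer.
// On failure, to_next is left equal to to.
EncodeResult encodeUtf8(std::uint32_t ch,
                        char* to,
                        const char* to_limit,
                        char*& to_next)
{
    to_next = to;
    if (ch > 0x10FFFF || (ch >= 0xD800 && ch <= 0xDFFF))
        return EncodeResult::error; // assumption: surrogates are rejected

    const int len = ch < 0x80 ? 1 : ch < 0x800 ? 2 : ch < 0x10000 ? 3 : 4;
    if (to_limit - to < len) return EncodeResult::outputExhausted;

    // Marker bits for the lead byte, indexed by sequence length.
    static const unsigned char leadMark[] = { 0, 0x00, 0xC0, 0xE0, 0xF0 };

    // Fill the trailing bytes from the end, 6 bits at a time.
    for (int i = len - 1; i > 0; --i) {
        to[i] = static_cast<char>(0x80 | (ch & 0x3F));
        ch >>= 6;
    }
    to[0] = static_cast<char>(leadMark[len] | ch);

    to_next = to + len;
    return EncodeResult::ok;
}
```

A buffer sized to the value returned by GetMaximumCharSequenceLength() is always large enough to hold one encoded character.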



Copyright © 2000-2003 ElCel Technology