XML:UnicodeCodec

Class List
Codec
Factory
Class Summary: Codec [Detail]
  +--XML:UnicodeCodec.Codec
Field Summary
bom-: SHORTINT

          This is a copy of the creating factory's Factory.bom.
invalidChars: LONGINT

          
Constructor Summary
Init(Codec, SHORTINT)

          
Method Summary
Decode(VAR ARRAY OF CHAR, LONGINT, LONGINT, VAR ARRAY OF LONGCHAR, LONGINT, LONGINT, VAR LONGINT, VAR LONGINT)

          Decodes the bytes in `source[sourceStart, sourceEnd[' into the Unicode sequence `dest[destStart, destEnd['.
Encode(VAR ARRAY OF LONGCHAR, LONGINT, LONGINT, VAR ARRAY OF CHAR, LONGINT, LONGINT, VAR LONGINT, VAR LONGINT)

          Encodes the Unicode characters in `source[sourceStart, sourceEnd[' into the byte sequence `dest[destStart, destEnd['.
EncodeBOM(VAR ARRAY OF CHAR, LONGINT, LONGINT, VAR LONGINT)

          Appends a byte order mark to dest.
 
Class Summary: Factory [Detail]
  +--XML:UnicodeCodec.Factory
Field Summary
bom-: SHORTINT

          This flag describes how the generated codecs deal with byte order markers.
Constructor Summary
GetFactory(ARRAY OF CHAR): Factory

          
InitFactory(Factory, SHORTINT)

          Initializes factory f with the byte order characteristic bom.
Method Summary
GetEncodingName(VAR ARRAY OF CHAR)

          Returns the preferred MIME name for the factory's encoding.
NewCodec(): Codec

          Creates a new codec from factory f.
NewCodecBOM(VAR ARRAY OF CHAR, LONGINT, LONGINT, VAR LONGINT): Codec

          Creates a new codec from factory f, taking the byte order mark into account.
 
Procedure Summary
Register(ARRAY OF CHAR, Factory)

          
Unregister(ARRAY OF CHAR)

          
Constant Summary
bomNotApplicable

          Byte order markers are not applicable to the encoding.
bomOptional

          A document using this encoding may begin with a byte order mark.
bomRequired

          The encoding requires that the document begins with a byte order mark.
byteOrderMark

          The Unicode byte order mark.
decodeError

          This character is used to replace incoming characters whose value is unknown or unrepresentable in Unicode.
encodeError

          This character is used to replace outgoing characters whose value cannot be represented by the encoding scheme in use.
maxUCS2EncodingLength

          Maximum length of any UCS-4 character when encoded with 16-bit Unicode codes.
maxUTF8EncodingLength

          Maximum length of the character encodings of all possible UCS-4 (not just Unicode!) characters, for all known encodings.

Class Detail: Codec
Field Detail

bom

FIELD bom-: SHORTINT

This is a copy of the creating factory's Factory.bom.


invalidChars

FIELD invalidChars: LONGINT

Constructor Detail

Init

PROCEDURE Init(codec: Codec; 
               bom: SHORTINT)

Method Detail

Decode

PROCEDURE (codec: Codec) Decode(VAR source: ARRAY OF CHAR; 
                 sourceStart: LONGINT; 
                 sourceEnd: LONGINT; 
                 VAR dest: ARRAY OF LONGCHAR; 
                 destStart: LONGINT; 
                 destEnd: LONGINT; 
                 VAR sourceDone: LONGINT; 
                 VAR destDone: LONGINT)

Decodes the bytes in `source[sourceStart, sourceEnd[' into the Unicode sequence `dest[destStart, destEnd['.


Encode

PROCEDURE (codec: Codec) Encode(VAR source: ARRAY OF LONGCHAR; 
                 sourceStart: LONGINT; 
                 sourceEnd: LONGINT; 
                 VAR dest: ARRAY OF CHAR; 
                 destStart: LONGINT; 
                 destEnd: LONGINT; 
                 VAR sourceDone: LONGINT; 
                 VAR destDone: LONGINT)

Encodes the Unicode characters in `source[sourceStart, sourceEnd[' into the byte sequence `dest[destStart, destEnd['.


EncodeBOM

PROCEDURE (codec: Codec) EncodeBOM(VAR dest: ARRAY OF CHAR; 
                    destStart: LONGINT; 
                    destEnd: LONGINT; 
                    VAR destDone: LONGINT)

Appends a byte order mark to dest. If the field codec.bom is bomNotApplicable, nothing is done and destDone is set to destStart. Otherwise, the codec-specific encoding of the byte order mark is appended to dest, and destDone is set to the position after the mark.

Pre-condition: destEnd-destStart >= maxUTF8EncodingLength. In other words, there must be enough room in the destination sequence `dest[destStart, destEnd[' to hold at least one UCS-4 character, possibly encoded as a sequence of 6 bytes.
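The contract described above can be modelled in a few lines. This is a Python sketch of the documented behaviour, not the library's Oberon-2 code: `encode_bom`, its `encoding` parameter, and the numeric constant values are hypothetical stand-ins, since the document does not state the real values.

```python
# Hypothetical stand-ins; the library's real constant values are not given here.
BOM_NOT_APPLICABLE, BOM_OPTIONAL, BOM_REQUIRED = 0, 1, 2
MAX_UTF8_ENCODING_LENGTH = 6      # "possibly encoded as a sequence of 6 bytes"
BYTE_ORDER_MARK = "\ufeff"

def encode_bom(bom, encoding, dest, dest_start, dest_end):
    """Write the byte order mark into dest[dest_start:dest_end]; return dest_done."""
    # Pre-condition from the text: room for one worst-case character.
    assert dest_end - dest_start >= MAX_UTF8_ENCODING_LENGTH
    if bom == BOM_NOT_APPLICABLE:
        return dest_start                       # nothing is written
    mark = BYTE_ORDER_MARK.encode(encoding)     # codec-specific encoding of the mark
    dest[dest_start:dest_start + len(mark)] = mark
    return dest_start + len(mark)               # position after the mark

buf = bytearray(8)
done = encode_bom(BOM_OPTIONAL, "utf-16-be", buf, 0, len(buf))
```

With a big endian 16-bit encoding, `done` is 2 and the buffer starts with the bytes FE FF. With `BOM_NOT_APPLICABLE` the buffer is left untouched and the start position is returned.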

 
Class Detail: Factory
Field Detail

bom

FIELD bom-: SHORTINT

This flag describes how the generated codecs deal with byte order markers. It can be one of bomNotApplicable, bomOptional, or bomRequired.

Constructor Detail

GetFactory

PROCEDURE GetFactory(name: ARRAY OF CHAR): Factory

InitFactory

PROCEDURE InitFactory(f: Factory; 
                      bom: SHORTINT)

Initializes factory f with the byte order characteristic bom.

Method Detail

GetEncodingName

PROCEDURE (f: Factory) GetEncodingName(VAR name: ARRAY OF CHAR)

Returns the preferred MIME name for the factory's encoding.


NewCodec

PROCEDURE (f: Factory) NewCodec(): Codec

Creates a new codec from factory f. This should not be called for factories with a Factory.bom of bomOptional or bomRequired.


NewCodecBOM

PROCEDURE (f: Factory) NewCodecBOM(VAR source: ARRAY OF CHAR; 
                      sourceStart: LONGINT; 
                      sourceEnd: LONGINT; 
                      VAR sourceDone: LONGINT): Codec

Creates a new codec from factory f, taking the byte order mark into account. The exact behaviour of this procedure depends on the value of f.bom.

bomNotApplicable

Any byte order mark is ignored, and sourceDone is set to sourceStart.

bomOptional

If the source begins with a byte order mark, it is removed from the input, the corresponding codec is returned, and sourceDone is set to the position after the end of the byte order mark. If there is no byte order mark, sourceDone is set to sourceStart and the default codec is returned.

bomRequired

In the presence of a byte order mark, this is just like bomOptional. Without a byte order mark, the returned codec's Codec.invalidChars counter is set to one and sourceDone is set to sourceStart.

Pre-condition: sourceEnd-sourceStart >= maxUTF8EncodingLength, or sourceEnd designates the end of the byte sequence being decoded. This means that either at least one complete character is encoded in the input sequence, or the input sequence ends with a possibly incomplete character.
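The three cases can be illustrated with a small model for a UTF-16 style factory. This is a Python sketch under stated assumptions, not the library's implementation: `new_codec_bom`, the tuple it returns in place of a Codec object, and the numeric constant values are all hypothetical. The big endian default follows the description of bomOptional.

```python
# Hypothetical stand-ins; the library's real constant values are not given here.
BOM_NOT_APPLICABLE, BOM_OPTIONAL, BOM_REQUIRED = 0, 1, 2

def new_codec_bom(bom, source, source_start, source_end, default="utf-16-be"):
    """Return (encoding, invalid_chars, source_done) instead of a Codec object."""
    if bom == BOM_NOT_APPLICABLE:
        return default, 0, source_start            # any mark is ignored
    head = bytes(source[source_start:min(source_start + 2, source_end)])
    if head == b"\xfe\xff":
        return "utf-16-be", 0, source_start + 2    # mark consumed
    if head == b"\xff\xfe":
        return "utf-16-le", 0, source_start + 2    # mark consumed
    # No mark: default codec; a missing required mark counts as one invalid character.
    invalid = 1 if bom == BOM_REQUIRED else 0
    return default, invalid, source_start

with_mark = new_codec_bom(BOM_OPTIONAL, b"\xff\xfe<\x00", 0, 4)
without_mark = new_codec_bom(BOM_OPTIONAL, b"\x00<", 0, 2)
missing_required = new_codec_bom(BOM_REQUIRED, b"\x00<", 0, 2)
```

Only the last call reports an invalid character. In both calls without a mark, the returned position equals the start position, so decoding begins at the first byte.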

 
Procedure Detail

Register

PROCEDURE Register(name: ARRAY OF CHAR; 
                   factory: Factory)

Unregister

PROCEDURE Unregister(name: ARRAY OF CHAR)

Constant Detail

bomNotApplicable

CONST bomNotApplicable 

Byte order markers are not applicable to the encoding. This is the case if the encoding

All instances of UxFEFF are passed to the application, regardless of their position in the document. Therefore, if a document using such an encoding begins with UxFEFF, this character is reported to the application as `ZERO WIDTH NO-BREAK SPACE'.
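Python's built-in "utf-8" codec happens to behave in the same way, which makes it a convenient illustration. The snippet is only an analogy to the behaviour described above, not code from this module.

```python
import unicodedata

data = b"\xef\xbb\xbf<doc/>"      # UTF-8 bytes that begin with the code FEFF
text = data.decode("utf-8")       # the plain "utf-8" codec keeps the character
first = text[0]
name = unicodedata.name(first)    # official Unicode character name
```

The leading character survives decoding and reaches the caller as an ordinary character with the name `ZERO WIDTH NO-BREAK SPACE`.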


bomOptional

CONST bomOptional 

A document using this encoding may begin with a byte order mark. Without a byte order mark, an encoding-specific default byte order is assumed. An example is `UTF-16', which defaults to big endian byte order in the absence of a byte order mark. If a document begins with a byte order mark, then this first character should not be reported to the application.


bomRequired

CONST bomRequired 

The encoding requires that the document begins with a byte order mark. The BOM should not be passed to the application. Currently, no codec implementation uses this variant.


byteOrderMark

CONST byteOrderMark 

The Unicode byte order mark. Also known as `ZERO WIDTH NO-BREAK SPACE'.


decodeError

CONST decodeError 

This character is used to replace incoming characters whose value is unknown or unrepresentable in Unicode. This character is assigned the Unicode name "REPLACEMENT CHARACTER".
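As an analogy, Python's standard decoders substitute the same character when asked to replace malformed input. The snippet below is an illustration only. The name `DECODE_ERROR` is a hypothetical stand-in for this constant, and its value is inferred from the Unicode name given above.

```python
DECODE_ERROR = "\ufffd"           # REPLACEMENT CHARACTER

raw = b"ab\xffcd"                 # the byte 0xFF can never occur in UTF-8
text = raw.decode("utf-8", errors="replace")
replaced = text.count(DECODE_ERROR)
```

The undecodable byte becomes a single replacement character, and the surrounding text is decoded normally.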


encodeError

CONST encodeError 

This character is used to replace outgoing characters whose value cannot be represented by the encoding scheme in use. It is assumed that all character encodings know the question mark character (Unicode Ux003F).
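Python's encoders follow the same convention when asked to replace unencodable characters, so they can serve as an illustration. This is an analogy only, and `ENCODE_ERROR` is a hypothetical name for this constant.

```python
ENCODE_ERROR = "?"                # question mark, code 003F

text = "na\u00efve \u20ac5"       # contains two characters outside ASCII
encoded = text.encode("ascii", errors="replace")
```

Each character that ASCII cannot represent is written as one question mark, so the output has the same length as the input.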


maxUCS2EncodingLength

CONST maxUCS2EncodingLength 

Maximum length of any UCS-4 character when encoded with 16-bit Unicode codes. In the worst case, a pair of high and low surrogate codes must be used.
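The worst case can be worked through with the standard surrogate arithmetic. This is a Python sketch, and `to_ucs2_codes` is a hypothetical helper, not part of this module. It also shows that surrogates only reach code points up to 10FFFF.

```python
def to_ucs2_codes(code_point):
    """Return the 16-bit codes for one character (a surrogate pair if needed)."""
    if code_point < 0x10000:
        return [code_point]                   # fits in a single 16-bit code
    if code_point > 0x10FFFF:
        raise ValueError("not representable with a surrogate pair")
    offset = code_point - 0x10000             # 20 bits to distribute
    high = 0xD800 + (offset >> 10)            # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)           # lower 10 bits
    return [high, low]

single = to_ucs2_codes(0x20AC)
pair = to_ucs2_codes(0x1F600)
```

No character needs more than two codes, which matches the worst case described above.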


maxUTF8EncodingLength

CONST maxUTF8EncodingLength 

Maximum length of the character encodings of all possible UCS-4 (not just Unicode!) characters, for all known encodings. Currently, the longest encoding in UTF-8 needs 6 bytes.
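The 6 byte figure comes from the original definition of UTF-8, which covers the full 31-bit UCS-4 range. The following Python sketch of that scheme is for illustration only. `utf8_length` and `utf8_encode` are hypothetical helpers, not procedures of this module.

```python
def utf8_length(code_point):
    """Number of bytes in the original, 31-bit capable form of UTF-8."""
    limits = [0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000]
    for length, limit in enumerate(limits, start=1):
        if code_point < limit:
            return length
    raise ValueError("beyond the 31-bit UCS-4 range")

def utf8_encode(code_point):
    """Encode one UCS-4 character as 1 to 6 bytes."""
    n = utf8_length(code_point)
    if n == 1:
        return bytes([code_point])
    out = []
    for _ in range(n - 1):                    # continuation bytes, last one first
        out.append(0x80 | (code_point & 0x3F))
        code_point >>= 6
    lead = (0xFF << (8 - n)) & 0xFF           # n leading 1 bits followed by a 0
    out.append(lead | code_point)
    return bytes(reversed(out))

longest = utf8_encode(0x7FFFFFFF)
```

The largest UCS-4 value needs all 6 bytes, while every Unicode character up to 10FFFF fits in at most 4.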