Class List
Codec
Factory
Class Summary: Codec
+--XML:UnicodeCodec.Codec
Field Summary
bom-: SHORTINT This is a copy of the creating factory's Factory.bom.
invalidChars: LONGINT
Constructor Summary
Init(Codec, SHORTINT)
Method Summary
Decode(VAR ARRAY OF CHAR, LONGINT, LONGINT, VAR ARRAY OF LONGCHAR, LONGINT, LONGINT, VAR LONGINT, VAR LONGINT) Decodes the bytes in `source[sourceStart, sourceEnd[' into the Unicode sequence `dest[destStart, destEnd['.
Encode(VAR ARRAY OF LONGCHAR, LONGINT, LONGINT, VAR ARRAY OF CHAR, LONGINT, LONGINT, VAR LONGINT, VAR LONGINT) Encodes the Unicode characters in `source[sourceStart, sourceEnd[' into the byte sequence `dest[destStart, destEnd['.
EncodeBOM(VAR ARRAY OF CHAR, LONGINT, LONGINT, VAR LONGINT) Appends a byte order mark to dest.
Class Summary: Factory
+--XML:UnicodeCodec.Factory
Field Summary
bom-: SHORTINT This flag describes how the generated codecs deal with byte order markers.
Constructor Summary
GetFactory(ARRAY OF CHAR): Factory
InitFactory(Factory, SHORTINT) Initializes factory f with the byte order characteristic bom.
Method Summary
GetEncodingName(VAR ARRAY OF CHAR) Returns the preferred MIME name for the factory's encoding.
NewCodec(): Codec Creates a new codec from factory f.
NewCodecBOM(VAR ARRAY OF CHAR, LONGINT, LONGINT, VAR LONGINT): Codec Creates a new codec from factory f, taking the byte order mark into account.
Procedure Summary
Register(ARRAY OF CHAR, Factory)
Unregister(ARRAY OF CHAR)
Constant Summary
bomNotApplicable Byte order markers are not applicable to the encoding.
bomOptional A document using this encoding may begin with a byte order mark.
bomRequired The encoding requires that the document begins with a byte order mark.
byteOrderMark The Unicode byte order mark.
decodeError This character is used to replace incoming characters whose value is unknown or unrepresentable in Unicode.
encodeError This character is used to replace outgoing characters whose value cannot be represented by the encoding scheme in use.
maxUCS2EncodingLength Maximum number of UCS-2 codes needed to represent any UCS-4 character that Unicode can express.
maxUTF8EncodingLength Maximum length of the character encodings of all possible UCS-4 (not just Unicode!) characters, for all known encodings.
Class Detail: Codec
Field Detail
FIELD bom-: SHORTINT
This is a copy of the creating factory's Factory.bom.
FIELD invalidChars: LONGINT
Constructor Detail
PROCEDURE Init(codec: Codec; bom: SHORTINT)
Method Detail
PROCEDURE (codec: Codec) Decode(VAR source: ARRAY OF CHAR; sourceStart: LONGINT; sourceEnd: LONGINT; VAR dest: ARRAY OF LONGCHAR; destStart: LONGINT; destEnd: LONGINT; VAR sourceDone: LONGINT; VAR destDone: LONGINT)
Decodes the bytes in `source[sourceStart, sourceEnd[' into the Unicode sequence `dest[destStart, destEnd['.
Pre-condition: sourceStart < sourceEnd, and the character sequence `source[sourceStart, sourceEnd[' holds the characters that are to be decoded.
Pre-condition: destEnd-destStart >= maxUCS2EncodingLength. In other words, there must be enough room in the destination sequence `dest[destStart, destEnd[' to hold at least one UCS-4 character, possibly split into a high and low surrogate pair.
Pre-condition: sourceStart is the value of sourceDone of a previous call to this procedure (or Factory.NewCodecBOM), or the address of `source[sourceStart]' is aligned on a 4-byte boundary. This ensures that the decoder functions can access the source sequence in chunks of 2 and 4 bytes without needing to worry about the alignment of memory accesses.
Pre-condition: sourceEnd-sourceStart >= maxUTF8EncodingLength, or sourceEnd designates the end of the byte sequence being decoded. This means that at least one complete character is encoded in the input sequence, or that the input sequence ends with a possibly incomplete character.
Post-condition: sourceStart < sourceDone <= sourceEnd and destStart < destDone <= destEnd. This means that at least one character has been decoded.
Post-condition: sourceDone > sourceEnd-maxUTF8EncodingLength or destDone > destEnd-maxUCS2EncodingLength. This implies that the decoding algorithm continues until it gets near the end of the source or destination buffer. However, an implementation may stop as soon as the input or output sequence of the next character might not fit into the buffers; it need not decode the maximum number of bytes that fit into the buffers.
Post-condition: If the procedure was started with sourceEnd-sourceStart < maxUTF8EncodingLength, and if there is enough room in the destination buffer to store the whole result, then all remaining bytes in the source sequence have been decoded and sourceDone equals sourceEnd.
Every malformed character, and every decoded character code that cannot be mapped onto a Unicode character (i.e., one or two UCS-2 values), is replaced with decodeError, and the counter codec.invalidChars is incremented by one. The output of the decoding function contains only valid characters. That is, all surrogate codes are properly paired, and the character codes U+FFFE and U+FFFF are replaced with decodeError.
Post-condition: `dest[destStart, destDone[' holds the result of decoding `source[sourceStart, sourceDone['.
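The buffer contract above can be modelled in a few lines of Python. This is only a sketch under stated assumptions, not the module's Oberon-2 implementation: the names seq_len, decode and decode_all are hypothetical, the source encoding is assumed to be UTF-8 limited to 4-byte sequences, and a malformed sequence is skipped one byte at a time, which may count more invalid characters than the real codec would.

```python
MAX_UTF8 = 6           # maxUTF8EncodingLength, per the constant detail
MAX_UCS2 = 2           # maxUCS2EncodingLength: one surrogate pair
DECODE_ERROR = 0xFFFD  # REPLACEMENT CHARACTER


def seq_len(lead):
    """Expected UTF-8 sequence length for a lead byte (toy: 1 to 4 bytes)."""
    if 0xC0 <= lead < 0xE0:
        return 2
    if 0xE0 <= lead < 0xF0:
        return 3
    if 0xF0 <= lead < 0xF8:
        return 4
    return 1  # ASCII, stray continuation byte, or unsupported lead byte


def decode(source, s_start, s_end, dest, d_start, d_end):
    """Hypothetical model of Codec.Decode; returns (source_done, dest_done, invalid)."""
    assert s_start < s_end and d_end - d_start >= MAX_UCS2
    s, d, invalid = s_start, d_start, 0
    tail = s_end - s_start < MAX_UTF8  # short input: treat it as the end of the stream
    while s < s_end and d_end - d >= MAX_UCS2 and (tail or s_end - s >= MAX_UTF8):
        n = seq_len(source[s])
        try:
            cp = ord(source[s:min(s + n, s_end)].decode("utf-8"))
            s += n
        except UnicodeDecodeError:
            cp, s = DECODE_ERROR, s + 1  # skip one byte of the malformed input
            invalid += 1
        if cp in (0xFFFE, 0xFFFF):       # never passed on to the application
            cp = DECODE_ERROR
            invalid += 1
        if cp >= 0x10000:                # split into a high and low surrogate
            cp -= 0x10000
            dest[d], dest[d + 1] = 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)
            d += 2
        else:
            dest[d] = cp
            d += 1
    return s, d, invalid


def decode_all(data, dest_size=8):
    """Caller-side loop: feed the whole input through one small output buffer."""
    out, invalid, s = [], 0, 0
    dest = [0] * dest_size
    while s < len(data):
        s, d, bad = decode(data, s, len(data), dest, 0, dest_size)
        out.extend(dest[:d])
        invalid += bad
    return out, invalid
```

Note how the first call stops once fewer than MAX_UTF8 bytes remain, and the follow-up call, started with a short input, consumes the tail completely. This mirrors the two post-conditions on sourceDone.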
PROCEDURE (codec: Codec) Encode(VAR source: ARRAY OF LONGCHAR; sourceStart: LONGINT; sourceEnd: LONGINT; VAR dest: ARRAY OF CHAR; destStart: LONGINT; destEnd: LONGINT; VAR sourceDone: LONGINT; VAR destDone: LONGINT)
Encodes the Unicode characters in `source[sourceStart, sourceEnd[' into the byte sequence `dest[destStart, destEnd['.
Pre-condition: sourceStart < sourceEnd, and the Unicode character sequence `source[sourceStart, sourceEnd[' holds the characters that are to be encoded.
Pre-condition: destEnd-destStart >= maxUTF8EncodingLength. In other words, there must be enough room in the destination sequence `dest[destStart, destEnd[' to hold at least one UCS-4 character, possibly encoded as a sequence of 6 bytes.
Pre-condition: destStart is the value of destDone of a previous call to this procedure (or Factory.NewCodecBOM), or the address of `dest[destStart]' is aligned on a 4-byte boundary. This ensures that the encoder functions can access the destination sequence in chunks of 2 and 4 bytes without needing to worry about the alignment of memory accesses.
Pre-condition: sourceEnd-sourceStart >= maxUCS2EncodingLength, or sourceEnd designates the end of the character sequence being encoded. This means that at least one complete character is in the input sequence, or that the input sequence ends with a possibly incomplete character.
Post-condition: sourceStart < sourceDone <= sourceEnd and destStart < destDone <= destEnd. This means that at least one character has been encoded.
Post-condition: sourceDone > sourceEnd-maxUCS2EncodingLength or destDone > destEnd-maxUTF8EncodingLength. This implies that the encoding algorithm continues until it gets near the end of the source or destination buffer. However, an implementation may stop as soon as the input or output sequence of the next character might not fit into the buffers; it need not encode the maximum number of characters that fit into the buffers.
Post-condition: If the procedure was started with sourceEnd-sourceStart < maxUCS2EncodingLength, and if there is enough room in the destination buffer to store the whole result, then all remaining characters in the source sequence have been encoded and sourceDone equals sourceEnd.
Every malformed character, and every character code that cannot be mapped onto a valid encoding, is replaced with encodeError, and the counter codec.invalidChars is incremented by one. Out of range Unicode characters encoded as a (high, low) surrogate pair are recognized as a single invalid character. The character codes U+FFFE and U+FFFF are also mapped to encodeError.
Post-condition: `dest[destStart, destDone[' holds the result of encoding `source[sourceStart, sourceDone['.
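The encoding direction can be sketched the same way. The Python model below is an illustration, not the module's code: encode and encode_all are hypothetical names, the target encoding is assumed to be UTF-8, and the input is a list of UCS-2 code values. It pairs surrogates, replaces unpaired surrogates and U+FFFE/U+FFFF with the question mark, and stops early when the next character might not fit.

```python
MAX_UTF8 = 6         # maxUTF8EncodingLength
MAX_UCS2 = 2         # maxUCS2EncodingLength
ENCODE_ERROR = 0x3F  # question mark, the encodeError replacement


def encode(source, s_start, s_end, dest, d_start, d_end):
    """Hypothetical model of Codec.Encode; returns (source_done, dest_done, invalid)."""
    assert s_start < s_end and d_end - d_start >= MAX_UTF8
    s, d, invalid = s_start, d_start, 0
    tail = s_end - s_start < MAX_UCS2  # a single code left: end of the stream
    while s < s_end and d_end - d >= MAX_UTF8 and (tail or s_end - s >= MAX_UCS2):
        cp = source[s]
        s += 1
        if 0xD800 <= cp < 0xDC00 and s < s_end and 0xDC00 <= source[s] < 0xE000:
            # Combine a (high, low) surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (source[s] - 0xDC00)
            s += 1
        if 0xD800 <= cp < 0xE000 or cp in (0xFFFE, 0xFFFF):
            data = bytes([ENCODE_ERROR])  # unpaired surrogate or noncharacter
            invalid += 1
        else:
            data = chr(cp).encode("utf-8")
        dest[d:d + len(data)] = data      # same-length slice: buffer size is kept
        d += len(data)
    return s, d, invalid


def encode_all(codes, dest_size=8):
    """Caller-side loop: push all UCS-2 codes through one small byte buffer."""
    out, invalid, s = bytearray(), 0, 0
    dest = bytearray(dest_size)
    while s < len(codes):
        s, d, bad = encode(codes, s, len(codes), dest, 0, dest_size)
        out += dest[:d]
        invalid += bad
    return bytes(out), invalid
```

With an 8-byte destination, the first call in encode_all returns as soon as fewer than MAX_UTF8 bytes are free, even though a short character would still fit. The contract explicitly permits this conservative behaviour.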
PROCEDURE (codec: Codec) EncodeBOM(VAR dest: ARRAY OF CHAR; destStart: LONGINT; destEnd: LONGINT; VAR destDone: LONGINT)
Appends a byte order mark to dest. If the field codec.bom is bomNotApplicable, nothing is done and destDone is set to destStart. Otherwise, the codec-specific encoding of the byte order mark is appended to dest, and destDone is set to the position after the mark.
Pre-condition: destEnd-destStart >= maxUTF8EncodingLength. In other words, there must be enough room in the destination sequence `dest[destStart, destEnd[' to hold at least one UCS-4 character, possibly encoded as a sequence of 6 bytes.
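As an illustration of the behaviour described for EncodeBOM, here is a small Python model for a UTF-16 style codec. The function name, the numeric values of the bom constants, and the byte_order parameter are assumptions made for this sketch; the real codec derives the mark's encoding from its own state.

```python
BOM_NOT_APPLICABLE, BOM_OPTIONAL, BOM_REQUIRED = 0, 1, 2  # hypothetical values
BYTE_ORDER_MARK = 0xFEFF
MAX_UTF8 = 6  # maxUTF8EncodingLength


def encode_bom(bom, byte_order, dest, d_start, d_end):
    """Model of Codec.EncodeBOM for a 16-bit encoding; returns dest_done."""
    if bom == BOM_NOT_APPLICABLE:
        return d_start                       # nothing is written
    assert d_end - d_start >= MAX_UTF8       # documented pre-condition
    mark = BYTE_ORDER_MARK.to_bytes(2, byte_order)  # "big" or "little"
    dest[d_start:d_start + 2] = mark
    return d_start + 2                       # position after the mark
```

A caller would typically invoke this once on an empty output buffer and pass the returned position as destStart to the first Encode call.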
Class Detail: Factory
Field Detail
FIELD bom-: SHORTINT
This flag describes how the generated codecs deal with byte order markers. It can be one of bomNotApplicable, bomOptional, or bomRequired.
Constructor Detail
PROCEDURE GetFactory(name: ARRAY OF CHAR): Factory
PROCEDURE InitFactory(f: Factory; bom: SHORTINT)
Initializes factory f with the byte order characteristic bom.
Method Detail
PROCEDURE (f: Factory) GetEncodingName(VAR name: ARRAY OF CHAR)
Returns the preferred MIME name for the factory's encoding.
PROCEDURE (f: Factory) NewCodec(): Codec
Creates a new codec from factory f. This should not be called for factories with a Factory.bom of bomOptional or bomRequired.
PROCEDURE (f: Factory) NewCodecBOM(VAR source: ARRAY OF CHAR; sourceStart: LONGINT; sourceEnd: LONGINT; VAR sourceDone: LONGINT): Codec
Creates a new codec from factory f, taking the byte order mark into account. The exact behaviour of this procedure depends on the value of f.bom.
bomNotApplicable Any byte order mark is ignored, and sourceDone is set to sourceStart.
bomOptional If the source begins with a byte order mark, the mark is removed from the input, the corresponding codec is returned, and sourceDone is set to the position after the byte order mark. If there is no byte order mark, sourceDone is set to sourceStart and the default codec is returned.
bomRequired In the presence of a byte order mark, this is just like bomOptional. Without a byte order mark, the returned codec's Codec.invalidChars counter is set to one and sourceDone is set to sourceStart.
Pre-condition: sourceEnd-sourceStart >= maxUTF8EncodingLength, or sourceEnd designates the end of the byte sequence being decoded. This means that at least one complete character is encoded in the input sequence, or that the input sequence ends with a possibly incomplete character.
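The three cases can be summarised in a short Python model for a 16-bit encoding. Everything here is a sketch under assumptions: new_codec_bom is a hypothetical name, the constant values are invented, and the returned tuple (byte order, sourceDone, invalidChars) stands in for the codec object that the real procedure returns.

```python
BOM_NOT_APPLICABLE, BOM_OPTIONAL, BOM_REQUIRED = 0, 1, 2  # hypothetical values


def new_codec_bom(bom, source, s_start, s_end, default_order="big"):
    """Model of Factory.NewCodecBOM; returns (byte_order, source_done, invalid)."""
    if bom == BOM_NOT_APPLICABLE:
        return default_order, s_start, 0     # any mark is left in the input
    head = bytes(source[s_start:min(s_start + 2, s_end)])
    if head == b"\xfe\xff":
        return "big", s_start + 2, 0         # mark consumed, big endian codec
    if head == b"\xff\xfe":
        return "little", s_start + 2, 0      # mark consumed, little endian codec
    # No mark: fall back to the default codec; flag the error if one was required.
    invalid = 1 if bom == BOM_REQUIRED else 0
    return default_order, s_start, invalid
```

The returned source_done is the value a caller would pass as sourceStart to the first Decode call, which is why Decode accepts a start position produced by this procedure.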
Procedure Detail
PROCEDURE Register(name: ARRAY OF CHAR; factory: Factory)
PROCEDURE Unregister(name: ARRAY OF CHAR)
Constant Detail
CONST bomNotApplicable
Byte order markers are not applicable to the encoding. This is the case if the encoding
maps all characters to single bytes (like `US-ASCII' or `ISO-8859-1'),
is byte order independent (like `UTF-8'), or
defines the byte order explicitly and requires that the document does not start with a byte order mark (like `UTF-16LE' and `UTF-16BE').
All instances of U+FEFF are passed to the application, regardless of their position in the document. Therefore, if a document using such an encoding begins with U+FEFF, this character is reported to the application as `ZERO WIDTH NO-BREAK SPACE'.
CONST bomOptional
A document using this encoding may begin with a byte order mark. Without a byte order mark, an encoding-specific default byte order is assumed. An example is `UTF-16', which defaults to big endian byte order in the absence of a byte order mark. If a document begins with a byte order mark, then this first character should not be reported to the application.
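The difference between bomNotApplicable and bomOptional can be observed with Python's standard codecs, used here purely as an analogy; they are not part of this module. Python's `utf-16-be' behaves like an encoding with an explicit byte order, and `utf-16' behaves like one with an optional mark. Note that, without a mark, Python's `utf-16' falls back to the platform's byte order rather than big endian, so only the case with a mark is shown.

```python
data = b"\xfe\xff\x00A"  # U+FEFF followed by "A", big endian

# Explicit byte order: U+FEFF is an ordinary character and reaches the caller.
kept = data.decode("utf-16-be")

# Optional byte order mark: the mark selects the byte order and is dropped.
stripped = data.decode("utf-16")

print([hex(ord(c)) for c in kept])
print([hex(ord(c)) for c in stripped])
```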
CONST bomRequired
The encoding requires that the document begins with a byte order mark. The BOM should not be passed to the application. Currently, no codec implementation uses this variant.
CONST byteOrderMark
The Unicode byte order mark. Also known as `ZERO WIDTH NO-BREAK SPACE'.
CONST decodeError
This character is used to replace incoming characters whose value is unknown or unrepresentable in Unicode. This character is assigned the Unicode name "REPLACEMENT CHARACTER".
CONST encodeError
This character is used to replace outgoing characters whose value cannot be represented by the encoding scheme in use. It is assumed that all character encodings know the question mark character (Unicode U+003F).
CONST maxUCS2EncodingLength
Maximum number of UCS-2 codes needed to represent any UCS-4 character that Unicode can express. In the worst case, a pair of high and low surrogate codes must be used.
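The worst case of two codes follows from the standard surrogate arithmetic, sketched here in Python. The function name to_ucs2 is hypothetical and not part of the module.

```python
def to_ucs2(cp):
    """Return the UCS-2 codes (one code, or a surrogate pair) for a code point."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp < 0xE000:
        raise ValueError("not representable in Unicode")
    if cp < 0x10000:
        return [cp]                       # fits into a single 16-bit code
    cp -= 0x10000                         # 20 bits remain
    return [0xD800 + (cp >> 10),          # high surrogate: upper 10 bits
            0xDC00 + (cp & 0x3FF)]        # low surrogate: lower 10 bits
```

No input yields more than two codes, which is why a destination window of maxUCS2EncodingLength entries always has room for one decoded character.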
CONST maxUTF8EncodingLength
Maximum length of the character encodings of all possible UCS-4 (not just Unicode!) characters, for all known encodings. Currently, the longest encoding in UTF-8 needs 6 bytes.
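The figure of 6 bytes comes from the original UTF-8 scheme, which covers the full 31-bit UCS-4 range; the later restriction of UTF-8 to the Unicode range caps sequences at 4 bytes. A minimal Python sketch of the length rule, with a hypothetical function name:

```python
def utf8_length(code):
    """Bytes needed by the original (up to 6 byte) UTF-8 scheme for a UCS-4 code."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("outside the UCS-4 range")
    # Upper bounds (exclusive) for sequences of 1, 2, 3, 4 and 5 bytes.
    limits = (0x80, 0x800, 0x10000, 0x200000, 0x4000000)
    for length, limit in enumerate(limits, start=1):
        if code < limit:
            return length
    return 6
```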