Validating a document against a RELAX NG schema is similar to matching some text against a regular expression. If the document "matches" the schema, the document is valid, no matter which sub-expressions were used during the match.
Example: string "b" matches regular expression "(a?,b)|(b,c?)" and we don't care if it matches sub-expression "(a?,b)" or sub-expression "(b,c?)". The situation is exactly the same with RELAX NG schemas: simply replace the characters and the character classes used in a regular expression by RELAX NG patterns.
The job of a RELAX NG schema is to validate a document as a whole, and that's it. For XXE, the problem to solve is different. One of the main jobs of XXE is to guide the user when he/she edits an XML document. That is, XXE must identify the content model of the element which is being edited, in order to suggest the right attributes and the right child elements for it.
To do that, XXE needs to know precisely which "sub-expressions were used during the match". Unfortunately, this is sometimes impossible.
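The difference between the two questions can be sketched with Python's `re` module. This is only an illustration of the regular expression analogy, not of how XXE works internally. The commas of "(a?,b)|(b,c?)" denote sequence, so they are dropped in Python syntax, and the names `is_valid` and `matching_branches` are made up for this sketch.

```python
import re

# The two alternatives of "(a?,b)|(b,c?)" in Python regex syntax.
LEFT = r"a?b"
RIGHT = r"bc?"
WHOLE = rf"(?:{LEFT})|(?:{RIGHT})"


def is_valid(text: str) -> bool:
    """What a validator asks: does the whole expression match?"""
    return re.fullmatch(WHOLE, text) is not None


def matching_branches(text: str) -> list[str]:
    """What an editor asks: which sub-expressions could have matched?"""
    branches = (("left", LEFT), ("right", RIGHT))
    return [name for name, pattern in branches if re.fullmatch(pattern, text)]


print(is_valid("b"))            # True
print(matching_branches("b"))   # ['left', 'right']: ambiguous
print(matching_branches("ab"))  # ['left']: unambiguous
```

The string "b" is valid, yet two branches claim it. A validator can stop at the first answer, while a tool that must guide editing needs the second one to be unique.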
All examples used in this section are found in XXE_install_dir/doc/rngsupport/samples/. Note that they are all valid schemas and valid documents.
Example 1. Ambiguous elements
RELAX NG schema, target.rnc:
start = build-element

build-element = element build {
  target-element*
}

target-element = element target {
  attribute name { xsd:ID },
  element list { ref-element* }?,
  element list { action-element* }?
}

ref-element = element ref {
  attribute name { xsd:IDREF }
}

action-element = element action { text }
Document conforming to the above schema, target_bad.xml:
<build>
  <target name="all">
    <list>
    </list>
  </target>
  <target name="compile"/>
  <target name="link"/>
</build>
If you open target_bad.xml in XXE and select the list element, XXE is lost: is it the list element which contains refs or is it the list element which contains actions? Both list content models are fine in the case of an empty list element!
Now, if you open target_good.xml in XXE, there is no problem at all:
<build>
  <target name="all">
    <list>
      <ref name="compile"/>
      <ref name="link"/>
    </list>
    <list>
      <action>cc -c *.c</action>
      <action>cc *.o</action>
    </list>
  </target>
  <target name="compile"/>
  <target name="link"/>
</build>
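The ambiguity of the empty list element can be reproduced with a small sketch. It assumes a hypothetical encoding in which each list content model of target.rnc becomes a regular expression over the names of the child elements. It ignores the position of the list inside target, and it is not a description of XXE's actual algorithm.

```python
import re

# Hypothetical encoding of the two <list> content models of target.rnc.
# Each child element name is followed by one space so names cannot run together.
LIST_MODELS = {
    "list of ref": r"(?:ref )*",
    "list of action": r"(?:action )*",
}


def candidate_models(children: list[str]) -> list[str]:
    """Return every <list> content model that the given children satisfy."""
    encoded = "".join(name + " " for name in children)
    return [label for label, pattern in LIST_MODELS.items()
            if re.fullmatch(pattern, encoded)]


print(candidate_models([]))                    # ['list of ref', 'list of action']
print(candidate_models(["ref", "ref"]))        # ['list of ref']
print(candidate_models(["action", "action"]))  # ['list of action']
```

An empty list yields two candidates, which is the situation in target_bad.xml. As soon as the list has at least one child, only one candidate remains, as in target_good.xml.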
However, in the vast majority of realistic cases, XXE knows how to distinguish between two child elements having the same name but different content models.
RELAX NG schema, sect.rnc:
start = doc-element

doc-element = element doc {
  (simple-sect | recursive-sect)+
}

simple-sect = element sect {
  attribute class { "simple" },
  paragraph-element*
}

recursive-sect = element sect {
  attribute class { "recursive" }?,
  (recursive-sect | simple-sect)*
}

paragraph-element = element paragraph { text }
Document conforming to the above schema, sect.xml:
<doc>
  <sect>
    <sect></sect>
    <sect class="simple">
      <paragraph>Paragraph 2.</paragraph>
    </sect>
  </sect>
  <sect class="simple"></sect>
</doc>
In the above example, XXE has no problem at all distinguishing between the empty <sect> element and the empty <sect class="simple"> element. The reason is that the first kind of sect element (simple-sect in the schema) has a required attribute class with simple as its fixed value.
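The role played by the class attribute can be sketched as follows. The function `sect_models` is a hypothetical helper written for this illustration. It hard-codes the two sect patterns of sect.rnc and only looks at the class attribute and the names of the child elements.

```python
from typing import Optional


def sect_models(class_attr: Optional[str], children: list[str]) -> list[str]:
    """Return the sect patterns of sect.rnc that fit an element.

    class_attr is None when the class attribute is absent.
    """
    matches = []
    # simple-sect: class="simple" is required, children are paragraphs only.
    if class_attr == "simple" and all(c == "paragraph" for c in children):
        matches.append("simple-sect")
    # recursive-sect: class is optional but must be "recursive" when present,
    # children are sect elements only.
    if class_attr in (None, "recursive") and all(c == "sect" for c in children):
        matches.append("recursive-sect")
    return matches


print(sect_models(None, []))      # ['recursive-sect']
print(sect_models("simple", []))  # ['simple-sect']
```

Both elements are empty, yet each one fits exactly one pattern, because the two patterns never accept the same value (or absence) of the class attribute.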
When XXE is "lost", it automatically enters a lenient editing mode. In this mode, XXE can no longer guide the user when he/she edits the element which poses problems.
The node path bar is used to signal elements which are in this non-validating, lenient editing mode:
An element underlined in red means that this element is in non-validating mode 2. In this mode, XMLmind XML Editor is not able to suggest the right attributes and the right child elements to the user. The user may add and remove any attributes and child elements he/she wants, at any place and in any number.
An element underlined in orange means that this element is in non-validating mode 1. In this mode, XMLmind XML Editor still suggests the right attributes and child elements to the user. But these are only suggestions: the user may add and remove any attributes and child elements he/she wants, at any place and in any number.
Note that the lenient editing mode is local to an element and its descendants. It is not used for the whole document, but just for the element with which XXE has trouble.
Also note that, after modifying an element which poses problems to XXE, if these problems are solved, XXE will automatically switch to its normal, strict, validating mode.
The node path bar is not the only tool in XXE which can help the user recognize attributes and elements which pose problems to the XML editor. The window opened by command | also displays very useful information.
Example 2. Not specific to RELAX NG
RELAX NG schema, name.rnc:
start = names-element

names-element = element names {
  name-element+
}

name-element = element name {
  element fullName { text }
  | (element firstName { text } & element lastName { text })
}
Document conforming to the above schema, name.xml:
<names>
  <name><fullName>John Smith</fullName></name>
  <name><firstName>John</firstName><lastName>Smith</lastName></name>
  <name><lastName>Smith</lastName><firstName>John</firstName></name>
</names>
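The content model of name in name.rnc has two alternatives, and the second one uses the interleave operator (&), which accepts firstName and lastName in either order. A minimal sketch of that check, using a hypothetical helper `name_alternative` that only looks at the names of the child elements:

```python
from typing import Optional


def name_alternative(children: list[str]) -> Optional[str]:
    """Return the alternative of the name content model that the children match."""
    # First alternative: a single fullName child.
    if children == ["fullName"]:
        return "fullName"
    # Second alternative, interleave (&): each element exactly once, any order.
    if sorted(children) == ["firstName", "lastName"]:
        return "firstName & lastName"
    return None


print(name_alternative(["fullName"]))               # fullName
print(name_alternative(["firstName", "lastName"]))  # firstName & lastName
print(name_alternative(["lastName", "firstName"]))  # firstName & lastName
print(name_alternative(["firstName"]))              # None
```

The last call shows why moving from one alternative to the other is delicate: a name containing only a firstName matches neither alternative, so the intermediate state is invalid.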
XXE allows replacing the firstName, lastName pair by a fullName. Simply select both child elements and use command | . But it is impossible to replace a fullName by a firstName, lastName pair in the same way.
The only way to do this is to select the fullName to be replaced and then to use command | . This will force XXE to enter the lenient editing mode. Remember that in this mode, the user is allowed to add any child elements he/she wants, including a firstName, lastName pair[1].
Note that the above example is not specific to RELAX NG. It is possible to model this kind of content with a DTD or a W3C XML Schema.
The example below is very similar but can only be expressed using a RELAX NG schema. This is because, unlike a DTD or a W3C XML Schema, a RELAX NG schema can specify the places within an element where text nodes may occur.
Example 3. Specific to RELAX NG
RELAX NG schema, name2.rnc:
start = names-element

names-element = element names {
  name-element+
}

name-element = element name {
  text
  | (element firstName { text } & element lastName { text })
}
Document conforming to the above schema, name2.xml:
<names>
  <name>John Smith</name>
  <name><firstName>John</firstName><lastName>Smith</lastName></name>
  <name><lastName>Smith</lastName><firstName>John</firstName></name>
</names>
The situation is worse with the name2.rnc example than with the name.rnc example. Deleting a text node is always allowed, and this includes the text node containing "John Smith". That is, there is no way to force XXE to enter its lenient mode in order to be able to replace text node "John Smith" by a firstName, lastName pair.
In such a case, using named element templates is the only way to cope with the content model. Simply specify two named element templates for element name, one containing a text node with a placeholder string and the other containing a firstName, lastName pair.
[1] The right approach here is to define two named element templates for element name, one containing a fullName child element and the other containing a firstName, lastName pair.