Text Encoding Initiative 22 Tables, Formulae, and Graphics
Contents
Many documents, both historical and contemporary, include not only text but also graphics, artwork, and other images. Although such images could be represented directly with markup, SGML and XML are not primarily designed for that purpose, and it is standard practice to include such information by declaring each image as an external entity encoded in a suitable graphical notation, and then referring to it from within the document.
In addition to graphic images, documents often contain material presented in graphical or tabular format. In such materials, details of layout and presentation may also be of comparatively greater significance or complexity than they are for running text. Indeed, it may often be difficult to make a clear distinction between details relating purely to the rendition of information and those relating to the information itself.
Finally, documents may contain mathematical formulæ or expressions in other formulaic notations, for which no notation is defined in these Guidelines.
These areas (graphics, tabular material, and mathematical or other formulæ) have in common that they have received considerable attention from many other standards bodies or similar professional groups. In part because of this, they may frequently be most conveniently encoded and processed using some notation not defined by these Guidelines. For these reasons, and others, we consider tables, formulæ, and graphics together in this chapter.
As with text markup in general, many incompatible formats have been proposed for the representation of graphics, formulae and tables in electronic form. Unfortunately, no single format as effective as SGML or XML in the domain of text has yet emerged for their interchange, to some extent because of the difficulty of representing the information these data formats convey independently of the way it is rendered.
The additional tag set defined by this chapter defines special purpose `container' elements that can be used to encapsulate occurrences of such data within a TEI-conformant document in a portable way. Specific recommendations for the encoding of tables are provided in section 22.1 Tables and recommendations for mathematical or other formulæ in section 22.2 Formulae. Specific recommendations for the encoding of graphic figures may be found in section 22.3 Specific Elements for Graphic Images. The rest of the chapter is devoted to general problems of encoding graphic information.
There is at the time of writing no consensus on formats for graphical images, and such formats vary in many ways. We therefore provide (in section 22.4 Overview of Basic Graphics Concepts) a brief discussion of the ways in which images may be represented, and (in section 22.5 Graphic Image Formats) a list of formal names for those representations most popular at this time. Each one includes a very brief description and (where known) a reference to the formal specification of the notation. These Guidelines recommend a few particular representations as being the most widely supported and understood.
To enable the additional tag set defined by this chapter, the parameter entity TEI.figures must be defined with the value INCLUDE, as shown in the example below:
<!DOCTYPE tei.2 PUBLIC "-//TEI P3//DTD Main Document Type//EN" "tei2.dtd" [ <!ENTITY % TEI.figures 'INCLUDE' > ]>
With this declaration in effect, the TEI elements and attributes described in the following sections are all available. If any of the specialized notations described in sections 22.2 Formulae and [space here] 22.3 Specific Elements for Graphic Images are used, then an additional notation declaration must also be included in the document type declaration subset, as further illustrated below.
The overall structure of the tag set defined in this chapter is as follows:
<!-- [DFTFF] 22.: Tables, Formulae, Figures --> [definitions from 22.1.1: Tables inserted here ] [definitions from 22.2: Formulae inserted here ] [definitions from 22.3: Figures inserted here ] <!-- end of [DFTFF] 22.-->
A table is the least `graphic' of the elements discussed in this chapter. Almost any text structure can be presented as a series of rows and columns: one might, for example, choose to show a glossary or other form of list in tabular form, without necessarily regarding it as a table. In such cases, the global rend attribute is an appropriate way of indicating that some element is being presented in tabular format. When tabular presentation is regarded as of less intrinsic importance, it is correspondingly simpler to encode descriptive or functional information about the contents of the table, for example to identify one column as containing names and another as containing dates, though the two methods may be combined.
When, however, particular elements are required to encode the tabular arrangement itself, then one or other of the various `table DTDs' now available may be preferable. The table DTDs in common use generally view a table as a special text element, made up of row elements (or, sometimes, column elements), themselves composed of cells. Table cells generally appear in row-major order, with the first row from left to right, then the second row, and so on. Details of appearance such as column widths, border lines, and alignment are generally encoded by numerous attributes. Beyond this, however, such DTDs differ greatly. This section begins by describing a table DTD of this kind; a brief summary of some other widely available table DTDs is also provided in section 22.1.2 Other Table DTDs.
For encoding tables of low to moderate complexity, these Guidelines provide the following special purpose elements:
rows | indicates the number of rows in the table. |
cols | indicates the number of columns in each row of the table. |
role | indicates the kind of information held in the cells of this row. |
role | indicates the kind of information held in the cell. |
rows | indicates the number of rows occupied by this cell. |
cols | indicates the number of columns occupied by this cell. |
The table element is defined as a member of the class inter; it may therefore appear both within other components (such as paragraphs), or between them, provided that the additional tag set defined in this chapter has been enabled, as described at the beginning of this chapter.
It is to a large extent arbitrary whether a table should be regarded as a series of rows or as a series of columns. For compatibility with currently available systems, however, these Guidelines require a row-by-row description of a table. It is also possible to describe a table simply as a series of cells; this may be useful for tabular material which is not presented as a simple matrix.
The attributes rows and cols may be used to indicate the size of a table, or to indicate that a particular cell of a table spans more than one row or column. For both tables and cells, rows and columns are always given in top-to-bottom, left-to-right order. These Guidelines do not require that the size of a table be specified; for most formatting and many other applications, it will be necessary to process the whole table in two passes in any case.
Where cells span more than one column or row, the encoder must determine whether this is a purely presentational effect (in which case the rend attribute may be more appropriate), whether the part of the table affected would be better treated as a nested table, or whether to use the spanning attributes listed above.
The role attribute may be used to categorize a single cell, or set a default for all the cells in a given row. The present Guidelines distinguish the roles of label and data only, but the encoder may define other roles, such as derived , numeric , etc., as appropriate.
The following simple example demonstrates how the data presented as a labelled list in section 6.7 Lists might be represented by an encoder wishing to preserve its original appearance as a table:
<table rend="boxed" rows="2" cols="2"> <head rend="it">Report of the conduct and progress of Ernest Pontifex. Upper Vth form — half term ending Midsummer 1851</head> <row role="data"> <cell role="label" rows="1" cols="1">Classics</cell> <cell role="data" rows="1" cols="1">Idle listless and unimproving</cell> </row> <row role="data"> <cell role="label" rows="1" cols="1">Mathematics</cell> <cell role="data" rows="1" cols="1">ditto</cell> </row> <row role="data"> <cell role="label" rows="1" cols="1">Divinity</cell> <cell role="data" rows="1" cols="1">ditto</cell> </row> <row role="data"> <cell role="label" rows="1" cols="1">Conduct in house</cell> <cell role="data" rows="1" cols="1">Orderly</cell> </row> <row role="data"> <cell role="label" rows="1" cols="1">General conduct</cell> <cell role="data" rows="1" cols="1">Not satisfactory, on account of his great unpunctuality and inattention to duties</cell> </row> </table>
Note that this encoding makes no attempt to represent the full significance of the ditto cells above; these might be regarded as simple links between the cells containing them and that to which they refer, or as virtual copies of it. For ways of representing either interpretation, see chapter 14 Linking, Segmentation, and Alignment.
The following example demonstrates how a simple statistical table may be represented using this scheme:
<table rows="4" cols="4"> <head>Poor Man's Lodgings in Norfolk (Mayhew, 1843)</head> <row role="label"> <cell role="data" rows="1" cols="1"> </cell> <cell role="data" rows="1" cols="1">Dossing Cribs or Lodging Houses </cell> <cell role="data" rows="1" cols="1">Beds </cell> <cell role="data" rows="1" cols="1">Needys or Nightly Lodgers</cell> </row> <row role="data"> <cell role="label" rows="1" cols="1">Bury St Edmund's </cell> <cell role="data" rows="1" cols="1">5</cell> <cell role="data" rows="1" cols="1">8</cell> <cell role="data" rows="1" cols="1">128</cell> </row> <row role="data"> <cell role="label" rows="1" cols="1">Thetford </cell> <cell role="data" rows="1" cols="1">3</cell> <cell role="data" rows="1" cols="1">6</cell> <cell role="data" rows="1" cols="1">36</cell> </row> <row role="data"> <cell role="label" rows="1" cols="1">Attleboro' </cell> <cell role="data" rows="1" cols="1">3</cell> <cell role="data" rows="1" cols="1">5</cell> <cell role="data" rows="1" cols="1">20</cell> </row> <row role="data"> <cell role="label" rows="1" cols="1">Wymondham </cell> <cell role="data" rows="1" cols="1">1</cell> <cell role="data" rows="1" cols="1">11</cell> <cell role="data" rows="1" cols="1">22<!-- ... --> </cell> </row> </table>
Note the use of a blank cell in the first row to ensure that the column labels are correctly aligned with the data. Again, this encoding does not explicitly represent the alignment between column and row labels and the data to which they apply. Where the primary emphasis of an encoding is on the semantic content of a table, a more explicit mechanism for the representation of structured information such as that provided by the feature structure mechanism described in chapter 16 Feature Structures may be preferred. Alternatively, the general purpose linkage and alignment mechanisms described in chapter 14 Linking, Segmentation, and Alignment may also be applied to individual cells of a table.
The content of a table cell need not be simply character data. It may also contain any sequence of the phrase level elements described in chapter 6 Elements Available in All TEI Documents, thus allowing for the encoding of potentially more useful semantic information, as in the following example, where the fact that one cell contains a number and the other contains a place name has been explicitly recorded:
<table> <head>US State populations, 1990</head> <row role="data"> <cell role="data" rows="1" cols="1"> <name>Wyoming</name> </cell> <cell role="data" rows="1" cols="1"> <num>453,588</num> </cell> </row> <row role="data"> <cell role="data" rows="1" cols="1"> <name>Alaska</name> </cell> <cell role="data" rows="1" cols="1"> <num>550,043</num> </cell> </row> <row role="data"> <cell role="data" rows="1" cols="1"> <name>Vermont</name> </cell> <cell role="data" rows="1" cols="1"> <num>562,758</num> </cell> </row> <row role="data"> <cell role="data" rows="1" cols="1"> <name>District of Columbia</name> </cell> <cell role="data" rows="1" cols="1"> <num>606,900</num> </cell> </row> <row role="data"> <cell role="data" rows="1" cols="1"> <name>North Dakota</name> </cell> <cell role="data" rows="1" cols="1"> <num>638,800</num> </cell> </row> <row role="data"> <cell role="data" rows="1" cols="1"> <name>Delaware</name> </cell> <cell role="data" rows="1" cols="1"> <num>666,168</num> </cell> </row> <row role="data"> <cell role="data" rows="1" cols="1"> <name>South Dakota</name> </cell> <cell role="data" rows="1" cols="1"> <num>696,004</num> </cell> </row> <row role="data"> <cell role="data" rows="1" cols="1"> <name>Montana</name> </cell> <cell role="data" rows="1" cols="1"> <num>799,065</num> </cell> </row> <row role="data"> <cell role="data" rows="1" cols="1"> <name>Rhode Island</name> </cell> <cell role="data" rows="1" cols="1"> <num>1,003,464</num> </cell> </row><!-- ... --> </table>
In syntactic terms this is little more than a name-change; however, the new names are more useful in that they convey something about the nature and significance of the information, rather than merely suggesting how to display it in rows and columns.
The TEI table elements are defined as follows:
<!-- [DFTTAB] 22.1.1: Tables --> <!ELEMENT %n.table; %om.RR; ((head | %m.Incl;)*, (row, (%m.Incl;)*)+)> <!ATTLIST %n.table; %a.global; rows CDATA #IMPLIED cols CDATA #IMPLIED TEIform CDATA 'table' > <!ELEMENT %n.row; %om.RO; ((cell|table), (%m.Incl;)*)+> <!ATTLIST %n.row; %a.global; role CDATA "data" TEIform CDATA 'row' > <!ELEMENT %n.cell; %om.RO; %paraContent;> <!ATTLIST %n.cell; %a.global; role CDATA "data" rows CDATA "1" cols CDATA "1" TEIform CDATA 'cell' > <!-- end of [DFTTAB] 22.1.1-->
Many authoring systems now include built-in support for their own or for public table DTDs. These often provide an enhanced user interface and good formatting capabilities, but are often product-specific, despite their use of a standard markup language such as SGML or XML.
The DTD developed by the Association of American Publishers (AAP) and standardized in ANSI Z39.59 provided a very simple encoding for correspondingly simple tables. This has been further developed, together with the table DTD documented in ISO Technical Report 9537, and now forms part of ISO 12083. The TEI DTD fragment described above has functionality very similar to that defined by ISO 12083.
For more complex tables, the most effective publically-available DTD is probably that developed by the US Department of Defense CALS project. This supports vertical and horizontal spanning and various kinds of text rotation and justification within cells and is also directly supported by a number of existing SGML software systems.
The CALS table DTD is much too complex to describe fully here; information on it can be obtained, among other places, from the Graphic Communications Association in Alexandria, Virginia. The formal name of the CALS SGML requirements is MIL-M-28001A. Tables conforming to the CALS DTD may be incorporated into documents conforming to these Guidelines, but this may require substantial modification of the TEI DTD which should not be undertaken without expert advice.
Comment. The following is new material.
Rationale. Recent developments related to CALS.
The apparent complexity of the CALS table DTD has led to simplification efforts, most notably by OASIS in form of OASIS Technical Resolution 9503:1995, Exchange Table Model Document Type Definition. An an XML version of this is defined in OASIS Technical Memorandum TR 9901:1999, XML Exchange Table Model DTD.
NOTE 124: The HTML table model has become very popular since the original publication of these Guidelines, and will likely be added to the list of recommended table models for TEI use. It's primary advantages are widespread support, and the feature of automatically-sized columns.
Comment. The following is new material.
Rationale. Response to NOTE. 124.
With the ascent of the Web, the HTML table model became popular and widespread in use. The last few years have seen a growing access to information via devices of varying capabilities ranging from desktop computers to handheld mobile phones. In response to this demand, HTML has been both reformulated into XML and modularized into a family of modules (see Modularization of XHTML) with semantically related elements. Modularization of XHTML includes two Table Modules: the Basic Tables Module and the Tables Module. Tables Module mimics the functionality of HTML 4 tables and is a part of XHTML 1.1 document type. Basic Tables Module is a "diluted" version of the Tables Module, retaining only the minimal functionality with the purpose of extending XHTML's reach onto resource-constrained devices. It is a part of XHTML Basic document type.
The semantics of the XHTML table model is based on the HTML table model. It allows arrangement of data into rows and columns of cells. Table rows and columns may be grouped to convey additional structural information and may be rendered by user agents in ways that emphasize this structure. Support for incremental rendering of tables and for rendering on non-visual user agents is also available. Special elements and attributes are provided to associate metadata with tables. They indicate the table's purpose, or are for the benefit of people using speech or Braille-based user agents. It is suggested that tables should not be used purely as a means to layout document content as it leads to various accessibility problems. To control layout style sheets should be used instead.
Comment. Change the title of 22.2 to
Mathematical Objects or Mathematical Expressions.
Rationale. Not all mathematical expressions are formulæ. Formulæ are a
"subset" of the possible mathematical expressions.
Mathematical and chemical formulæ pose similar problems to those posed by tables in that rendition may be of great significance and hard to disentangle from content. They also require access to a wide range of special characters, for most of which standard entity names already exist in the documented ISO entity sets (see further chapters 4 Characters and Character Sets and 25 Writing System Declaration).
Formulæ and tables are also similar in that well-researched and detailed DTD fragments have already been developed for them independently of the TEI. They differ in that (for mathematics at least) there also exists a richly detailed text-based but non-SGML notation which is very widely used: this is the TeX system, and the sets of descriptive macros developed for it such as LaTeX and AMS-Tex.
Comment. Modify "... such as LaTeX and
AMS-Tex to "... such as LaTeX, AMS-TeX and AMS-LaTeX."
Rationale. Correction in spelling (well, sort of; other possibilities are
TEX or TEX) and addition of other macros.
The AAP and ISO standards mentioned in section 22.1 Tables above both provide DTDs for equations as well as for tables, which at the time of writing are being updated and unified to form part of ISO 12083. There is also also a third and more general-purpose ongoing mathematical DTD effort known as EuroMath.
NOTE 125: Reference needed! FIX THIS NOTE BEFORE PUBLICATION. FIX THIS NOTE BEFORE PUBLICATION. FIX THIS NOTE BEFORE PUBLICATION. FIX THIS NOTE BEFORE PUBLICATION. FIX THIS NOTE BEFORE PUBLICATION. FIX THIS NOTE BEFORE PUBLICATION. FIX THIS NOTE BEFORE PUBLICATION. FIX THIS NOTE BEFORE PUBLICATION.
Comment. The following
is new (albeit only cosmetic) material.
Rationale. Response to NOTE. 125. (See later on the question of viability
of Euromath.)
Euromath is a project of the European Mathematical Trust to enhance research support for European mathematicians and to stimulate interchange among them by creating a research environment based on modern information technology. The project aims to produce both software and services.
As with tables, in all the SGML and XML solutions a tension exists between the need to encode the way a formula is written (its appearance) and the need to represent its semantics. If the object of the encoding is purely to act as an interchange format among different formatting programs, then there is no need to represent the mathematical meaning of an expression. If however the object is to use the encoding as input to an algebraic manipulation system (such as Mathematica or Maple) or a database system, clearly simply representing superscripts and subscripts will be inadequate.
The present Guidelines make no attempt to add to the number of available DTDs for representing formulæ. Instead, we recommend that the user make an informed choice from those already available. The additional tag set described in this chapter makes available only the following element, which should be used to encode any formula, no matter what notation is employed:
notation | supplies the name of a previously defined notation used for the content of the element. |
The legal content of a formula is determined by two factors. The parameter entity formulaContent supplies a content model for the element; while the notation attribute specifies what formal notation is employed by the element. The default value of the formulaContent entity is CDATA, but may be redefined in the document's DTD subset, for example as follows:
<!ENTITY % formulaContent "(#PCDATA)">
With this declaration in force, formulæ may contain parsed character data, (and may thus use entity references for special symbols and the like). An alternative might be to embed the elements defined by some other tag set, such as that of ISO 12083. 126
Comment. The following is new material.
Rationale. Response to NOTE. 126. [126. In this
case additional redefinitions may also be needed to avoid name clashes with existing TEI
elements. For further details see chapter 29 Modifying the TEI DTD.]
Comment. How viable is to (still) discuss ISO
12083 or EuroMath?
Rationale. Most (if not all) functionality of these initiatives can now
be found in OpenMath and MathML.
Secondly, when it is necessary to inform a processor that the content of a formula should not be treated as ordinary data, because it uses a different notation, a notation must be specified. Each notation used by a document must be declared in its DTD subset, as in the following example:
<!DOCTYPE tei.2 [ <!ENTITY % TEI.graphics "INCLUDE"> <!NOTATION TeX PUBLIC '-//Donald E. Knuth 1983//NOTATION TeX Mathematical Markup//EN'> <!ENTITY % formulaContent "(#PCDATA)" > ]>
With these declarations in force, a document may include formulæ expressed using standard TeX conventions, as in the following example:
<p>Achilles runs ten times faster than the tortoise and gives the animal a headstart of ten meters. Achilles runs those ten meters, the tortoise one; Achilles runs that meter, the tortoise runs a decimeter; Achilles runs that decimeter, the tortoise runs a centimeter; Achilles runs that centimeter, the tortoise, a millimeter; Fleet-footed Achilles, the millimeter, the tortoise, a tenth of a millimeter, and so on to infinity, without the tortoise ever being overtaken. . . Such is the customary version.<!-- ... --> The problem does not change, as you can see; but I would like to know the name of the poet who provided it with a hero and a tortoise. To those magical competitors and to the series <formula notation="TeX">$$ {1 \over 10} + {1 \over 100} + {1 \over 1000} + {1 \over 1000} + {1 \over 10,\!000} + \dots $$</formula> the argument owes its fame.</p>
The notation attribute supplies the name of a defined notation ( LaTeX ), which is associated by its declaration in the DTD subset with an external public entity. How that declaration is resolved will depend on the kind of processor in use, and is outside the scope of these Guidelines. The declaration for the formulaContent parameter entity in the DTD will determine how the contents of any formula elements will be parsed before they are handed to whatever procedure is intended to handle the external notation. When as here it is declared with a value of (#PCDATA), then entity references in the content will be resolved. This may be useful if the external mechanism uses characters which would otherwise be regarded as significant to the parser: for example, the less-than or greater-than signs. If a less-than sign or another delimiter occurs as content in a syntactically impermissible, it must be escape, such as by substituting an entity reference. For example, in this expression:
<formula notation='tex'> a<b </formula>
the less-than sign would not be correctly interpreted as a data character (in XML this is true in general; in SGML, only in certain context such as before a letter). Instead, represent the less-than character by an entity reference:
<formula notation="tex"> a<b </formula>
Comments. Use single quotes (in 'tex') above.
Also, "a<b" should be "a < b" .
Rationale. In most (all?), places in the TEI Specification, attribute
values are delimited by single quotes. A character entity reference should be placed (as
the above paragraphs suggests).
Of course, if the notation permits it (as TeX does), a simpler solution to this specific problem is to insert a space after the less-than sign:
<formula notation="tex"> a < b </formula>
Comments. Use single quotes (in 'tex') above.
Rationale. In most (all?), places in the TEI Specification, attribute
values are delimited by single quotes.
Entity references may only appear in the content of a formula element when the entity formulaContent has been redefined as above. In prior editions of these Guidelines, formulaContent was defined as CDATA, which in SGML means that the only parsing carried out is to search for an end-tag; the character sequence </ may not be followed by a > or a letter (or, with the CONCUR feature on, an open parenthesis) anywhere within a formula. If such a sequence does appear as data, or if an SGML notation is used, then CDATA does not suffice, and the parameter entity must be redefined appropriately.
Because XML does not include the CDATA element type (because it is one of the very few features of SGML that, if used, makes correct parsing in the absence of a DTD intractable), the present edition of these Guidelines instead defines the content of formulaContent as (#PCDATA).
The following SGML- and XML- based notations for encoding formulæ are recommended by these Guidelines:
<!NOTATION iso12083 PUBLIC 'ISO 12083:1993//DTD Formulae//EN'> <!NOTATION euromath PUBLIC '-//Euromath 1992//DTD Euromath equations//EN'>
NOTE 127: Particularly when using XML, the editors also recommend MathML: Ion, Patrick, Robert Miner (eds). Buswell, Stephen, Stan Devitt, Angel Diaz, Patrick Ion, Robert Miner, Nico Poppelier, Bruce Smith, Neil Soiffer, Robert Sutor, Stephen Watt (principal writers). Mathematical Markup Language (MathML) 1.01 Specification. W3C Recommendation, revision of 7 July 1999. http://www.w3.org/TR/REC-MathML
Comment. The following
is new material.
Rationale. Response to NOTE. 127.
Comments. Perhaps two new sub-sections (one for
each, MathML and OpenMath) are warranted. It is likely that they will, if they are
included at all, be moved to another place than where they are place at present.
Rationale. Significance of both MathML and OpenMath, and (although not a
very convincing argument) the fact that 22.3 has sub-sections.
MathML
Mathematical Markup Language (MathML) is a vocabulary for describing mathematical notation, capturing both its structure and content. MathML 1.0 became a W3C Recommendation on April 7 1998, becoming the first vocabulary based on XML syntax to reach that status. MathML 1.0 was revised and MathML 1.01 was announced as a W3C Recommendation on July 7, 1999.
MathML provides two types of markups: Presentation Markup, which captures the notational structure of an expression and could be seen as the "TeX for the Web" and Content Markup, which captures the mathematical structure of an expression. The majority of content elements correspond to a wide variety of operators, relations and named functions from topics typically found at the high school level mathematics.
MathML 2.0 became a W3C Recommendation on February 21, 2001. It added new elements for MathML Content Markup, thereby further strengthening the support for a "Semantic Math-Web", XML namespaces support, and provided an extension for W3C XML DOM along with OMG IDL implementation, ECMAScript and Java bindings. It also provided a version of MathML DTD compliant with Modularization of XHTML so that MathML fragments could be "embedded" in XHTML 1.1 documents (as data islands) and could be validated.
Suggested notation declaration:
<!NOTATION mathml2 PUBLIC '-//W3C//DTD MathML 2.0//EN'>
OpenMath
OpenMath is a project whose activities are coordinated by the OpenMath Society and is funded by the European Commission under the Esprit Multimedia Standards Initiative that commenced in September 1997. OpenMath is an emerging standard for communicating semantically-rich representations of mathematical objects both on and off the Web in a platform-independent manner.
The OpenMath Standard consists of specifications of (1) Structure of formulas (OM objects); (2) Semantical context (using Content Dictionaries); (3) Binary and XML encodings; (4) Formal standardization (ISO, other).
OpenMath and MathML have certain common aspects. They both use prefix operators, both are XML based and they both construct their objects by applying certain rules recursively. Such similarities facilitate mapping across both standards. There are also some key differences between MathML and OpenMath. OpenMath does not provide support for presentation of mathematical objects and its scope of "semantically-oriented" elements is much broader that of MathML, with the expressive power to cover virtually all areas of computational mathematics. In fact, a particular set of Content Dictionaries, the "MathML CD Group", covers the same areas of mathematics as the Content Markup elements of MathML 2.0.
Suggested notation declaration:
<!NOTATION openmath PUBLIC '-//OpenMath 1999//DTD for OpenMath Objects//EN'>
OMDoc is an extension of the OpenMath standard that supplies markup for structures such as axioms, theorems, proofs, definitions, texts (mixing formal content with mathematical text).
In-line versus block placement for an equation can be distinguished if desired, via the global rend attribute. The global n and id attributes may also be used to label or identify the formula, as in the following (imaginary) example:
<p>The volume of a sphere is given by the formula: <formula id="f12" n="12" rend="inline" notation="tex"> $$V = {4\over 3} \pi r^3$$ </formula> which is readily calculated.</p> <p>As we have seen in equation <ptr target="f12"/>, .... </p>
The formulaContent and formulaNotations parameter entities are defined as follows:
<!-- [DFTPE] 22.2: Formula Content --> <!ENTITY % formulaNotations 'CDATA' > <!ENTITY % formulaContent '(#PCDATA)' > <!-- end of [DFTPE] 22.2-->
The formula element itself is defined as follows:
<!-- [DFTFOR] 22.2: Formulae --> <!ELEMENT %n.formula; %om.RR; %formulaContent;> <!ATTLIST %n.formula; %a.global; notation %formulaNotations; #REQUIRED TEIform CDATA 'formula' > <!-- end of [DFTFOR] 22.2-->
Comment. Interchange 22.3 and 22.4. That is, 22.3 becomes "Overview of Basic Graphics Concepts" and 22.4 becomes "Specific Elements for Graphic Images".
Rationale. 22.3 uses some of the graphical formats in examples. The overview of 22.4 would be more useful if presented earlier.
The following special purpose elements are provided by this tag set to indicate the presence of graphic images within a document:
entity | names the external entity within which the graphic image of the figure is stored. |
Inclusion of a graphic image in a marked-up document typically requires three distinct steps:
In the TEI scheme, these three functions are carried out as follows.
Declarations for all notations used by a document must be provided within the DTD subset, as described above in section 22.2 Formulae. Many such notations are in common use; for details see section 22.5 Graphic Image Formats.
Entity declarations for the entities containing the graphics themselves must be made, using system or public identiifers, within the document's DTD subset, either directly or by including them within a suitable file, as in the example below.
<!DOCTYPE tei.2 PUBLIC "-//TEI P3//DTD Main Document Type//EN" "tei2.dtd" [ <!ENTITY % TEI.prose "INCLUDE"> <!ENTITY % TEI.graphics "INCLUDE"> <!-- Graphics notations used in this document ... --> <!NOTATION tiff PUBLIC '-//TEI//NOTATION Aldus Tagged Image File Format//EN' > <!NOTATION cgm PUBLIC 'ISO 8632-4:1987//NOTATION Clear text encoding//EN' > <!NOTATION bmp PUBLIC '-//TEI//NOTATION BMP: Microsoft Bitmapped graphic format//EN' >
<!-- The file 'figures.ent' contains entity declarations --> <!-- for all external entities needed by this document --> <!ENTITY % myFigures "figures.ent"> %myFigures; ]>
Comment. Modify the entries in the above example, from:
<!NOTATION tiff PUBLIC '-//TEI//NOTATION Aldus Tagged Image File Format//EN' > <!NOTATION cgm PUBLIC 'ISO 8632-4:1987//NOTATION Clear text encoding//EN' > <!NOTATION bmp PUBLIC '-//TEI//NOTATION BMP: Microsoft Bitmapped graphic format//EN' >
to
<!NOTATION svg PUBLIC '-//TEI//NOTATION W3C Scalable Vector Graphics Format//EN' > <!NOTATION png PUBLIC '-//TEI//NOTATION IETF RFC2083 Portable Network Graphics//EN' > <!NOTATION jpeg PUBLIC 'ISO DIS 10918//NOTATION JPEG Graphics Format//EN' >
Rationale. Include one example from each of the sections 22.5.1 - 22.5.3. SVG is an upcoming vector graphics format (with scope and user-base significantly larger than CGM), moving towards W3C Recommendation. PNG is vendor neutral and equipped with a more powerful compression scheme.
The file figures.ent will contain a series of declarations like the following:
<!ENTITY Fig1 system "fig1.cgm" NDATA cgm> <!ENTITY Fig1th system "fig1.bmp" NDATA bmp> <!ENTITY pullman system "pullman.tif" NDATA tiff>
the effect of which is to associate the name Fig1 with the external entity fig1.cgm, and also to declare that that entity uses the notation called CGM , which is declared in the DTD subset. In the same way, the external entity fig1.bmp is defined as using the BMP notation, and may be referenced by the name Fig1th (see further below).
Finally, the figure element is used to indicate the location of the graphic image in the text. For example:
<figure entity='Fig1'></figure>
Note that an end-tag is always required for this element. Three kinds of content may be supplied: the element head may be used to transcribe (or supply) a descriptive heading or title for the graphic itself as in this example:
<figure entity='Fig1'> <head>Figure One: The View from the Bridge</head></figure>
Figures are often accompanied not only by a title or heading, but by a paragraph or so of commentary or caption. One or more p elements following the head may be used to transcribe any caption or discussion of the figure in the source:
<figure entity='pullman'> <head>Above:</head> <p>The drawing room of the Pullman house, the white and gold saloon where the magnate delighted in giving receptions for several hundred people.</p> <figDesc>The figure shows an elaborately decorated room, at least twenty-five feet side to side and fifty feet long, with ornate mouldings and Corinthian columns on the walls, overstuffed armchairs and loveseats arranged in several conversational groupings, and two large chandeliers.</figDesc> </figure>
Here, the paragraph The drawing room ... several hundred people is transcribed from the source, while the description is provided by the encoder, for use by applications which cannot display the graphic directly. In documents created in electronic form with the needs of print-handicapped readers in mind, the figDesc element may be provided by the author rather than a subsequent encoder.
<figure entity='Fig1'> <head>Figure One: The View from the Bridge</head> <figDesc>A Whistleresque view showing four or five sailing boats in the foreground, and a series of buoys strung out between them.</figDesc> </figure>
Where the graphic itself contains large amounts of text, perhaps with a complex structure, and perhaps difficult to distinguish from the graphic, the encoder should choose whether to regard the graphic as containing the text (in which case, a nested text element may be included within the figure element) or to regard the enclosed text as being a separate division of the text element in which the graphic appears. In this latter case, an appropriate divn class element may be used for the text represented within the graphic, and the figure element embedded within it. The choice will depend to a large degree on the encoder's understanding of the relationship between the graphic and the surrounding text.
Like any other element in the TEI scheme, figures may be given identifiers so that they can be aligned with other elements, and linked to or from them, as described in chapter 14 Linking, Segmentation, and Alignment. Some common examples are discussed briefly here; full information is provided in that chapter.
It is often desirable to maintain two versions of an image in an electronic file: one a low resolution or `thumbnail' version which, when selected by the user, causes the other, high resolution, version to be accessed. In TEI terms, the thumbnail image acts as a reference to the other. Referring to the example above, we will assume that the entity Fig1th contains a thumbnail version of the full Fig1 entity. We can now embed a reference to that image using the simple ref element discussed in section 6.6 Simple Links and Cross References:
<ref target='IM1'>Click here <figure entity='fig1th'></figure> for enlightenment </ref> <!-- ... --> <!-- elsewhere in the document --> <figure id='IM1' entity='fig1'></figure> <!-- other figures here -->
Another common requirement is to associate part or the whole of an image with a textual element not necessarily contiguous to it in the text; this is sometimes known as a callout. Again, chapter14 Linking, Segmentation, and Alignment should be consulted for the full details of the mechanisms available for this purpose. This example assumes that we wish to associate one portion of the image held as fig1 with chapter two of some text, and another portion of it with chapter three. The application may be thought of as a hypertext browser in which the user selects from a graphic image which part of a text to read next, but the mechanism is independent of this particular application.
The first requirement is some way of identifying and hence pointing to sub-parts of a graphic image. This is most easily done using the extended pointer syntax discussed in section 14.2 Extended Pointers: thus
<xptr id='PD1' doc='Fig1' from='space (0 0 9 9)'> <xptr id='PD2' doc='Fig1' from='space (40 90 59 119)'>
These xptr elements identify two areas within the image Fig1 using the TEI extended pointer syntax. The first (with identifier pd1 ) is a square of size 10 by 10, tangent to the origin. The second (with identifier pd2 ) is a rectangle of size 20 by 30, starting at the point with co-ordinates (40,90) in the co-ordinate system used by this document.
The next requirement is some way of identifying the parts of the document to which a link is to be made. The most obvious way of doing this is to use the global id attribute:
<div1 type='chapter' id='C1'> <!-- text of chapter one here --> <div1 type='chapter' id='C2'> <!-- text of chapter two here -->
Now, all that is needed to linking these areas to the relevant chapters is a linkGrp element, as described in section 14.1 Pointers:
<linkGrp type='callout'> <link targets='C1 pd1'/> <link targets='C2 pd2'/> <!-- ... --> </linkGrp>
Further examples of this technique are provided in chapter 14 Linking, Segmentation, and Alignment.
The elements discussed in this section are defined as follows:
<!-- [DFTGRA] 22.3: Figures --> <!ELEMENT %n.figure; %om.RR; ((%m.Incl;)*, (head, (%m.Incl;)*)?, (p, (%m.Incl;)*)*, (figDesc, (%m.Incl;)*)?, (text, (%m.Incl;)*)?) > <!ATTLIST %n.figure; %a.global; entity ENTITY #IMPLIED TEIform CDATA 'figure' > <!ELEMENT %n.figDesc; %om.RR; %paraContent;> <!ATTLIST %n.figDesc; %a.global; TEIform CDATA 'figDesc' > <!-- end of [DFTGRA] 22.3-->
The first major distinction in graphic representation is that between raster graphics and vector graphics. A raster image is a list of points, or dots. Scanners, fax machines and other simple devices easily produce digital raster images, and such images are therefore quite common. A vector image, in contrast, is a list of geometrical objects, such as lines, circles, arcs, or even cubes. These are much more difficult to produce, and so are mainly encountered as the output of sophisticated systems such as architectural and engineering CAD programs.
Raster images are difficult to modify because by definition they only encode single points: a line, for example, cannot grow or shrink as such, since it is not identified as such. Only its component parts are identified, and only they can be manipulated. Therefore the resolution or dot-size of a raster image is important, which is not the case with vector images. It is also far more difficult to convert raster images to vector images than to perform the opposite conversion. Raster images generally require more storage space than vector images, and a wide variety of methods exists for compressing them; the variation in these methods leads to corresponding variations in representations for storage and transmission of raster images.
Motion video usually consists of a long series of raster images. Data compression is even more effective on video than on single raster images (mainly owing to redundancy which arises from the usual similarity of adjacent frames). Notations for representing full-motion video are hotly debated at this time, and any user of these Guidelines would do well to obtain up-to-date expert advice before undertaking a project using them.
The compression methods used with any of these image types may be `lossy' or `lossless'. Methods for lossy compression save space by discarding a small portion of the image's detail, such as fine distinctions of shading. When decompressed, therefore, such an image will be only a close approximation of the original. In contrast, lossless compression guarantees that the exact uncompressed image will be reproducible from the compressed form: only truly redundant information is removed. In general, therefore, lossless compression does not save quite so much space as lossy compression, though it does guarantee fidelity to the original uncompressed image.
Raster images may be characterized by their resolution, which is the number of dots per inch used to represent the image. Doubling the resolution will give a more precise image, but also quadruple the storage requirement (before compression), and affect processing time for any operations to be performed, such as displaying an image for a reader. Motion video also has resolution in time: the number of frames to be shown per second. Encoders should consider carefully what resolution(s) and frame rate(s) to use for particular applications; these Guidelines express no recommendation in this matter, save the universal ones of consistency and documentation.
Within any image, it is typical to refer to locations via Cartesian co-ordinate axes: values for x, y, and sometimes z and/or time. These Guidelines provide for this via the SPACE keyword of the extended pointing mechanism discussed in section 14.2 Extended Pointers. However, graphic notations vary in whether co-ordinates count from left-to-right and top-to-bottom, or another way. They also vary in whether co-ordinates are considered real (inches, millimeters, and so on), or virtual (dots). These Guidelines do not recommend any of these methods over another, but all decisions made should be applied consistently, and documented in the encodingDesc section of the TEI header.
NOTE 128: No special purpose element is provided for this purpose by the current version of the Guidelines. The information should be provided as one or more distinct paragraphs at the end of the encDecl element described in section 5.3 The Encoding Description.
The way in which the color of an image is rendered also varies greatly. In monochrome images every displayed point is either black or white. In gray-scale images, each point is rendered in some shade of gray, the number of shades varying from system to system. In true polychrome images, points are rendered in different colours, again with varying limitations affecting the number of distinct shades and the means by which they are displayed.
As noted above, there exists a bewildering variety of different graphics formats, and the following list is in no way exhaustive. Moreover, inclusion of any format in this list should not be taken as indicating endorsement by the TEI of this format or any products associated with it. With the exception of CGM, all the formats listed here are proprietary to a greater or lesser extent and cannot therefore be regarded as standards in any meaningful sense. They are however widely used by many different vendors.
Rationale. Modify the sentence "With the exception of CGM, all the formats listed here are proprietary ..." to "With the exception of CGM, PNG and SVG, all the formats listed here are proprietary ..."
Comment. JPEG is not proprietary, is it?
The following formats are widely used at the present time, and likely to remain supported by more than one vendor's software:
Comments. Rearrange the order of PNG and SVG
above. Strikethoughs represent the previous location; new locations are
represented in red. Add SMIL.
Rationale. PNG is misplaced. SVG has been moved to Vector Graphics
formats section. (See comments/rationale later.)
Comment. Should the above list be linked to the
description of appropriate formats?
Rationale. Easy access.
Brief descriptions of all the above are given below. Where possible, current addresses or other contact information are shown for the originator of each format. Many formal standards, especially those promulgated by ISO and many related national organizations (ANSI, DIN, BSI, and many more), are available from those national organizations. Addresses may be found in any standard organizational directory for the country in question.
For each format, a sample notation declaration is given, using a formal public identifier constructed from the best information available at the date of publication. It is recommended that such formal public identifiers always be used in the interchange of documents between sites. Unless otherwise noted, however, these formal public identifiers have been formulated by the TEI, and not by the owners of the notation, as is indicated by the use of `TEI' as the owner identifier. If more recent versions of these formal public identifiers, or versions promulgated by the owners of the notation, are available at the time of document interchange, they should be used in preference to those shown here.
Support for formal public identifiers varies somewhat among existing SGML and XML systems; for local processing, the notation declaration may therefore need to include a system identifier in addition to the formal public identifier. The documentation for the system in use should be consulted for details.
<!NOTATION cgm PUBLIC 'ISO 8632:1987//NOTATION Computer Graphics Metafile//EN'> <!NOTATION cgmchar PUBLIC 'ISO 8632-2:1987//NOTATION Character encoding//EN'> <!NOTATION cgmbin PUBLIC 'ISO 8632-3:1987//NOTATION Binary encoding//EN'> <!NOTATION cgmclear PUBLIC 'ISO 8632-4:1987//NOTATION Clear text encoding//EN'>
Comments. Add WebCGM to the list of graphical
formats or as a sub-section of CGM.
Rationale. Reasons to add: (1) WebCGM is an important vector graphics
format on its own. (Chris Lilley, W3C, has given some talks on the differences -- in goals
and intended audience -- between WebCGM and SVG.) (2) WebCGM is the evolution of CGM.
Reasons not to add: (1) This is a "specification" not a reference book on
computer graphics. (2) Need to consider whether such additions contribute to the verbosity
of the specification.
<!NOTATION WebCGM PUBLIC '-//TEI//NOTATION W3C WebCGM Format//EN' >
<!NOTATION pict PUBLIC '-//TEI//NOTATION PICT: Macintosh Drawing Format//EN'>
<!NOTATION SVG PUBLIC '-//TEI//NOTATION W3C Scalable Vector Graphics Format//EN' >
<!NOTATION tiff PUBLIC '-//TEI//NOTATION Aldus Tagged Image File Format//EN'>
<!NOTATION gif PUBLIC '-//TEI//NOTATION Compuserve Graphics Interchange Format//EN' >
<!NOTATION pbm PUBLIC '-//TEI//NOTATION Jeff Poskanzer, Portable Bit Map//EN' >
<!NOTATION pcx PUBLIC '-//TEI//NOTATION PCX: ZSoft IBM PC Raster Graphics Format//EN' >
<!NOTATION bmp PUBLIC '-//TEI//NOTATION BMP: Microsoft Bitmap Graphics Format//EN' >
<!NOTATION PNG PUBLIC '-//TEI//NOTATION IETF RFC2083 Portable Network Graphics//EN' >
<!NOTATION SVG PUBLIC '-//TEI//NOTATION W3C Scalable Vector Graphics Format//EN' >
Comment. Modify "It is defined by W3C
Candidate Recommendation, Scalable Vector Graphics (SVG) 1.0 Specification, 02 November
2000, available at http://www.w3.org/TR/SVG/." to "It is defined by the Scalable
Vector Graphics (SVG) 1.0 Specification, W3C Recommendation, 04 September 2001, and is
available at http://www.w3.org/TR/2001/REC-SVG-20010904/."
Rationale. Change in the status of SVG. The URL given was non-canonical.
<!NOTATION JPEG PUBLIC 'ISO DIS 10918//NOTATION JPEG Graphics Format//EN' >
Comment. Modify "JPEG: Joint Photographic
Expert Group" to "JPEG: Joint Photographic Experts Group".
Rationale. It is "Experts" and not "Expert".
<!NOTATION QuickTime PUBLIC '-//TEI//NOTATION Apple QuickTime Video Data Format//EN' >
<!NOTATION pcx PUBLIC '-//TEI//NOTATION Eastman Kodak Photo-CD Raster Graphics Format//EN' >
As noted above, the reader will encounter many, many other graphics formats. Other formats are not recommended for data interchange according to the TEI scheme at this time, but may be included in a TEI document without affecting its conformance in other respects, provided that a notation declaration is provided.
Comment. Add SMIL to the list. The
following is new material.
Rationale. SMIL 1.0 and its successor SMIL 2.0 are both W3C Recommendations.
<!NOTATION SMIL PUBLIC '-//TEI//NOTATION W3C Synchronized Multimedia Integration Language Format//EN' >
Commented Version By Pankaj Kamthan, kamthan@cs.concordia.ca.