If we want to use XML as one mechanism to exchange data between
applications and processes, we will have to be able to both parse it
and also generate it. Packages for parsing XML in R and S are
described elsewhere.
Here we discuss how we might generate XML output and the associated
tools needed to do this generically.
Basic Mechanism - object output.
We define a function, toXML(). This is a generic function; its
class-specific methods are responsible for returning a string
containing the appropriate XML output.
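The following is a minimal sketch of such a generic, assuming S3-style
dispatch. The element names (vector, value, factor, levels, level) and
the helper escape_xml() are illustrative assumptions and do not come
from any particular DTD or package.

```r
# Generic function: dispatches on the class of its first argument.
toXML <- function(x, ...) UseMethod("toXML")

# Escape the characters that are special in XML text content.
escape_xml <- function(s) {
  s <- gsub("&", "&amp;", s, fixed = TRUE)
  s <- gsub("<", "&lt;", s, fixed = TRUE)
  s <- gsub(">", "&gt;", s, fixed = TRUE)
  s
}

# Fallback method: one <value> element per entry of an atomic vector.
toXML.default <- function(x, ...) {
  values <- paste0("<value>", escape_xml(as.character(x)), "</value>",
                   collapse = "")
  paste0("<vector mode=\"", mode(x), "\">", values, "</vector>")
}

# Factors record their levels first and then the integer codes.
toXML.factor <- function(x, ...) {
  levels_xml <- paste0("<level>", escape_xml(levels(x)), "</level>",
                       collapse = "")
  codes_xml <- paste0("<value>", as.integer(x), "</value>", collapse = "")
  paste0("<factor><levels>", levels_xml, "</levels>", codes_xml, "</factor>")
}

print(toXML(1:3))
print(toXML(factor(c("a", "b", "a"))))
```

Each method returns a single string, so callers can treat the output of
any class uniformly.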
XML Streams
A more challenging, and potentially flexible, facility is to have an
object that acts as an XML filter. It takes objects as input at
different times in the session and generates XML output appropriate
for the object. Depending on how it is created, it then passes this
new output to one or more listeners so that it can be rendered,
stored, transmitted, etc. In this way, we may have the output appear
in a browser as the object is displayed in the session. The browser
may mark up the XML in interesting ways to assist the user in, for
example, displaying the session in outline mode, connecting variables
as input to and output from different commands, clicking on an icon to
activate the associated graphics device, etc.
filter$write(read.table("data.txt"))   # "data.txt" is a placeholder file name
filter$write(1:10)
filter$write(factor(rbinom(100, 1, 0.5)))
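One way such a filter might be organized is as a closure holding a list
of listener functions. This is only a sketch under assumptions: the
names makeXMLFilter() and simpleXML() are hypothetical, and simpleXML()
is a placeholder for a real serializer such as a toXML() generic.

```r
# Placeholder serializer standing in for a real toXML() generic.
simpleXML <- function(obj) {
  paste0("<object class=\"", class(obj)[1],
         "\" length=\"", length(obj), "\"/>")
}

# Hypothetical constructor: the filter converts each object to XML and
# passes the resulting string to every registered listener.
makeXMLFilter <- function(listeners = list(), serialize = simpleXML) {
  list(
    write = function(obj) {
      xml <- serialize(obj)
      for (listener in listeners) listener(xml)
      invisible(xml)
    }
  )
}

# Two listeners: one renders to the console, one stores the output.
log <- new.env()
filter <- makeXMLFilter(listeners = list(
  function(xml) cat(xml, "\n"),
  function(xml) log$items <- c(log$items, xml)
))

filter$write(1:10)
filter$write(factor(c("a", "b")))
```

Because the listeners are ordinary functions, rendering in a browser,
writing to a file, or sending over a connection differ only in the
function that is registered.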
A filter as described above needs to create well-formed and valid
XML. To do this, it must have some knowledge of a DTD to use. There
are two possible ways to do this. One is to create functions and data
structures that have a particular DTD encoded in them. The
alternative is to have a general mechanism for reading DTDs and
interpreting them. The former requires work to be done for each DTD
and also causes potential problems regarding synchronization between
the external description of the DTD and the local data structures.
Thus the second is preferred. This allows us to write some very
general facilities which operate on arbitrary DTDs and validate
content by reading the DTD description itself.
This is similar to the style used in the Emacs mode PSGML.
We can access the information within a DTD locally using the
parseDTD() function and the argument of the same name to
xmlTreeParse(). The DTD elements returned by both are identical, so
we describe the value returned by parseDTD(). Before this, we give a
very brief overview of what is in a DTD and what we can expect to see
in the user-level objects.
parseDTD()
This function takes the name of a file which is expected to contain a
Document Type Definition (DTD). This file is parsed and the resulting
tables of element and entity definitions are converted to lists of
user-level objects. The return value is a list containing two
sub-lists, one for the elements and one for the entities.
(In the case of the DTD being returned via the function
xmlTreeParse(), both the internal and external DTD tables are
returned. Each of these is as described here.)
Entities
There are two types of entities - internal and external/system
entities. The former are used as simple text substitution or macro
facilities. They allow one to define a segment of text that is used in
several places in a document or elsewhere in the DTD (such as
attribute lists). Rather than repeating the text and having to
modify multiple instances of it should it need to be changed, one uses
entities to parameterize the segment.
Internal entities are defined something like
<!ENTITY % foo "my text to be repeated">
Internal entities of this form are converted to user-level objects of
class XMLEntity. Each of these has 3 fields. The name field is the
identifier used to refer to the entity. The value field is the
expansion of the macro. The original field is the unexpanded value,
which means that if the value contains references to other entities,
these will not be expanded.
For example, the entries in the DTD
<!ENTITY % bar "for R and S">
<!ENTITY testEnt "test entity %bar;">
produce the XMLEntity object
$name
[1] "testEnt"
$value
[1] "test entity for R and S"
$original
[1] "test entity %bar;"
attr(,"class")
[1] "XMLEntity"
The names of the entities list are the identifiers of the entities.
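The relationship between the original and value fields can be
illustrated with a small sketch. The parser performs the real
expansion; the helper expand_parameter_entities() below is a
hypothetical stand-in that makes a single pass and leaves unknown
references untouched.

```r
# Replace each %name; reference with its value, taken from a named
# character vector of entity definitions (hypothetical helper).
expand_parameter_entities <- function(text, entities) {
  for (name in names(entities)) {
    text <- gsub(paste0("%", name, ";"), entities[[name]], text,
                 fixed = TRUE)
  }
  text
}

entities <- c(bar = "for R and S")
original <- "test entity %bar;"
value <- expand_parameter_entities(original, entities)
print(value)
```

Keeping both forms allows a tool to reproduce the DTD as written while
still working with the fully expanded text.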
External Entities
External entities are similar to regular internal entities but refer
to text expansions that reside outside of this file. The location may
be another file, a URL, etc. These are returned with a class
XMLExternalEntity. This has the same fields as the class XMLEntity,
but the interpretation of the value field is left to the user-level
software. One can use scan(), url.scan, and other functions for
reading the value of the remote content.
Elements
While the entities usually appear at the start of the DTD and are
important for building flexible, useful DTDs and documents, the most
important aspect of a DTD is the collection of elements that define
the structure of a document that "obeys" the DTD and how the different
pieces (nodes) of the document fit together. These are element
definitions and each specifies firstly, the list of attributes that
can be used within that element and their types, and secondly what
other elements can be nested within this one and in what order. We
will not try to explain the structure of a DTD in this document. See
W3.org for resources explaining
the structure at various different levels.
A basic element definition has the following components
<!ELEMENT name content>
The name is the text used to introduce it in an XML document as in
<name> </name>
<name />
The content is the most complicated aspect of an element, but it is
relatively simple to understand in most cases.
It indicates the possible combinations of elements
that can be nested within this element. It allows the author of the
DTD to specify an ordering of the sub-elements as well as limited
control over the number of such elements one can use in any position.
The three basic structures used in the content definition are
- another element,
- a set of elements of which one can be used, and
- an ordered sequence of elements and composite structures.
Each of these three can be qualified by an occurrence qualifier which
controls the number of such types to expect in this position.
- By default, just one is expected.
- (content)+ means that at least one is expected, but there can
  be any number of structures matching this content description
  after the first one.
- (content)* means that zero or more are expected.
- (content)? means zero or one.
The following example illustrates all of the basic features
<!ELEMENT entry3 ( (variables | (tmp, x)), (record)* , (a*, b,c,d, (e|f)) , (foo)+ ) >
Here we define an element named entry3. This has 4 basic types that
can be nested within it, in a specific order. First, we must have
either a variables element or the pair tmp followed by x. There
should be exactly one of these two alternatives. This is followed
optionally by any number of record element instances. After this,
there must be a sequence of element instances: any number of a
(possibly none), then b, c, d, and either e or f. And finally, we
must have one or more foo entries.
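Since a content model is essentially a regular expression over element
names, one way to see what the entry3 definition accepts is to
translate it by hand into a regular expression and test sequences of
child names against it. This is only an illustration; the names
entry3_pattern and matches_entry3() are assumptions, and a validating
parser does not necessarily work this way.

```r
# Hand-translated pattern for the content model of entry3. Each child
# name is followed by one space so that names cannot run together.
entry3_pattern <- paste0(
  "^",
  "(variables |tmp x )",   # (variables | (tmp, x)) : exactly one
  "(record )*",            # (record)*              : zero or more
  "(a )*b c d (e |f )",    # (a*, b, c, d, (e|f))   : nested sequence
  "(foo )+",               # (foo)+                 : one or more
  "$"
)

# TRUE if the character vector of child element names is acceptable.
matches_entry3 <- function(children) {
  encoded <- paste0(children, " ", collapse = "")
  grepl(entry3_pattern, encoded)
}

print(matches_entry3(c("variables", "b", "c", "d", "e", "foo")))
print(matches_entry3(c("tmp", "x", "record", "a", "a", "b", "c", "d",
                       "f", "foo", "foo")))
print(matches_entry3(c("variables", "b", "c", "d", "e")))  # no foo
```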
The attributes an element supports are listed separately
via an ATTLIST declaration
<!ATTLIST element-name
attributeId type default
...
>
The structure returned from parsing and converting a DTD to a
user-level object is quite simple. It is a list of length 2, one for
the entities and the other for the elements within the DTD. If the
DTD object comes from a document, it separates the entities and
elements defined locally or internally in the document and those in
the external DTD if there is one. This results in a list of length 2
which contains the internal and external DTDs. Each of these is then a
list of length 2 with the entities and elements.
The entities element in a DTD is a named list. The names are the
identifiers for the entities.
Each entry in this list is an object of class XMLEntity or
XMLExternalEntity. In either case, each has 3 fields: name, value
and original. The name is the identifier of the entity.
The value is the text used to substitute in place of the entity
reference. The original field is for use when reproducing
or analyzing the DTD. If the value contains references to other
entities, this field reflects that and is the unexpanded or literal
version of the entity definition as it appears in the DTD document.
The elements list is also a named list, with the names being those of
the elements. Each entry in the list is an object of class
XMLElementDef. These contain 4 fields:
- name: the name of the element.
- type: this will almost always be 1, indicating an ELEMENT_NODE. An
  explanatory string is used as the name for this integer enumeration
  value.
- contents: an object defining the restriction on the sub-elements
  that can be nested within this element. This is of class
  XMLElementContent and has 3 fields:
  - type: a named integer value (with the name providing a
    description of the meaning) indicating the type of content. The
    usual ones are PCData, Sequence, Element, Or, and so on.
  - ocur: a named enumerated value indicating how many instances of
    this content are expected and admissible. These are Once,
    Zero or One, Mult and One or More.
  - elements: a list of XMLContent objects that describe the feasible
    sub-elements within this element being defined. These are
    usually specializations of the class XMLContent: XMLOrContent,
    XMLElementContent, XMLSequenceContent. These have the same
    structure, just different meaning and semantics.
- attributes: a named list of XMLAttributeDef objects, with the names
  being those of the attributes being defined for this element.
The result of converting the content definition of entry3 above is
given below. It is an object of class XMLSequenceContent. Hence, its
type field is a named integer with value 3 and name Sequence. Since
the entire content has no qualifier, the ocur field is Once. Now we
look at the sub-elements, accessible from the elements field. This is
a list of length 4, one for each term in the sequence. The classes of
the objects may help to explain its structure.
sapply(d$elements$entry3$contents$elements, class)
[1] "XMLOrContent" "XMLElementContent" "XMLSequenceContent"
[4] "XMLElementContent"
Let's look at the third entry, the XMLSequenceContent object.
r <- d$elements$entry3$contents$elements[[3]]
Again, this is a sequence. Its sub-entries are of different content
classes.
sapply(r$elements, class)
[1] "XMLElementContent" "XMLElementContent" "XMLElementContent"
[4] "XMLElementContent" "XMLOrContent"
The first 4 are reasonably obvious. These identify single elements
and are the primitive content types.
> r$elements[[1]]
$type
Element
2
$ocur
Mult
3
$elements
[1] "a"
attr(,"class")
[1] "XMLElementContent"
We see that the expected element is a and that there can be
zero or more of these.
The more interesting entry is the last one.
Its primitive display is given below.
$type
Or
4
$ocur
Once
1
$elements
$elements[[1]]
$type
Element
2
$ocur
Once
1
$elements
[1] "e"
attr(,"class")
[1] "XMLElementContent"
$elements[[2]]
$type
Element
2
$ocur
Once
1
$elements
[1] "f"
attr(,"class")
[1] "XMLElementContent"
attr(,"class")
[1] "XMLOrContent"
We see that it is of type Or and that we expect exactly one instance
of it. It is interpreted by expecting any one of the content
structures described in its elements list. Each of these is a simple
XMLElementContent object and so is a "primitive".
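Because every content object has the same three fields, a short
recursive function is enough to walk the structure. The sketch below
rebuilds the third entry from plain lists and converts it back to DTD
notation. The constructors make_content() and elem() are hypothetical
and omit the class attributes; the integer codes that do not appear in
the printed output above (PCData, Zero or One, One or More) are
assumptions, and only the names are actually used.

```r
# Named codes: only the names matter for this sketch.
type_codes <- c(PCData = 1L, Element = 2L, Sequence = 3L, Or = 4L)
ocur_codes <- c(Once = 1L, "Zero or One" = 2L, Mult = 3L,
                "One or More" = 4L)

# Hypothetical constructors mimicking the printed structure.
make_content <- function(type, ocur, elements) {
  list(type = type_codes[type], ocur = ocur_codes[ocur],
       elements = elements)
}
elem <- function(name, ocur = "Once") make_content("Element", ocur, name)

# Recursively convert a content object back to DTD notation.
content_to_string <- function(x) {
  kind <- names(x$type)
  body <- if (kind == "Element") {
    x$elements
  } else if (kind == "PCData") {
    "#PCDATA"
  } else {
    sep <- if (kind == "Or") " | " else ", "
    parts <- vapply(x$elements, content_to_string, character(1))
    paste0("(", paste(parts, collapse = sep), ")")
  }
  suffix <- switch(names(x$ocur),
                   "Once" = "", "Zero or One" = "?",
                   "Mult" = "*", "One or More" = "+")
  paste0(body, suffix)
}

# The third term of entry3: (a*, b, c, d, (e|f))
r <- make_content("Sequence", "Once", list(
  elem("a", "Mult"), elem("b"), elem("c"), elem("d"),
  make_content("Or", "Once", list(elem("e"), elem("f")))
))
print(content_to_string(r))
```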
Back to the Filter
Armed with the contents of a DTD, XML output generated via a filter
can now be validated easily. Suppose the following command is issued
via the filter. (Such commands will most likely be issued indirectly
via higher-level commands.)
filter$output("variable", c(unit="mpg"), value)
Then, the filter will check its current state, specifically the
last open/unfinished element, and examine its content specification.
If the previous command was something like
filter$open("variables", numRecords=nrow(data))
then the filter will extract the list of possible entries for this
tag.
dtd$elements[["variables"]]$contents$elements
Then it determines whether the element variable can be added.
In the case of a dataset, this is a simple lookup.
The only acceptable value is a variable element.
> d$elements$variables$contents
$type
Element
2
$ocur
Mult
3
$elements
[1] "variable"
attr(,"class")
[1] "XMLElementContent"
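The lookup can be sketched as follows, using a plain list that mimics
the structure printed above. The functions allowed_names() and
can_add() are hypothetical, and this simplified check considers only
which element names may appear, ignoring their order and the
occurrence qualifiers.

```r
# Collect the element names mentioned anywhere in a content object.
allowed_names <- function(content) {
  kind <- names(content$type)
  if (kind == "Element") return(content$elements)
  if (kind == "PCData") return(character(0))
  unique(unlist(lapply(content$elements, allowed_names)))
}

# TRUE if 'child' may appear somewhere inside the open element.
can_add <- function(dtd, open_element, child) {
  def <- dtd$elements[[open_element]]
  !is.null(def) && child %in% allowed_names(def$contents)
}

# A mock DTD holding only the definition of the variables element.
dtd <- list(elements = list(
  variables = list(
    name = "variables",
    contents = list(type = c(Element = 2L), ocur = c(Mult = 3L),
                    elements = "variable")
  )
))

print(can_add(dtd, "variables", "variable"))
print(can_add(dtd, "variables", "record"))
```

A complete filter would also track how many children have already been
written, so that ordering and counts can be enforced.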
Duncan Temple Lang,
duncan@wald.ucdavis.edu
Last modified: Mon Sep 30 10:46:17 EDT 2002