A Statistical Engine within an XSL Translator: S and Xalan
Duncan Temple Lang
01/2001
- The Basics of XML and XSL
- Connecting S and XSL
  - Initializing R
  - Built-in S Functions
  - Find Functions
  - Registering New Functions
  - In XML
  - In S
- Converting Non-Primitive Objects
- Examples
- Embedding Other Systems
  - Omegahat/Java
  - Access to relational database management systems
  - Python
  - Perl
  - When Embedding isn't Possible
- Other XSL Translators
- Issues and Questions
- Examples
  - Plots
  - Conditionals
  - Footnotes
Abstract
XML will prove to be an
important tool for statisticians, both in how data are exchanged
and in publishing. Research papers, data analyses and
educational material can all be developed in a way that better
supports the inclusion of code and output from statistical
software. XSL is the tool
that converts the XML to a more readable form targeted at human
readers. The ability to dynamically create and integrate the output
from statistical procedures into documents can benefit greatly from an
interface between XSL and the statistical software. We describe the
simple mechanism by which the S language is embedded within the XSL
translator Xalan and how
this can be used to create reproducible documents that are in many
senses both reports and programs.
Xalan provides a convenient mechanism for adding functions to the XSL
engine. (At this time, there is no way to add new directives as
elements.) We can add any number of functions which map to R
functions, and also allow calling arbitrary S functions. For example,
we may provide access to mathematical functions such as
exponentiation, log, etc.; random number generators; file access;
string manipulation; model fitting; and so on. Additionally, we may
provide access to the evaluator by providing an
eval function in XSL which passes the
string (or XML nodeset) to the S evaluator.
The idea is that people write XSL rules with which to process nodes in
an XML document. These rules can call S functions as if they were
built-in functions of the XSL translator. Calling these functions
can be used to generate output in R that is inserted into the XML
being created. Additionally, one can call S functions within XSL
conditional expressions in order to control the XSL processing and
what appears in the target document.
One can call individual functions and also evaluate S expressions
provided as strings. The arguments to these calls can be literal
values specified in the XSL file or parameters specified on the
command line. Additionally, the inputs can come from the XML file
being processed and the nodes available to the XSL rule. And finally,
the input values can be variables in S that were created in earlier
calls. This allows us to integrate inputs from a variety of different
sources and to also treat the S session as a worksheet with connected
data that persists for the duration of the transformation.
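The worksheet-like behaviour described above can be sketched in plain R. This is only an illustration under assumptions: evalString is a hypothetical stand-in for whatever the XSL eval function does on the S side, and session is a hypothetical environment playing the role of the persistent S session.

```r
# Hypothetical stand-in for the S side of the XSL eval function: every
# call evaluates its string in the same environment, so values created
# by one call remain visible to later calls.
session <- new.env()

evalString <- function(code, env = session) {
  eval(parse(text = code), envir = env)
}

evalString("x <- c(2, 4, 6)")             # e.g. a literal in the XSL file
evalString("cutoff <- 3")                 # e.g. a command-line parameter
result <- evalString("sum(x > cutoff)")   # uses both earlier values
```

Here result counts the elements of x above cutoff, using two variables that were created by separate, earlier calls.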
1. Initializing R
When one calls an R-XSL function (either directly or indirectly), the
R session will be initialized. By default, this is done with a single
argument --silent which avoids verbose output from R.
Initializing the session in this fashion will typically work well
for most applications. The usual R startup will take place, including
looking for a .Rprofile, etc.
(To do: check whether .Renviron is read in a shell script
or from the C code.)
If one needs to do some computations before others are performed such
as initializing variables, loading libraries, etc. then one can do
these in the XML or XSL file via <code> tags.
However, there may be occasions on which one needs to control
how the R session is initialized.
We support several ways to do this.
The first two of these are implemented; the third is just a design.
- Command line
One can identify arguments on the command line that are to be passed
to R using the --R argument. All arguments after this
(and before any other flags identifying arguments for another system)
are then gathered together and passed to R when it is initialized.
Rxslt -in foo.xml -xsl foo.xsl --R --gui=none --no-site-file
- r:init()
We provide an XSL function that initializes
the R engine, passing the arguments to this XSL function
as command line arguments for R.
One can call this function from within an XSL rule.
For example,
<xsl:template match="article">
  <xsl:if test="r:init('--gui=none','--no-site-file') &lt; 0">
    Error initializing R
  </xsl:if>
</xsl:template>
Note that we don't have an argument 0 referring to the name of the
program by which the process was invoked.
- XML
Finally, one can put the arguments in an XML file and pass this to
Rxslt.
The file, say Rinit.xml, should look something like
<s:init>
  <arg>--gui=none</arg>
  <arg>--min-vsize=4M</arg>
</s:init>
Then one identifies this as the file that should be read to
get the arguments for initializing S
using the -R-init-file flag,
as in
Rxslt -in foo.xml -xsl foo.xsl -R-init-file Rinit.xml
The advantages of the different approaches relate to what is changing
most frequently. If one needs to specify different initialization
arguments per run, the command line works well. If one needs
different initialization arguments for different XSL filters, putting
them into the specific XSL files using
r:init is the simplest way to do this.
And if one wants to re-use the initialization steps across different
XSL and XML input files, specifying them in a separate, reusable file
via the -R-init-file flag is most convenient.
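The command-line convention described above (everything after --R, up to the flag for another system, goes to R) can be sketched as follows. This is an illustration only: extractRArgs is a hypothetical helper, and the flags in otherSystemFlags are invented placeholders, since the text does not name the flags used for other systems.

```r
# Hypothetical helper: collect the arguments that follow --R and precede
# the next flag introducing arguments for another embedded system.
# The default values of otherSystemFlags are invented for illustration.
extractRArgs <- function(args, otherSystemFlags = c("--Python", "--Perl")) {
  start <- match("--R", args)
  if (is.na(start)) return(character(0))      # no --R flag at all
  rest <- args[-seq_len(start)]               # everything after --R
  stop <- match(TRUE, rest %in% otherSystemFlags)
  if (!is.na(stop)) rest <- rest[seq_len(stop - 1)]
  rest
}

cmd <- c("-in", "foo.xml", "-xsl", "foo.xsl",
         "--R", "--gui=none", "--no-site-file")
rArgs <- extractRArgs(cmd)
```

With the command line shown earlier, rArgs holds the two arguments that would be handed to R at initialization.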
2. Built-in S Functions
We have added some specific functions that you can call
as if they were built-in XSL functions.
These are
- eval
Passes its argument, a string (or XML nodeset), to the S evaluator.
- source
This calls a special version of the source function
which executes the standard one and discards the return value.
- sqrt
Invokes the square root function. It should be given a number
which can be specified either as a literal, or as the return value
from invoking the number function in XSL.
- date
Calls the date function, returning a string containing the
current date and time.
It is more efficient and flexible to explicitly register these
routines with the XSL translator than to let Xalan fail to find them
and then assume they are S functions. The mechanism is easy:
one simply adds the name of the S function to a
list that is registered when the XSL transformer starts.
3. Find Functions
We can extend the mechanism for
resolving external function references in the Xalan engine. We can do
this by providing a method for the XSLT engine's mechanism for
matching names to functions and having it query S to see if there is a
function bound to the name of interest. For example, if the author of
the XSL input calls foo, we look it up in
the table of external functions. If it is not a built-in function
(either for the basic XSL translator or our extended version of it)
and hence not in the table, we then use a second lookup,
the C-level equivalent of
get("foo", mode="function")
The functionAvailable method in
XPathEnvSupportDefault appears to be the one of
interest. We want to extend this class and have an instance of our
S-specific version be instantiated. The implementation of the method
in the new class is simply a call to the inherited method. If this
returns false, we call the exists function in
S to see if such a function exists. We also extend the method
findFunction to actually find the
function. We can be smart about storing the information about which
functions are found by the call to exists in
S and which are regular XSLT functions.
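The two-stage lookup described above, including the idea of remembering which names were resolved through S, can be sketched in R itself. This is a sketch under assumptions, not the C++ implementation: registered, resolvedInS, isFunctionAvailable and lookupFunction are hypothetical names, and the table of registered functions is reduced to an environment holding one entry.

```r
# Hypothetical table of functions already known to the XSL translator.
registered <- new.env()
assign("sqrt", sqrt, envir = registered)

# Cache of answers obtained from S, so exists() runs once per name.
resolvedInS <- new.env()

isFunctionAvailable <- function(name) {
  # Stage 1: the translator's own table.
  if (exists(name, envir = registered, inherits = FALSE)) return(TRUE)
  # Cached answer from an earlier query to S.
  if (exists(name, envir = resolvedInS, inherits = FALSE))
    return(get(name, envir = resolvedInS, inherits = FALSE))
  # Stage 2: ask S whether a function is bound to this name.
  found <- exists(name, mode = "function", envir = globalenv())
  assign(name, found, envir = resolvedInS)
  found
}

lookupFunction <- function(name) {
  if (exists(name, envir = registered, inherits = FALSE))
    return(get(name, envir = registered, inherits = FALSE))
  if (!isFunctionAvailable(name)) stop("no function named ", name)
  get(name, mode = "function", envir = globalenv())
}
```

Caching the negative answers as well means that a misspelled name costs only one query to S, however often it appears in the stylesheet.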
The question that remains is how to cause our new class to be used.
The main routine currently creates an instance of
XPathSupportDefault.
It appears that we can extend the class
XSLTProcessorEnvSupportDefault and change
its m_defaultSupport field.
If we can set this field, we can override the way that we find
functions. Part of the current problem is this delegation. The
XSLTProcessorEnvSupportDefault class has methods
for findFunction and
extFunction but it delegates these calls
to the XPathEnvSupportDefault instance
it has as a field. If we can set that field to an instance of a
class that extends XPathEnvSupportDefault
and performs the lookup mentioned above, then all will work.
It turns out that this is difficult to achieve because of the way
the Apache code is structured. While we can override the
XPathEnvSupportDefault, we cannot set the
field m_defaultSupport in the
XSLTProcessorEnvSupportDefault since it is
private. By making it
protected or by providing a set
accessor method for it, we can set it in the constructor. However,
this causes problems since the assignment operator for the
XPathEnvSupportDefault class is private
(and is not implemented).
The current solution is to extend the
XSLTProcessorEnvSupportDefault class and
override the extFunction method. We call
the default one (which uses delegation to the
m_defaultSupport) and catch the exception
this throws if the function is not found. In the exception handler,
we instantiate a new NamedRFunction and
execute it. This does incur the overhead of an exception throw,
which would be nicer to avoid.
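The control flow of this workaround (try the default lookup, and fall back to S only when the lookup itself fails) can be illustrated in R. This is an analogy, not the C++ code: builtins, callBuiltin and the functionNotFound condition class are hypothetical, and extFunction here merely borrows the name of the overridden method.

```r
# Hypothetical table standing in for the translator's own functions.
builtins <- list(concat = function(...) paste0(...))

# Default dispatch: signals a dedicated condition when the name is unknown.
callBuiltin <- function(name, args) {
  f <- builtins[[name]]
  if (is.null(f)) {
    stop(structure(
      class = c("functionNotFound", "error", "condition"),
      list(message = paste("no XSL function named", name), call = NULL)
    ))
  }
  do.call(f, args)
}

# Overriding dispatch: only the "not found" condition triggers the
# fallback to an S function of the same name; other errors propagate.
extFunction <- function(name, args) {
  tryCatch(
    callBuiltin(name, args),
    functionNotFound = function(e) {
      do.call(get(name, mode = "function", envir = globalenv()), args)
    }
  )
}
```

Catching a specific condition class rather than every error keeps genuine failures inside a built-in function from being silently retried in S.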
The <paste> example in
test/report.xml and test/report.xsl
demonstrates how to use this.
4. Registering New Functions
As with all inter-system interfaces, there is an issue about how to
convert XSL arguments to S values and how to return the result of the
function. In many cases, we know a great deal about the functions and
what type of arguments they expect and return.
We allow the user to "register" these functions.
The user can specify
- the name of the function
- the number of arguments
- the details of the arguments, consisting of any or all of
- the argument name
- a default value
- a type
In this context,
there are 5 possible types of values
for the arguments:
- node lists
- result tree fragment
- boolean
- number
- string
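The information in a registration can be pictured as a small record. The sketch below is an assumption about shape only: makeArgument and makeRegistration are hypothetical helpers, and the type labels in xslTypes are our own spellings of the five types listed above.

```r
# Labels for the five XSL value types listed above (spellings are ours).
xslTypes <- c("nodelist", "fragment", "boolean", "number", "string")

# Any of name, default and type may be omitted for an argument.
makeArgument <- function(name = NULL, default = NULL, type = NULL) {
  if (!is.null(type)) type <- match.arg(type, xslTypes)  # reject unknown types
  list(name = name, default = default, type = type)
}

makeRegistration <- function(name, arguments = list()) {
  list(name = name, nargs = length(arguments), arguments = arguments)
}

roundSpec <- makeRegistration("round", list(
  makeArgument("x", type = "number"),
  makeArgument("digits", default = 0, type = "number")
))
```

Knowing the declared type of each argument is what lets the interface convert an XSL value to the appropriate S value before the call.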
Currently, one specifies the details of the S-XSL functions in a
separate file. These are first processed by the XSL translator by
applying a specialized (and perhaps compiled) XSL stylesheet to read
them. These register the functions with the XSL translator. Then the
real target file is processed.
In the future, we may decide to extend the XSL elements to support an
xsl:register element. This would allow the
registration of the functions to be performed within the XSL file (or
one that is included in the main XSL input file). We are hesitant to
do this as it would make our XSL non-standard. However, the
developers of Xalan have indicated that it will support element
extensibility and so this will be more common.
We should note that specifying all the details of a function may be
cumbersome and inconvenient.
A simpler version of the registration would allow a signature to be specified
without detailing the argument names, etc.
For example,
<sxsl:function>
<name>round</name>
<signature>N,I</signature>
<signature value="N,I"/>
</sxsl:function>
For the moment, we require the full form. Since this information is
specified infrequently and one can avoid registering the function at
all, this hardly seems like a serious problem. In the future we may
add some shortcuts.
Alternatively, we can use S's reflectance mechanism and its ability to
create XML document fragments to create much of the registration
specification. Using the S4 methods, one also has information about
types of arguments and the specification can be completely determined
by S.
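As a rough illustration of generating a registration from S itself, the sketch below inspects a function's formal arguments and writes out a specification. The helper describeFunction and the arg element with its name and default attributes are hypothetical; only the sxsl:function and name elements appear in the example above. Default values are not escaped for XML here, which a real version would need to do.

```r
# Hypothetical helper: build a registration fragment from a function's
# formal arguments. args() is used so that primitives are handled too.
describeFunction <- function(name) {
  params <- formals(args(get(name, mode = "function")))
  # Deparse each default; an argument without a default gives "".
  defaults <- vapply(params,
                     function(p) paste(deparse(p), collapse = " "),
                     character(1))
  argLines <- sprintf('  <arg name="%s"%s/>', names(params),
                      ifelse(nzchar(defaults),
                             sprintf(' default="%s"', defaults), ""))
  unname(c("<sxsl:function>",
           sprintf("  <name>%s</name>", name),
           argLines,
           "</sxsl:function>"))
}

# An ordinary user-defined function to describe.
trimmedMean <- function(x, trim = 0.1) mean(x, trim = trim)
spec <- describeFunction("trimmedMean")
writeLines(spec)
```

Argument types are not available from formals alone; as noted above, they would have to come from S4 method signatures.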
We can also provide facilities for registering XSL functions from
within S. These would add the entry to the XSL translator's function
table, making them available to the XSL engine and the stylesheet.
The purpose behind this interface is to provide S programmers with a
familiar and powerful mechanism to specify the details of these
functions. It is more convenient to specify information about
functions within that language, as this can be done programmatically
rather than manually. Also, it is more convenient since S is a more
powerful programming language than the XML/XSL combination.
The basic idea is simple. We provide a function,
registerXSLFunction which allows the caller
to specify the details of the XSL function to be registered. These
include the name, signature, return type and argument names and
default values.
registerXSLFunction <-
function(name, returnType=NULL, signature="", names=NULL, defaults=NULL) {
  # Placeholder: the C routine name and its arguments are not yet filled in.
  .Call("",)
}
It is obvious how to convert numbers, strings and logical values
between XSL and S. This gets slightly more complicated when returning
vectors of these types containing more than one value from S to XSL.
And it is significantly less obvious how to handle non-primitive
objects such as named vectors, lists and objects with a class
attribute. We use a relatively simple and hopefully appropriate
approach at present. (It can be easily changed if anyone has any
better suggestions.)
Suppose we execute the expression
summary(x)
where x is a numeric vector. The result is
an object of class table containing
different quantiles and the mean of the collection of numbers. We
want to convert this object to an XML node, or more precisely to a
fragment of a document, that can be inserted into the XML document
being created by the XSL translator.
How do we create this XML from S? Simple. We arrange to have a
method (or part of the standard output or characteristic of the
default input reader) that converts the result objects to XML.
It does this in the obvious manner:
- if there is a method for converting the specific object
to an XML fragment, it uses that; otherwise,
- it uses a default approach by creating an XML representation
by recursing over the structure of the S object.
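The two-branch rule above maps naturally onto method dispatch. The following is a minimal sketch, not the conversion actually used: toXML, escapeXML, the el and item element names, and the interval class are all hypothetical, and the output is a plain string rather than a tree of nodes.

```r
# Escape the characters that are special in XML text and attributes.
escapeXML <- function(s) {
  s <- gsub("&", "&amp;", s, fixed = TRUE)
  s <- gsub("<", "&lt;", s, fixed = TRUE)
  gsub(">", "&gt;", s, fixed = TRUE)
}

toXML <- function(x, tag = "value") UseMethod("toXML")

# Default: recurse over the components of a list; for an atomic vector,
# write one <el> per element and keep element names as an attribute.
toXML.default <- function(x, tag = "value") {
  if (is.list(x)) {
    body <- unlist(lapply(seq_along(x), function(i) toXML(x[[i]], "item")))
  } else {
    labels <- names(x)
    if (is.null(labels)) labels <- rep("", length(x))
    body <- sprintf("<el%s>%s</el>",
                    ifelse(nzchar(labels),
                           sprintf(' name="%s"', escapeXML(labels)), ""),
                    escapeXML(as.character(x)))
  }
  paste0("<", tag, ">", paste(body, collapse = ""), "</", tag, ">")
}

# A class-specific method takes precedence over the default.
toXML.interval <- function(x, tag = "interval") {
  sprintf('<%s lower="%s" upper="%s"/>', tag, x$lower, x$upper)
}

named <- toXML(c(a = 1, b = 2))
nested <- toXML(list(1, "x<y"))
special <- toXML(structure(list(lower = 0, upper = 1), class = "interval"))
```

Names such as "1st Qu." from summary are not valid XML element names, which is why this sketch stores them in an attribute instead.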
When the S representation of the XML fragment is created, this is
passed to the C++ code that brokers the interface between
the XSL translator and S, and this converts it to a
ResultTreeFragment.
Let's consider the following simple XML element.
<s:output>
<code>summary(rnorm(10))</code>
</s:output>
The idea is that we want the output from the S expression
summary(rnorm(10))
to be inserted in place of the code.
We can arrange for this to happen with the
following XSL rule:
<xsl:template match="s:output">
  <xsl:value-of select="r:eval(string(./code))"/>
</xsl:template>
1. Omegahat/Java
No need, since we can integrate it directly with the Java version of
Xalan.
2. Access to relational database management systems.
We might want to include the output of SQL expressions evaluated by relational database
servers within a document.
3. Python.
4. Perl.
5. When Embedding isn't Possible
Why are we
implementing this functionality when it involves writing low-level
code and requires additional thought by the authors? The answer is
relatively simple and is itself a question: what's the alternative?
The same effect could be achieved by passing the original XML
document to S and having it parse the contents (using the XML package) and
process each of the code sub-nodes
within the tree by evaluating its contents and replacing
it with the XML representation of the output.
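To make the alternative concrete, here is a deliberately crude sketch of such an S-only filter. It is a simplification under stated assumptions: rather than parsing the document with the XML package, it finds code elements with a regular expression, and evaluateCodeNodes is a hypothetical name. The output is formatted as plain text rather than converted to XML.

```r
# Hypothetical single-pass filter: replace each <code>...</code> element
# in a string of XML with the formatted result of evaluating its content.
evaluateCodeNodes <- function(xml, env = new.env()) {
  pattern <- "(?s)<code>(.*?)</code>"          # (?s): let . match newlines
  m <- gregexpr(pattern, xml, perl = TRUE)
  chunks <- regmatches(xml, m)[[1]]
  outputs <- vapply(chunks, function(chunk) {
    code <- sub(pattern, "\\1", chunk, perl = TRUE)
    paste(format(eval(parse(text = code), envir = env)), collapse = " ")
  }, character(1), USE.NAMES = FALSE)
  regmatches(xml, m) <- list(outputs)
  xml
}

doc <- "<p>Sum: <code>1 + 2</code>, max: <code>max(4, 9)</code></p>"
filtered <- evaluateCodeNodes(doc)
```

Even in this toy form, the filter sees only the text of each code element and nothing of the surrounding document.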
Since we could do that in the absence of the tools described in this
document, one can still do that. The difficulty with it is that it is
clumsy and, more importantly, the two filters cannot communicate with
each other in a back-and-forth interaction. Firstly, we cannot use
S functions within constructs such as logical expressions within
test clauses of
if, when, ...
elements or the select or
match attributes of
value-of or
template elements.
Perhaps more importantly, while XSL encourages
locality in its rules, the XSL templates can access other nodes in
the document while processing a node. In the two-filter approach, the
S filtering cannot access the other nodes without considerable effort
which would amount to providing an implementation of XSL (or some of
its features) within S.
Also, the two-filter approach, as with so many inter-system interfaces, uses
strings to communicate between the two stages. This reduces the
information within the system by removing the types of the values.
When an application cannot be embedded in the XSL translator,
we can use a remote method invocation mechanism to access objects in a
separate process using CORBA, RMI, DCOM, etc. Additionally, we can
embed the XSL translator into other applications if we can either a)
recompile that application, or b) dynamically load code into
it. Neither is obviously feasible for all applications.
Sablotron
When we use strings in XSL to represent variable names in R,
we cannot tell the difference between a literal and
a variable name unless we require users to quote
string literals explicitly. For example, in the XML snippet
<tag attr1="x" attr2="'x'">...</tag>
the first attribute is intended to be a variable name and
the second is a literal string.
We may want to experiment with each in examples to determine which is
the more useful default.
Alternatively, we can introduce a function,
e.g. r:variable
which would take a name as input and return an object that would tag
it as a variable. The issue here is how to represent that within the
XSL translator (as a return value).
And another, perhaps significantly simpler, approach
is to use substitute,
do.call and their symbolic friends.
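One way the symbolic approach might look is sketched below, assuming the quoting convention from the snippet above: a value wrapped in single quotes is a literal, anything else names an S variable. The helper buildCall is hypothetical, and it uses as.name and as.call rather than substitute to construct the unevaluated call.

```r
# Hypothetical helper: turn attribute values into an unevaluated S call.
# 'x' (in single quotes) becomes the string "x"; a bare x becomes the
# symbol x, to be looked up when the call is finally evaluated.
buildCall <- function(fname, values) {
  args <- lapply(values, function(v) {
    if (grepl("^'.*'$", v)) substring(v, 2, nchar(v) - 1) else as.name(v)
  })
  as.call(c(list(as.name(fname)), args))
}

session <- new.env()
assign("x", c(1, 2, 3), envir = session)

call <- buildCall("paste", c("x", "'x'"))   # the call paste(x, "x")
pasted <- eval(call, envir = session)
```

Because the distinction is carried by the call object itself (a symbol versus a string), nothing special has to be represented inside the XSL translator.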
1. Plots
We can use an XML document as a report template. We can then
regularly create an actual report from the template based on the
current data, such as the daily values, etc. For example, we take a
simple document that produces summaries of two variables obtained from
a database when the document is generated. It provides summary
statistics for both variables, a histogram of each variable and a
scatterplot. It writes the data
Note that this can be done in a regular programming language such as
S, Perl, etc. However, it may not be as convenient since we are
removing the document layout aspect from that environment. We cannot
for example easily spell-check any text in the document. We cannot
edit the document easily. By having inserts into the document, we can
allow the authors to use these code segments in different ways. This
allows people to easily change the appearance of the report without
re-programming in lower-level languages.
2. Conditionals
We can use S within the test conditions of <if> and <choose> elements. We might want
to insert text conditionally based on the value of the specified node.
For example, suppose we are processing temporal data
and want to render values that are earlier than a specified date
in red, while observations after that date are rendered in blue.
We may provide this date as a parameter or have it as a value in S.
Then, a fragment such as the following does this:
<xsl:element name="font">
  <xsl:attribute name="color">
    <xsl:if test="r:compareDates(,$cutOff) &lt; 0">red</xsl:if>
    <xsl:if test="r:compareDates(,$cutOff) &gt; 0">blue</xsl:if>
  </xsl:attribute>
  <xsl:apply-templates select="date"/>
  <xsl:apply-templates select="value"/>
</xsl:element>
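The S function behind the compareDates call in these tests is not shown; one plausible definition is sketched here. It is an assumption throughout: we suppose both arguments arrive as strings in a format that as.Date understands (such as ISO yyyy-mm-dd) and that the result should be -1, 0 or 1.

```r
# Hypothetical definition: negative if a is earlier than b, zero if the
# dates are equal, positive if a is later than b.
compareDates <- function(a, b) {
  as.integer(sign(as.numeric(as.Date(a) - as.Date(b))))
}

earlier <- compareDates("2001-01-15", "2001-02-01")
later <- compareDates("2001-03-01", "2001-02-01")
```

Returning a signed integer lets the XSL tests compare the result with zero, as the fragment above does.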
- These correspond to the types that
are supported by the XObject class.
- This is feasible, especially since
XSL is written in XML and so we can read both the XML and XSL
documents. Implementing the XSL actions requires significantly more
labour.