Abstract
This document describes the philosophy that motivates the use the
RDCOM packages and contrasts that with other approaches.
Specifically, it describes the need for being able to expose S
functionality and values as regular COM objects rather than simply
expose the R command interpreter.
Inter-system interfaces are very attractive. They allow programmers
in one language or system to readily use code in another language or
system with very little overhead. They do not have to translate the
original code into their own language. Instead, they only have to
install the inter-system mechanism and invoke the relevant code.
There are a lot of these inter-system interfaces available now, and
this is also true for S, and specifically R. We can call Java, Perl,
Python, Octave, XLisp, Objective-C, (D)COM from S
and, importantly, vice-versa.
When people think of making S available to other programming
languages, they typically suggest that one should be able to send S
commands from these other languages. S will evaluate them and one can
generate plots or write to files, and in some cases even get back the
results from evaluating the command. As an example, one might write
some Visual Basic code to generate some random numbers in R.
Dim x As Object
R = CreateObject("R.Evaluator")
x = R.evaluate("rnorm(10)")
|
---|
So far this is quite easy.
The array is sent back to the Visual Basic function
from R and the values are immediately available.
There are four ways in which this style of computing
can be limiting.
<java:class>java.awt.ActionListener</java:class>IConnectionPoint
-
The programmer must know the S language and its syntax to be able to
construct commands. This defeats the purpose of programming in this
other language. Instead of having a single programming language to
deal with and debug, the programmer must switch between two and deal
explicitly with both. While it is beneficial to have the additional
functionality, the interface places the burden on each user of the
interface, not the interface itself. Instead, we want the programmer
to be able to think of the additional functionality in the usual terms
of her existing and familiar programming language. So in Visual Basic,
for example, we want S functions to appear as if they are VB routines.
-
In addition to having to know the S language, the programmer in the
other system must also construct these commands as part of their code.
They must use data that is local to their program to parameterize the
S commands to get the desired behavior and they must do this by
pasting together elemens of the command as strings. For example,
suppose in our example above that we wanted to generate a sample of
size n rather than the fixed and constant value
10. n is a variable in the local program, say in
Java. In this case, we have to construct the command
as a string something like
<java:code>
cmd = "rnorm(" + n + ")"
</java:code>
If we have three arguments to the function (e.g. the mean and standard
deviation), we would do something
<java:code>
cmd = "rnorm(" + n + ", " + mean + ", " + sd + )"
</java:code>
This starts to get more complex to type and match the commas,
etc. Essentially, we are writing one language in another and
this demanding but tedious.
It is conceptually quite simple, as is a lot of programming,
but the opportunities to make a mistake abound.
Unbalanced parentheses, missing commas, etc.
are common. If a local function deals with an arbitrary number
of arguments and puts these into the command, it must explicitly
loop, check whether we put a comma or are we at the final argument,
etc.
-
A far more important and technical problem arises in how we represent
data from the local system in the command as a string. For example,
suppose we have a <java:class>javax.swing.Window</java:class> in Java
and need to pass it to R. Clearly, there is no fool-proof way to put
this into the command represented as a string. One can ask what good
it would do R to have access to this. Firstly, if R can call Java
methods, then the R code might invoke a method on the Window object as
part of its actions. More straightforwardly, R might simply pass the
Window object to another Java method. If we cannot pass the Window
object to the R function, we would have to have written the original
Java code to put the object into a global variable and have the other
Java method look there. So we might use global variables, rewrite the
original code, or alternatively implement some scheme for passing an
identifier to R which can be resolved back in Java by the relevant
Java methods. All of this greatly detracts from the use of the
inter-system interfaces. It limits the utility of the interface to
computations that involve data that can be represented as a string, or
it involves rewriting the existing code or using some ad hoc, homespun
techniques to manage the communication.
To be useful, we must be able to pass objects between the systems. And
for those objects that have no corresponding meaning in the remote
system, we must be able to pass a reference to the original object.
-
One of the major drawbacks of the interface that only allows commands
to be evaluated is how we return complex, i.e. non-primitive, values.
For example, suppose we use R to fit a linear model via the
lm function. Again, we have an issue of how we can
pass the data to R.
We can represent it explicitly in the command something like
R.evaluate("lm(y ~ ., data = data.frame(y = c(1.2, 3.2,...), x1 = c(...), x2 = c(...)))")
|
---|
Alternatively, we could create the dataset in R ahead of time
using several commands
R.evaluate("myData <<- data.frame()")
R.evaluate("myData$y <<- c(1.2, 3.2,...)")
R.evaluate("myData$x1 <<- c(...)")
|
---|
and so on. Which is more convenient depends on the context.
But note that we now have two scopes or evaluation contexts to worry
about: R's and the local one. The data lives in local variables
and we are explicitly creating R variables in the R session.
We have to be very careful to choose names that don't overwrite
existing data used in other routines that manage data in this
same way. Again, we are using global variables, albeit in
a different system, and have sacrificied modularity
and maintainability in our code.
With this issue aside, how do we return the result of the linear model
fit. Clearly there is no obvious way to bring it back with all its
meaning to the original system.
In some of the inter-system interfaces,
we know the two languages
involved, e.g. R and Perl.
In that case, we can use a default approach which is
to copy the elements of an R list, e.g. the lm object,
to a Perl object with the same elements. In this way,
we do a deep recursive copy.
This will get the data contents across but not necessarily
the semantics of the object.
In the case of COM, we don't have any information about
the client to which we are returning the value.
In this case, we must keep the data on the server in which it
was created and present it to the client in a suitable form.
In other words, we must create a COM object that provides the
client with access to the values.
To do this in S, we obviously must have a mechanism to
create a regulr COM object which can be used
to access the elements of the object and even call methods on it.
Without this facility, the client must be careful to ensure
that non-basic values are returned.
It must do this by assinging the results to R's global environment/workspace
and always returning a primitive value.
In the case of the lm
above
In some cases, we explicitly do not want to have a copy of it returned
to the original system. Instead, we must have a reference to the
object so that it can be updated in subsequent calls and shared across
these calls.
How do we do this? There must be a mechanism for creating a reference
to an S object and returning that reference as an value in the non-S
system. If we do this, the programmer in, say, Python can then ask
for the different elements in the S object, or invoke methods or
functions on that object. This model will allow us to perform all
types of computations and in a vary natural way. Instead of sending S
language commands to R which refer to variables in its global
namespace created as a result of previous expression, we can invoke
functions/methods on objects as we do on local objects in the non-S
language. In this way, the programmer doesn't have to know the S
syntax but can use its functionality directly as if it were
local. And importantly, the computations are not limited to
simple data types such as primitive values and arrays of primitive values.
S-Plus provides a mechanism for
exposing S functions as COM objects. The model (see
<cite></cite>)
for the client is that the caller populates the collection of relevant
arguments by assigning values to argument names (and specifying the
types) and then calls the
Run. This returns
a status value indicating whether the call was successful or not. If
it was, the result is available in the property
ReturnValue.
This is cumbersome way of programming.
For example, if we want to call the
rnorm
with the sample size and the standard deviation,
we do something like the following (untested) in Perl:
$f = Win32::OLE->new("S.rnorm");
$f->n = 100;
$f->sd = 20;
if($f->run()) {
@x = @{$f->ReturnValue};
}
|
---|
We always have to use the argument names.
And getting the return value is separate from
the act of invoking the function call.
Perhaps one of the most problematic aspects
of this model is that the COM object
representing the function has state.
We store the arguments in the object
and call it separately.
In most languages, function calls are atomic
actions, but this is not the case here.
This means that potentially two clients
could be accessing the object at the same time
and setting arguments and interfering with each other.
We don't even need two clients, just two calls, for this to happen.
Suppose that a client was to use a random value from
the Uniform distribution and the maximum value is
to be chosen from another Uniform distribution.
Imagine that we define a Python function for this:
def getUniform(n, min = 0, max = 1)
"Assume that runif is a global object that has been initialized as CreateObject('runif')"
runif.n = n;
runif.max = getUniform(1, max = 20)
runif.run();
x = runif.ReturnValue;
return(x);
}
|
---|
If
<python:var>runif</python:var> was a global variable
that was a reference to the S function, we would get ourselves
into trouble. We would set the desired sample size in the S object.
Then we call the Python function
<python:func>getUniform</python:func>
recursively to generate the maximum value for this call.
That second call, of course, resets the sample size
to 1 in its call. As a result, the original call to
<python:func>getUniform</python:func> will generate
only a single random value, not n values.
Obviously, the approach to get around this is
to have each call to
<python:func>getUniform</python:func>
create its own instance of the
runif
COM object. But we had better remember to do this or
we will see strange results.
And of course, if we could call the
runif's
Run
with arguments in the call, this would avoid the
"state" problem altogether. Of course, there would be other issues
in compiled languages, but that is a different issue.
As we have hinted earlier in this document, we recommend and have
implemented a different approach to the ones described above. And
ours is sufficiently flexible that one could implement both approaches
above using our mechanism and some simple S code. Rather than
exposing the S interpreter and having clients send S commands or
exposing a function object that provides an invocation mechansism that
involves setting state, we offer a mechanism that allows users to
expose regular COM objects (i.e. with methods and properties) that are
implemented in the S language. It handles arbitrary values both in
the arguments and return values and is customizable in how it
dispatches method invocations locally at the S-level.
Additionally, providing S-level COM objects allows us
to easily register for COM events, e.g. for Active X controls.
The RDCOMServer package provides this mechanism
and can be coupled with the RDCOMClient,
SWinRegistry and
SWinTypeLibs to give a comprehensive S-language
interface to COM that is bidirectional and overcomes many of the
limitations with the other approaches. It provides a rich, structured
programming approach to distributed computing on Windows. A very
similar CORBA interface is also possible which will hide the
differences between the Unix and Windows details.