Duncan Temple Lang

University of California at Davis

Department of Statistics


The R Code

Consider the following R function to perform a cumulative sum on a vector:

Cumsum =
function(x)
{
   if(length(x) > 1)        # guard against the case where x has length 1 or 0
     for(i in 2:length(x))
       x[i] = x[i-1] + x[i]

   x
}

We can check that it agrees with the regular built-in cumsum() function via

all(Cumsum(1:4) == cumsum(1:4))
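With the guard in place, the degenerate cases behave sensibly too:

Cumsum(numeric(0))   # numeric(0)
Cumsum(5)            # 5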

Now, we want to compile this. Perhaps a reasonable C version is

void
Cumsum(double *x, double *ans, int len)
{
    int i;
    ans[0] = x[0];
    for(i = 1; i < len; i++)
	ans[i] = ans[i-1] + x[i];
}
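As an aside, if we compiled this C code in the usual way, we could call it through R's .C() interface. Note that .C() passes every argument as a pointer, so the routine would need to take int *len rather than int len. A minimal sketch, assuming the adapted routine is saved in a file cumsum.c:

system("R CMD SHLIB cumsum.c")                    # compile into a shared library
dyn.load(paste0("cumsum", .Platform$dynlib.ext))  # load it into this R session
x = as.numeric(1:4)
.C("Cumsum", x, ans = numeric(length(x)), as.integer(length(x)))$ans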


Note that we pass in a pointer to the original x and a separate vector ans into which the cumulative sums will be written.

Compiling the Code

So we want to use Rllvm to build this routine by generating LLVM code. We load the package and initialize the native target so the JIT can generate code for this machine.

library(Rllvm)

InitializeNativeTarget()

We create a module and declare the function by specifying the return type and the parameter types.

mod = Module("Cumsum")
ptrDouble = pointerType(DoubleType)
fun = Function("Cumsum", VoidType, c(x = ptrDouble, ans = ptrDouble, len = Int32Type), mod)

Note that we create an intermediate variable, ptrDouble, to represent a pointer to a double. We also named the parameters in the third argument, i.e. x, ans and len. We could do this separately with

names(fun) = c("x", "ans", "len")

Since we want to reference the variables, we'll get them as a list of parameter objects. Alternatively, we can treat the Function object as a kind of list and access the parameters via [[, e.g. fun[["x"]]. But we'll use the getParameters() approach for now.

params = getParameters(fun)

So now we can start to construct the body of the routine. We need an entry block in which we will create and initialize local variables:

entry = Block(fun, "entry")

We'll also create an IRBuilder to manage the construction of the instructions.

ir = IRBuilder(entry)

We'll need a local variable for i.

iv = ir$createLocalVariable(Int32Type, "i")

We also need local variables to reference x, ans and len by their addresses.

xref = ir$createLocalVariable(ptrDouble, "x_addr")
ans.ref = ir$createLocalVariable(ptrDouble, "ans_addr")
len.ref = ir$createLocalVariable(Int32Type, "len_addr")

Next we initialize these variables. (We could equally well initialize them in the same order we create them, combining each allocation with its store; moving the corresponding C++ initialization code, then compiling and verifying the module, suggests this works.) We set i to 1.

ir$createStore(1L, iv)

Then we get the address of the parameters and set the local variables that refer to these.

ir$createStore(params$x, xref)
ir$createStore(params$ans, ans.ref)
ir$createStore(params$len, len.ref)

So now we are ready to do some computations. The first step is to set ans[0] = x[0]. We load the pointer held in x_addr, compute the address of its first element, and load the value of x[0]:

a = ir$createLoad(xref)
b = ir$createGEP(a, 0L)
x0 = ir$createLoad(b)

GEP stands for "Get Element Pointer"; see http://llvm.org/docs/GetElementPtr.html. Then we store this value in ans[0]:

a = ir$createLoad(ans.ref)
b = ir$createGEP(a, 0L)
ir$createStore(x0, b)
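This load-GEP-load pattern recurs throughout the body, so in larger functions one might wrap it in a small R helper. This is a hypothetical convenience, not part of Rllvm, composed only of the builder methods used above:

# hypothetical helper: load element 'idx' of the array whose address is held in 'ref'
loadElement = function(ir, ref, idx)
   ir$createLoad(ir$createGEP(ir$createLoad(ref), idx))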

At this point we are ready to jump into the loop. We branch to a block that tests the loop condition and, based on the result, jumps either to the loop body or to the return block. So we need to create these three blocks: the condition test, the body and the return.

cond = Block(fun, "loopCondition")
ret = Block(fun, "return")
body = Block(fun, "loopBody")

Now we branch to the condition, unconditionally.

ir$createBr(cond)

We are now working on that block, so we want the new instructions to be added there.

ir$setInsertPoint(cond)

To test the loop condition, we load the values of i and len and then compare them:

a = ir$createLoad(iv)
b = ir$createLoad(len.ref)
ok = ir$createICmp(ICMP_SLT, a, b)
ir$createCondBr(ok, body, ret)

ICMP_SLT stands for "Integer comparison, Signed Less Than".
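Other comparison predicates follow the same naming scheme. For instance, an inclusive test i <= len would presumably use ICMP_SLE (Signed Less than or Equal), assuming Rllvm exports that constant alongside ICMP_SLT:

ok = ir$createICmp(ICMP_SLE, a, b)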

The return block is very simple. We return nothing, so we create a void return:

ir$setInsertPoint(ret)
ir$createRetVoid()

So now we focus on the body of the loop:

ir$setInsertPoint(body)

The loop body is also very simple. We fetch the values of ans[i-1] and x[i], add them together and store the result in ans[i]. This involves many low-level steps. First we load i and subtract 1 from it. Then we load the pointer ans, sign-extend the 32-bit index to the 64 bits that GEP expects for pointer arithmetic on a 64-bit platform, compute the address of element i-1 and load it:

a = ir$createLoad(iv)
b = ir$binOp(Sub, a, ir$createConstant(1L))
r = ir$createLoad(ans.ref)
idx = ir$createSExt(b, 64L)
ans.i1 = ir$createLoad(ir$createGEP(r, idx))

So now we have ans[i-1]. Next we load the value of x[i], again sign-extending the index:

a = ir$createLoad(xref)
i = ir$createLoad(iv)
idx = ir$createSExt(i, 64L)
xi = ir$createLoad(ir$createGEP(a, idx))

Now we can add these two values, xi and ans.i1, with a floating-point add:

tmp = ir$binOp(FAdd, ans.i1, xi)

Now we need the address of ans[i] so that we can store the value of tmp there. The following computes that address:

x = ir$createLoad(ans.ref)
i = ir$createLoad(iv)
i = ir$createSExt(i, 64L)
ans.i = ir$createGEP(x, i)

Now we store the value from the tmp expression into ans[i]:

ir$createStore(tmp, ans.i)

That completes the computation in the body of the loop. Next, we increment i and branch back to the condition block:

i = ir$createLoad(iv)
inc = ir$binOp(Add, i, 1L)
ir$createStore(inc, iv)

ir$createBr(cond)
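The function is now complete. Before running it, it can be useful to print and check the generated IR. As a sketch, assuming Rllvm exposes LLVM's printing and verification facilities as showModule() and verifyModule():

showModule(mod)     # print the generated IR for visual inspection
verifyModule(mod)   # ask LLVM to check that the module is well formed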

Calling the Cumsum routine

Now that we have generated the routine, we can call it via the run() command. (We would also like to be able to access it directly via the .C() or .Call() interface, or via Rffi.)

x = as.numeric(1:4)
ans = numeric(length(x))
tt = run(fun, x = x, ans = ans, len = length(x), .all = TRUE)
tt$ans
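We can verify the result against the built-in implementation:

all(tt$ans == cumsum(x))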

Now let's do a quick test of speed. We'll use the interpreted version Cumsum(), the built-in cumsum() implemented in C, and our LLVM-generated routine.

N = 1000000
x = as.numeric(1:N)
ans = numeric(length(x))

a = system.time({Cumsum(x)})
b = system.time({cumsum(x)})
c = system.time({run(fun, x = x, ans = ans, len = length(x), .all = TRUE)$ans})

We should use the measurements from the second and subsequent invocations of fun to remove the overhead of the JIT. (That overhead is real and important, but it belongs with the cost of generating and optimizing the code, not in a comparison of the execution speed of the fully compiled routine.) So we get a speed-up of about 26 over the interpreted version. We are still significantly slower than cumsum(). Is this due to optimization by the C compiler? Remember that cumsum() is actually doing more, dealing with NA values, etc.
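A simple way to separate the two costs is to make a throwaway call that triggers the JIT and then time only the subsequent invocations, using the same run() call as above:

run(fun, x = x, ans = ans, len = length(x), .all = TRUE)                # warm-up: includes JIT compilation
system.time(run(fun, x = x, ans = ans, len = length(x), .all = TRUE))   # times execution only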

Optimization

We can ask LLVM to optimize the generated code. We create an ExecutionEngine for the module and run the optimizer over it:

ee = ExecutionEngine(mod)
Optimize(mod, ee)

We first check, on a small input, that the optimized routine still computes the correct result:

N = 10
x = as.numeric(1:N)
ans = numeric(length(x))

val = run(fun, x = x, ans = ans, len = length(x), .all = TRUE, .ee = ee)$ans
all.equal(val, cumsum(x))

Then we repeat the timing comparison on the large vector:

N = 1000000
x = as.numeric(1:N)
ans = numeric(length(x))

interp = replicate(10, system.time({Cumsum(x)}))
native =  replicate(10, system.time({cumsum(x)}))
ll = replicate(10, system.time({run(fun, x = x, ans = ans, len = length(x), .all = TRUE, .ee = ee)$ans}))
rowMeans(interp)/rowMeans(ll)

The results on my Mac (under some additional load) give the following; these are older results, probably from LLVM 2.7 or 2.8:

 user.self   sys.self    elapsed 
216.012780   6.666667 164.025822 

Let's compare with the byte-code compiler. We'll first compare the run-time of the byte-compiled function with that of the interpreted version.

library(compiler)
cCumsum = cmpfun(Cumsum)
compiled = replicate(10, system.time({cCumsum(x)}))

(rowMeans(interp)/rowMeans(compiled))[1:3]
 user.self   sys.self    elapsed 
  3.973048        Inf   3.975758 

Now we compare the byte-compiled code and the LLVM-compiled code:

(rowMeans(compiled)/rowMeans(ll))[1:3]
 user.self   sys.self    elapsed 
39.8798587  0.3333333 33.6428571 

Curiously, on eeyore, an Ubuntu Linux machine, we see a much smaller gain when comparing interpreted to LLVM-compiled code. The interpreted code runs about 30% faster on the Linux box, while the LLVM-compiled code runs at about a tenth of the speed it achieves on OS X.

print((rowMeans(interp)/rowMeans(ll))[1:3], digits = 3)
user.self  sys.self   elapsed
  23.8803    0.0435   20.5082

This may be due to