Compiling GPU Kernels in R Rllvm, Rnvvm and RCUDA

Compiling GPU Kernels in R
Rllvm, Rnvvm and RCUDA

Duncan Temple Lang

University of California at Davis

Department of Statistics

The idea here is to create a very, very simple kernel to run on a GPU. We do this by creating individual instructions using Rllvm. When we have defined the routine, we use the libnvvm library via the Rnvvm package to transform the LLVM IR code to PTX code. We can also generate the PTX code directly within LLVM. We can then load this PTX code into the R session using the RCUDA package and invoke the kernel. This simple example illustrates all of the steps we need to compile more complex R code as GPU kernels that we can then run directly from the R session.

	Note
	libnvvm is available in the current release candidate version of the CUDA SDK, namely 5.5. Rllvm, RCUDA and Rnvvm are available from github and also the Omegahat repository.

Generating the LLVM IR code

We start by creating the LLVM instructions to define our kernel. The kernel we want to implement is intentionally very, very simple and corresponds to the CUDA code

void kern(int N, int *out)
{
   int idx = blockIdx.x * blockDim.x + threadIdx.x;
   if(idx < N)
     out[idx] = idx;
}

This takes an array of integer values and each thread sets its (idx) element to idx and we end up with 0, 1, 2, ...., N - 1 in out.

We could create the IR code by hand or generate this C (C programming language) code and compile it. However, we want to illustrate how to do this generally and be able to compile R-like code to a kernel. For this, we use Rllvm. We load that library and also some simple utility functions that will simplify our code generation and are generic for LLVM Functions that we will convert to PTX via libnvvm.

library(Rllvm)
source("nvvmUtils.R")

We can now start to create our kernel. We create a Module. Since we know we are targetting PTX code and libnvvm, we call ModuleForNVVM() to create an enhanced Module:

m = ModuleForNVVM("ptx kernel")

This function sets the data layout string on the module and also registers the special PTX register accessor routines so that we can use them in our code. These are the routines that correspond to accessing the x, y, z components of threadIdx, blockIdx, gridIdx, blockDim, gridDim

With the module created, we define our new Function() which will become or GPU kernel.

fun = simpleFunction("kern", VoidType, n = Int32Type, out = Int32PtrType, mod = m)
ir = fun$ir
localVars = fun$vars
fun = fun$fun

We have used simpleFunction() in order to simplify creating the IRBuilder, the initial BasicBlock and also to create local variables corresponding to the parameters. We could use Function() directly.

In order to be able to use this routine as a GPU kernel, we need to indicate that it is a kernel and not a device or host routine. We do this by adding metadata to the module that identifies this as a kernel. We do this with

  # declare that this is a PTX kernel
setMetadata(m, "nvvm.annotations", list(fun, "kernel", 1L))

We can define multiple kernels in the same module. See http://llvm.org/docs/NVPTXUsage.html for more information.

We can now focus on implementing the routine. The first step is to create

   int idx = blockIdx.x * blockDim.x + threadIdx.x;

The idea is that we will compute the index for this thread and put that in a local variable idx. We create the right-hand side

blockId = ir$createCall(PTXRegisterRoutines[["llvm.nvvm.read.ptx.sreg.ctaid.x"]])
blockDim = ir$createCall(PTXRegisterRoutines[["llvm.nvvm.read.ptx.sreg.ntid.x"]])
mul = ir$binOp(Mul, blockId, blockDim)
threadId = ir$createCall(PTXRegisterRoutines[["llvm.nvvm.read.ptx.sreg.tid.x"]])
idx = ir$binOp(Add, mul, threadId)

The important aspect of this is that we are accessing threadIdx.x, for example, via the PTXRegisterRoutines and the oddly named elements llvm.nvvm.read.ptx.sreg.tid.x. When we build a compiler for creating GPU kernels, we'll allow the code to use threadIdx$x and map these expressions to calls to the corresponding register routine.

We can now create the local variable and initialize it with the value of the right-hand side:

i = ir$createLocalVariable(Int32Type, "idx")
ir$createStore(idx, i)

Note that we are using the term idx in different ways here. In the first expression, we are using it as the name in the LLVM code. In the second expression, we are referring to the R variable assigned in the previous block of code that contains the addition of the two terms.

Our next step is to check if the value of idx is less than our parameter

. To do this, we need to create different blocks and conditionally branch to the appropriate block. One block will assign the value to the appropriate element in our array and jump to the end. The other block will simply exit the kernel routine. We create these blocks and the condition branch with

set = Block(fun, "set")
end = Block(fun, "return")

cond = ir$createICmp(ICMP_SLT, i, localVars$n)
ir$createCondBr(cond, set, end)

We now implement the block that assigns the value to the array. We ensure we are adding code to this block and then use a GEP instruction to access the relevant element of the array.

ir$setInsertBlock(set)
gep = ir$createGEP(ir$createLoad(localVars$out), ir$createSExt(ir$createLoad(i), 64L))
ir$createStore(ir$createLoad(i), gep)
ir$createBr(end)

The final command branches to the final block. We could have, alternatively, added an explicit return here.

We can finish the code by adding a simple return to the final block:

ir$setInsertBlock(end)
ir$createReturn()

It is always a good idea to verify that the code in the module is valid:

verifyModule(m)

Converting the IR code to PTX

The next step in getting the code onto the GPU is to convert the IR code to PTX code. We can do this directly with LLVM. (See llvmPTXUtils.R in this directory.) However, libnvvm does additional processing that gets the code above to work. We can use libnvvm in R via the Rnvvm package. We get the IR code as a string and fix it up to remove LLVM attributes that libnvvm doesn't currently understand. (This may be due to different versions of the LLVM IR format.) To convert the code, we use generatePTX() :

library(Rnvvm)
code = showModule(m, TRUE)
code = fixPTXCodeForNVVM(code)
ptx = generatePTX(code, isFile = FALSE)

The generatePTX() function in the Rnvvm is a high-level function that uses the lower-level routines in the libnvvm API. The result is a string containing the entire PTX code corresponding to our LLVM module. We now have what we need to load onto the GPU.

Using the kernel on the GPU

The final step is to load the PTX code and invoke it from R. For this, we use the RCUDA package and cuModuleLoadDataEx()

library(RCUDA)
cuda.mod = cuModuleLoadDataEx(ptx)

We can now invoke the kernel, passing it an array of integers and the number of elements it contains. We'll pass an R vector that has fewer elements than the number of GPU threads we run. This will exercise the condition in our code. We'll run 32 x 32 threads. We'll pass a vector with 100 fewer elements:

n = 32^2
N = as.integer(n - 100L)

We invoke this with

out = .gpu(cuda.mod$kern, N, ans = integer(N), outputs = "ans", gridDim = 1L, blockDim = c(32^2))

We test the result is as we expect with

stopifnot(identical(out, (1:N) - 1L))

An alternative approach

We experienced a problem with libNVVM when converting particular IR code to PTX. The problem manifested itself as

Error: R_auto_nvvmCompileProgram NVVM_ERROR_COMPILATION ( NVVM_ERROR_COMPILATION )
(0) Error: unsupported operation

We haven't fully explored the reason for the problem, but by experimenting with changes in the code, it appears the problem comes from computing potentially large indices from the thread, block, grid dimensions and indices to index an array. If we reduce the computations by removing a multiplicative term in the index, the code appears to work. So this lead us to also pursue our original strategy which is to use LLVM to generate the PTX code from the IR code.

LLVM provides several backends for which we can generate code from the common IR code. There are different backends for different processor types. There is also a backend for generating the C++ (C++ programming language) code that can be used to create the IR code again. (This is equivalent to the R code we use to generate the IR.) Another backend is NVPTX, or Nividia PTX. Basically, given an LLVM Module, there are a series of steps in LLVM that allow us to emit the PTX code for that Module as a string. To facilitate switching between using this approach or generatePTX() in Rnvvm, we provide a generatePTX() function in the Rllvm package. Currently, the inputs are slightly different, but this will change in the future as we integrate and move code in both Rnvvm and RLLVMCompile.

libNVVM does some code transformations as it generates the PTX code for us. Some of these are necessary to obtain results and so we have to deal with these when generating the IR code from within R before we generate the PTX code using only LLVM. The most important of these relates to parameters which are used to transfer the results from the kernel back to the caller. These are pointers/arrays which the kernel routine modifies, e.g.

out

in our kernel. We must explicitly identify these as being in the global address space on the device and not local or shared variables. The multi-level memory system on a GPU is quite different from the flat memory used by a CPU. Rather than specifying the parameter type as, say, Int32PtrType, we must create a pointer to the Int32Type that is in address space number 1. We do this with

ty = pointerType(Int32Type, addrspace = 1)

We can work with this as we would a regular pointer type. However, LLVM will generate PTX code that uses st.global-style operations when assigning to it.

The code in explorations/ptx_direct_range.R illustrates how to generate IR code from beginning to end. We'll discuss how to generate this code via a higher-level compiler below.

Next steps - High-level compilation

We clearly don't want to be writing R code to create each and every instruction. Instead, we want to be able to write the code for the kernel in a higher-level language and have an R function compile that to PTX, generating the IR code for the different implicit instructions. We would like to be able to write our kernel as something like

kern =
fnunction(N, out)
{
   idx = blockIdx$x * blockDim$x + threadIdx$x
   if(idx < N)
     out[idx] = idx
}

This is simple R code that won't run. There is no blockIdx or blockDim. However, we can compile it with the RLLVMCompile package. We have implemented a proof-of-concept for compiling a simple R-like function as a GPU kernel routine. We can use this with

globalInt32PtrType = pointerType(Int32Type, addrspace = 1)
fun = compileGPUKernel(kern, list(N = Int32Type, x = globalInt32Type), 
                       .zeroBased = c(idx = TRUE) 
                      )

We then can convert the module to PTX, as above, with

ptx = generatePTX(fun)

and load it onto the GPU with loadModule() .

The compiler recognizes expressions such as blockIdx$x and transforms those to calls to special intrinsic functions. We also extended the compiler to allow the user to specify which variables should be treated as-is for subsetting and not have 1 subtracted from their value. This is the .zeroBased parameter. We can also specify the types of local variables rather than relying on the compiler to use the type of their initial value. This allows us to 64-bit integers for some computations if we need.

The compilation for GPU code in compileGPUKernel() is very basic at present. It does illustrate how one can customize the basic compilation mechanism and adapt it to different computational models. An example is in explorations/gpu.R.