Using the RGCCTranslationUnit package

Duncan Temple Lang
Department of Statistics, UC Davis

This document provides an introduction to using the
RGCCTranslationUnit package. It is a work in progress and the API will
probably change slightly. However, the basic facilities will remain
approximately the same. Most likely, higher level facilities will be
added that combine the calls to some of the different facilities into a single
action.
Installing the package
Firstly, you will have to install the
RGCCTranslationUnit package.
And if you want to read a translation unit file, you will need
RSPerl. One can use the package without needing Perl
if you are working with objects that were stored from a previous R
session on a different machine. That is why the package does not
explicitly depend on RSPerl.
When installing RSPerl, one should instruct it to provide
dynamic loading support for the Perl modules named
Socket, IO, Fcntl, POSIX and Storable.
Do this with
R CMD INSTALL --configure-args='IO Socket Fcntl POSIX Storable' RSPerl
When you load RGCCTranslationUnit and try to parse a tu file using
parseTU, you may get errors or even a crash of R (which
relates to not having dynamic loading support for the modules mentioned
above). See the FAQ.html file on the Web site or in the package for
some suggested solutions.
We have done almost all our work and testing with .tu files generated
by GCC version 3.2.2. Using tu files created with version 4.1.0 of
GCC will generate lots of output when using parseTU(). These are
warnings about particular elements not processed by the Perl parser
module. These will be fixed soon.
The output in the tu files differs between these versions
of GCC, but when we process them using our tools into higher-level
objects, things seem to be the same for the code we have tested.
More details are necessary.
One thing to note is that version 3.2.2 generates files named
foo.c.tu and foo.cpp.tu, i.e. by appending .tu to the
name of the input source file. Version 4.1.0 appends .t00.tu, e.g. foo.c.t00.tu.
Creating the translation unit file
If you want to read the contents of source
code from C/C++ or header files,
you will need to generate the translation unit
from those files.
You can generate the .tu file from each
.c/.cpp/.cc/.h file or, if you want,
generate one large tu file for all of them.
To do the latter, create a new C/C++ file
whose content simply consists of
#include "firstFile.c"
#include "secondFile.c"
where you use the file names of interest to you.
Now that we have the relevant source file, say foo.c,
we generate the .tu file using
gcc -fdump-translation-unit -c foo.c -o /dev/null
Note that I have told the compiler to write the object file (.o) to
/dev/null, i.e. not to create it. (It does no harm if you do.) You
will need to add any additional compilation flags such as include
directories with -I, and defines with -D. I typically put a rule in a
GNUmakefile to generate the .tu file as these flags are already present
for doing the regular compiling.
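The rule described above can be sketched as a pattern rule in a GNUmakefile. This is a minimal sketch; the variable names (CC, CPPFLAGS, CFLAGS) are the conventional make variables and are assumed to hold your usual -I and -D options.

```make
# Hypothetical pattern rule: generate foo.c.tu from foo.c
# (GCC 3.x naming; version 4.1.0 would produce foo.c.t00.tu instead).
%.c.tu: %.c
	$(CC) $(CPPFLAGS) $(CFLAGS) -fdump-translation-unit -c $< -o /dev/null
```

With the appropriate dependencies in place, make will regenerate the .tu file only when the corresponding source file changes.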
Even if you get an error from the compiler about the content of the
source code, the compiler may still have generated the .tu file. If
the error is a syntax problem, the .tu file will not be generated.
But if the compiler has read the entire file and is only giving errors
about the validity of the meaning of the C/C++ code, then it will have
dumped the .tu file before it reached this processing stage.
Having created the .tu file, we only need to revisit this step
if the source code changes. And if we have the appropriate
dependencies in our makefile, the .tu file will only be generated
when these have changed and you call make.
Now that we have the .tu file for input to our processing,
we start an R session and load the RGCCTranslationUnit package.
library(RGCCTranslationUnit)
Next, we use parseTU to read the .tu
file:
p = parseTU("foo.c.tu")
or foo.c.t00.tu if using a more recent version of GCC.
The value in p is a strange thing.
It is an R object that actually identifies a Perl
object. If you call class
on this object, you will see that it is
a reference to a Perl object, in fact a Perl
array (ordered list) and, most specifically,
is a GCC::TranslationUnit::Parser
class(p)
[1] "GCC::TranslationUnit::Parser" "PerlArrayReference"
[3] "PerlReference"
The elements in this Perl array are the nodes in the tu
graph. And there will be lots of them:
.PerlLength(p)
You can access the elements by position, e.g.
p[2]
Note that the first one is a "dummy" node and of no interest to us.
For those of you who are unfortunate enough to be in any way familiar with the
format of the .tu file, you can use the node ids (i.e. the number
after the @, e.g. 123 for @123) to identify the node
p[["123"]]
Note that all the nodes are also R objects that refer to Perl objects.
And all of them have classes indicating the type of node in the tree,
e.g. GCC::Node::namespace_decl, GCC::Node::record_type,
GCC::Node::function_decl.
Most of the nodes are actually hash tables in Perl, i.e.
like named lists in R. You can find out what elements
a node contains using, e.g.
names(p[[3]])
If it helps to know, the names come from fields in the .tu file for
that node. Those "loose" values such as const or volatile are
accessible through method calls on the node,
e.g. p[[3]]$quals() or
via the R function getQualifiers. If you are not familiar
with the .tu file format, don't worry about all the details; you
can just think about what they might mean. For the most part, the
higher-level functions in the R package will remove the need for you
to know anything about the nodes.
Before we proceed, we should note that if you are working
with C code and not C++ code, but chose wisely to use
g++ to create the .tu file, then you should tell the parser
that it is really C code. We do this in R with
p = setLanguage(p, "C")
or alternatively, when reading the TU file
we can specify the language
p = parseTU("foo.c.tu", "C")
This does not affect the Perl object, just the R-side of things
in further processing.
These node references are intelligible, but still somewhat
low-level concepts. And certainly we don't want to be dealing with
the nodes of the graph and following the paths on this graph to work
with routines, global variables, data structure definitions and so on.
So we use the higher-level R functions that do this for us.
Typically, we will be interested in the routines defined
within the native source code. (I use the term routine to refer to
what some call a function in native code. We use function to refer to
R language functions. Unfortunately, the Perl parser and tu file refer to
GCC::Node::function_decl, to confuse matters.) We can use the function
getRoutines to get a list of the nodes that are
native routine declarations.
routines = getRoutines(p)
getRoutines provides a brief description of the
routine declarations. It identifies each node by its index and name,
and provides the identity of the nodes for the return type and the
parameters of the routine.
We do have the names of the routines accessible via
names(routines)
If you do this on your file, you will notice
that there are a lot more routines than you might expect.
And you may not recognize all the names, but
some will be familiar to you if you are a C programmer.
Names like fprintf, memcpy, strchr and so on are "system"
routines, i.e. those defined in header files provided by
your operating system or compiler. They are present because
of included header files, e.g. #include "filename"
in your original source code. The compiler has dumped everything
it could see at the code level so that we can make sense of it.
We usually don't want to deal with all of these routines
but rather limit ourselves to those in our source code files.
getRoutines has a files
parameter that can be given the names of files,
which are used to filter the function declaration nodes
returned.
Only the function declarations whose source attribute
corresponds to an entry in this vector will be returned.
So
routines = getRoutines(p, "msa.c")
will give those routines defined in msa.c.
And it will (currently) return more quickly than with no filter.
There is a problem with giving the file names on which to filter.
Specifically, the compiler only gives us the file name and not
the entire path to the file. Thus, it is impossible
for us to distinguish between files with the same name
but in different directories.
For example, suppose we have a local source code file named time.h.
Unfortunately, there is a system header file in
sys/ also named time.h.
We include time.h directly using
#include "time.h"
and it is found locally by the compiler.
Even though we do not include the system time.h
and there is no ambiguity, another system header file
might have a line
#include <time.h>
and then we have introduced two different files named
time.h. And only the file name will appear in the code.
So we sometimes have to then filter the set of nodes
returned by getRoutines further.
But we can do this in R using the usual subsetting
operators on the returned list since we have names on the elements.
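For instance, suppose (hypothetically) that our own routines follow a naming convention such as a my_ prefix, while the routines that leaked in from the identically named system header do not. We could then filter with ordinary subsetting; the prefix here is purely an assumption for illustration.

```r
routines = getRoutines(p, "time.h")
# Hypothetical: keep only the routines matching our own naming
# convention; the remainder came from the system time.h.
mine = routines[grep("^my_", names(routines))]
```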
There are tools in the package to help differentiate between
code from files with the same name.
By the way, note that there is a difference between a declaration and
a definition. The declaration node may have a reference to the actual
definition of the routine and on to its body. But if you used the
regular C compiler (gcc) and not g++ or if you are working from header
files rather than the complete source code, then you won't have the
definition but just the declaration. All the nodes returned by
getRoutines are function declarations, and some may
have a path to the body.
There are functions for finding other types of nodes. To find nodes
corresponding to "global" variables within the source code, we have
getGlobalVariables. To find enumerated constants
(enum) definitions, we use getEnumerations. And to
find declarations of data structures, e.g. structs, unions, typedefs,
the function getDataStructures searches the entire
collection of nodes. For working with C++ code, we can find the class
definitions using getClassNodes. (This is named as
such to avoid any idea that it is similar to getClass
in R.) Like the getRoutines function, each of these
accepts a files argument to filter the nodes
based on the source attribute.
These functions do not return as much information directly as
getRoutines. getEnumerations and
getGlobalVariables return just the index of each
enumeration declaration node. getClassNodes returns
the name/id and the index of the nodes.
getDataStructures returns the node object directly,
i.e. the PerlReference object. Regardless of
precisely what these functions return, they each give us information
about how to identify the nodes of interest. And what we want to do
is move from dealing with nodes to R-level data structures that
describe high-level entities within our native code. To do this, we
only need the starting or top-level node for that entity, and so
essentially all of these functions are returning us that information,
i.e. the index of the node of interest.
We go from this node to the R-level data structure
by resolving the node and following all its references
through the graph.
Resolving the types
So far, we have seen how to read the node graph into R
and find the nodes corresponding to different high-level
entities in the source code, such as routines,
global variables, data structures and classes.
Each of the functions returns the identity of the nodes
corresponding to the entity in the source code.
The identity is given by the node id or name, e.g. the "123"
in "@123", and also its position in the Perl array.
(The id and position are related as the position is the integer
value of the id less 1.)
Having these node identifiers is fine, but it is not what we want to
work with. We want an R object that describes a routine completely or
gives us the fields in a struct. To do this for a given node, we need
to traverse the graph starting at that node and collect all the
different pieces. You can do this yourself, but it is tedious and
somewhat complicated. The package provides a function
resolveType which attempts to do this for you.
We start with a node, say the first one from
routines = getRoutines(p, "msa")
We can call resolveType
giving it the first routine to resolve along
with the graph or array of nodes which will
be used to follow references to other nodes.
So
type = resolveType(routines[[1]], p)
gives us an R object of class
ResolvedNativeRoutine. This is different from the
NativeRoutineDescription we had in
routines[[1]] because this now contains all the
information about the return and parameter types. And importantly,
these type objects are entirely within R and do not need to refer back
to the graph/tu parser. It is possible that there are
PendingType objects within these. These arise in recursive
definitions when one type is being defined and refers to itself or
another type that is also being defined. Such objects know how to
resolve themselves when used, at least within the same R session. But
you will not have to return to the tu parser and resolve them
explicitly.
So we have resolved the type into a single, self-contained R object
without references to other nodes and types defined elsewhere.
This is the computational model in R - no references. And this is convenient
but can also make some computations on graphs complicated.
Let's look at the ResolvedNativeRoutine
object. It (currently) is an S3-style object.
It has the same elements as the NativeRoutineDescription,
i.e. the name, returnType, parameters and the INDEX of the defining node
should we want to return to it.
The name is
type$name
[1] "msa_reverse_data_segment"
The names of the parameters are
names(type$parameters)
[1] "data" "start" "end"
So far this is the same as if we hadn't resolved the
routines[[1]] object.
However, looking at the returnType field
we see
class(type$returnType)
[1] "voidType"
attr(,"package")
[1] "RGCCTranslationUnit"
This tells us that there is no return value from this routine,
i.e. it is declared as a void.
Note, in the NativeRoutineDescription,
this was the PerlReference object to the
Perl GCC::Node::void_type object.
If there were a non-trivial return type,
this would have been resolved and we would
have a description of that type, e.g.
an unsigned integer, a struct, etc.
What about the parameters?
Each parameter is also an S3-style list object
that has fields giving the name of the parameter (id),
the type and the default value, and
the node in the parse graph where it was defined.
We can find the class of the type of each parameter with
sapply(type$parameters, function(x) class(x$type))
data start end
"PointerType" "intType" "intType"
This tells us that the first parameter is a pointer and the
two others (start and end) are integers. We
have to look further to see what type the pointer
points to and whether the start and end parameters are simple
int values or more complex such as unsigned int, etc.
All this information is in the R objects within the type field
of the parameters. It is available to you to do whatever you want
with it. However, there are functions in the package that will do
some of the things you might want to do, such as creating
an R interface to the routine (see createMethodBinding).
We can resolve any node with resolveType
and indeed resolveType does this itself
as it processes the nodes recursively.
For example, we can find all the anonymous or unnamed/typedef'ed enumeration types
by looking for those enumerations that have strange, invalid C names
of the form "._" followed by digits (e.g. "._42"), which is how
gcc/g++ identifies them to us.
So we can do this using R functions such as grep to find
the enumerations with these odd names, such as
ee = getEnumerations(p)
grep("\\._[0-9]+", names(ee), value = TRUE)
Then we can resolve these and find the
mapping between the symbolic names and integer values.
We might be more interested in the named enumerations
in our files
e = getEnumerations(p, c("msa", "hmm"))
hmm_mode = resolveType(e[["hmm_mode"]], p)
This is an object of (S4) class
EnumerationDefinition.
The key slot is values
which contains a named integer vector
giving the definition of the enumeration
elements.
In our hmm_mode, we have
hmm_mode@values
VITERBI FORWARD BACKWARD
0 1 2
And we can do the same for the other nodes from the data structures
and classes. The key thing is that we give a node or node identifier
(node name or position) along with the parser/graph/node array itself.
For C++ classes, resolving the class
finds the fields and the name of the class as well
as the usual qualifiers such as scope.
It also includes the names of the base class(es)
and their node identifiers and the resolved methods
for this class alone. It does not include the
inherited methods, but these can be retrieved
by resolving the ancestor class nodes.
resolveType uses S4 methods to know what to do for
each type of node and will create an R object that describes what it
collects on its traversal of the graph. This makes it extensible.
Hopefully, the objects that are returned contain all the information
we need when processing the code in R. If not, please let me know. And
also, it is typically possible to recover more specific information
using the object returned as the identities of the sub-nodes are
typically included in the result.
Duplication and the DefinitionContainer
We can use lapply to resolve a list of routines,
e.g. returned from getRoutines. If you think about
this, there is likely to be a lot of redundant computations
involved. If two routines refer to the same type, the
resolveType will end up visiting that type node twice
(at least). By design, we will end up with separate copies of the
resolved type in each resolved routine, but we do not want to do the
processing multiple times. To avoid this, we pass a third argument to
resolveType and provide a
DefinitionContainer object which is used to manage
the resolved types. For the most part, you will never be concerned
with how this works. It is a good idea to create one and assign it to
an R variable and then pass it to resolveType
whenever you call it. But it is important to remember that one cannot
use a DefinitionContainer with a different tu
parser/graph/node array. The DefinitionContainer
uses the ids of the nodes in the tu graph to manage the types. So if
it is used with a different set of nodes, there will be a great deal
of confusion!
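A typical pattern, then, is to create a single container for the parser object and pass it to every resolveType call (DefinitionContainer(p) is used this same way later in this document). This sketch assumes routines is a list returned by getRoutines for the same parser p.

```r
types = DefinitionContainer(p)
# Resolve each routine, sharing the container so that a type
# referenced by several routines is processed only once.
resolved = lapply(routines, resolveType, p, types)
```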
The DefinitionContainer acts as a broker for
resolveType. When resolveType is
asked to resolve a node, it asks the container whether it has already
been resolved. If so, it retrieves that and simply returns the
previously computed value. If not, it asks whether that node is
currently being resolved, i.e. by a higher-level (in the call-stack
sense) call to resolveType. If so, it returns the
PendingType for that node, which is essentially a
promise to give you the type later. Otherwise, it goes ahead
and resolves the node and then tells the container the result
so that it will be available for future calls.
The DefinitionContainer is simply an R hash table
or environment and uses hashing to make lookups fast since there are
typically a lot of nodes.
Creating Interfaces to Native Code
One of the purposes of the RGCCTranslationUnit package is to be able
to create interfaces to C/C++ code for R. This involves creating R
functions that call C/C++ routines and methods which in turn call the
routines and methods in the original C/C++ code. So we create both R
and C/C++ wrapper code. We also create R classes to hold references
to C and C++ structures and classes. And we also create R classes that
mirror the definitions of C structures and unions so that we can copy
objects to and from C and R. This allows us to have objects that
persist across R sessions and also allows us to copy objects which may
be deleted in the C code. Since we are creating new C/C++ routines
that we call from R, we also need to create registration information
for these and add them to a NAMESPACE file so we can access them from
within R properly.
And finally, for each C++ class, we want to define a
new C++ class that allows the R user to extend that class and
implement methods using R functions.
There are various functions in the RGCCTranslationUnit package to
achieve these different aspects of generating code. Let's start with
the case that we have some C code, e.g. the msa.c file.
We assume that we have created and read the tu file describing
the code into R.
p = parseTU("msa.c.tu")
Now, we will focus on the routines.
We get the routines defined in the msa.c
file with
r = getRoutines(p, "msa.c")
The next step is to resolve these
with
types = DefinitionContainer(p)
msaRoutines = resolveType(r, p, types)
So the next step is to generate code for these
in both R and C.
createMethodBinding takes
one of these resolved routines and generates
R and C wrapper code for it
and provides additional information that can be used to register the
code. We loop over the routines and call this function with
bindings = lapply(msaRoutines, createMethodBinding)
And now we can write the R and native code to different
files.
We use writeCode to output the code
for either language as
this understands the structure of the bindings
object.
To write the R code, we give the command
writeCode(bindings, "R", file = "msa.R")
(The "R" here indicates that we want the R code, not the native
C/C++ code.)
We can pass a connection object as the
file argument rather than a simple file name.
This allows us to pass an already open connection so
that the content can be appended to that file.
We use the same approach to create the C source code file.
However, in this case, we have to arrange to add top-level
material that includes the appropriate header files.
We can do this in two ways.
The first is to open the file and write the material
ourselves and then pass writeCode
the already open connection.
con = file("Rmsa.c", "w")
writeLines(c(..., "", "", '"RConverters.h"'), con)
writeCode(bindings, "native", file = con)
close(con)
For this case, we only have include files to specify and we
can pass this vector to writeCode
via its includes argument:
writeCode(bindings, "native", file = "Rmsa.c",
          includes = c(..., "", "", '"RConverters.h"'))
At this point, you should hopefully be able to read the R code back
into R using source. Additionally, you might be
able to compile the C code with R CMD COMPILE
Rmsa.c assuming you have the relevant compilation flags set
in the Makevars file or in the PKG_CPPFLAGS environment variable. And
you can create a DLL/Shared Object with R CMD SHLIB
Rmsa.c and then load that into R with
dyn.load. Of course, we can put this code directly
into a package structure and INSTALL and then load it using
library.
Unfortunately, the code is not likely to be usable without some
additional steps. The C code may not compile cleanly or link at all
because of references in the generated code to routines that don't
exist. Firstly, we need to make certain we have a copy of the
RConverters.h file that we added as an include when generating the C
code. And we will want its sibling file RConverters.c also. In the
future, we will simply add a dependency in the DESCRIPTION file of our
new package to the RAutoGenRunTime package. But for now, simply copy
those files into the same directory as the newly generated Rmsa.c
file. You can find these files in the RAutoGenRunTime source
distribution.
But we have further issues that are specific to the code we generated.
For example, the wrapper for the routine
msa_new_from_file will have a call to the
routine R_copy_MSA_to_R. This performs a
"deep" copy of the MSA reference to create an R object. Firstly, we
have to create that routine and add it to Rmsa.c or another file that
we will compile and link with Rmsa.o. Additionally, we have to create
the R class definitions for MSA and MSARef.
Of course, we don't have to do this manually, but rather the
RGCCTranslationUnit package has code to do this.
We get the definition for
the MSA data structure.
We can do this using getDataStructures
and finding the node associated with its definition
and then resolving that node:
dataStructs = getDataStructures(p, "msa")
MSA = resolveType(dataStructs[["MSA"]], p)
Of course, we don't need to do this as
it must have already been resolved when we
resolved the routines. After all, the
msa_new_from_file routine must have resolved
it as it is (part of) the return type.
If we look in the DefinitionContainer that we used
to manage the resolved types, we will find it there:
types$MSA
and it is fully resolved.
Now that we have the definition of this type, we can
generate code to define R classes, copy values to
and from C, and provide element-wise accessors for the
fields.
We use the function
generateStructInterface
to do all of these things.
This calls
defineStructClass,
createCopyStruct,
and
createRFieldAccessors
for each of the different tasks.
We pass only the resolved type to the
generateStructInterface
as in
msa_iface = generateStructInterface(types$MSA)
This is an object of class CStructInterface and has
information about the class definitions (in classDefs), the generic
functions for the $ and $<- methods for the reference class
(i.e. MSARef), and coercion methods for converting between the C and R
versions of the type, i.e. from MSA to MSARef and from MSARef to MSA.
And of course it has the C routines that perform the copying
and provide access to the fields.
We can write the different elements of the
CStructInterface to
a file (or any connection) using
writeCode.
There are methods for this generic function
for handling this class and its sub-elements.
So
writeCode(msa_iface, "r", "/tmp/msa.R")
writeCode(msa_iface, "native", "/tmp/Rmsa.c")
will create the two files
with the generated code.
This defines two classes, a reference class
that holds a pointer to an object in C/C++
and an equivalent R-only class which has
slots parallel to the fields.
There is a constructor function for creating
a reference object, e.g.
ref = new_MSA()
We have access to fields in a reference object
using the $ method, e.g. ref$nseqs
and of course to the slots in the R-based object
obj@nseqs.
names called with a
reference object gives back the field names.
And we can
set fields in the reference object
with
ref$nseqs = 100
Note that the coercion is done for us.
So in this way, the reference appears like a list
in R, but it is important to remember
that changes to the underlying C-level object
will be seen in subsequent computations in R.
We can use as to copy
from one representation to the other
with
obj = as(ref, "MSA")
as(obj, "MSARef")
How do we know which data types are being used in the routines and
therefore for which we need to generate these class definitions and
routines? The simplest method is to resolve the routines you are
interested in and then look at all the data structure elements in the
DefinitionContainer you passed to the
resolveType call. This will contain
all the types that were encountered and needed
to be resolved. Therefore, these are the types you will need to have
code to support.
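Since the container is essentially an R environment (as noted earlier), one way to survey the types that were encountered is to list its contents. This is a sketch under that assumption, in the same spirit as the types$MSA access used above.

```r
# Assuming the DefinitionContainer behaves like an environment,
# list the names of all the types resolved along the way.
ls(types)
```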
And last, but not least, we need to generate code to handle
global variables and enumerated constants.
Manipulating the Source Code
In addition to being able to generate bindings, the
RGCCTranslationUnit package provides programmatic access to
very detailed information about C or C++ code. If you
have generated the tu files using the C++ compiler, g++, and are
working with the complete source code (i.e. .c) rather than the header
file and the declarations of routines, the function declarations in
the tu file will have a reference to their bodies. This allows us to
process the code in the routine and not just its declaration about
parameters and return type. There are numerous things we can do with
this information.
Call Graphs
The simplest is to find out what routines are
called from within a particular routine. In other words, starting with
a routine, say msa_read_fasta, we can find out what routines are
called in that code. And we can do this for numerous routines and
build a graph of what routine calls what other routines. This is a
very helpful tool for understanding code and also for doing statistics
on the structure of the software. This is the static call graph as
many of the calls to other routines may never occur as they might be
enclosed within conditional statements. The run-time call graph is
observed when the code is run and tells us about what routines were
actually called by another. It is interesting to compare these two
graphs.
The function getCallGraph computes the call
graph for a given routine.
We give it a routine description returned
from getRoutines
and it will recursively follow all the nodes in the
body of the routine and find all calls to other routines.
Note that it will not be able to handle calls via routine/"function"
pointers.
Returning to our msa.c code again, we can use getCallGraph
with the following code.
p = parseTU("msa.c.tu", language = "C")
routines = getRoutines(p, "msa.c")
calls = getCallGraph(p, routines$msa_new_from_file)
The result stored in calls is
very simple and is merely an integer vector
identifying the function_decl nodes in
our parser/graph that were called within the body of the
routine msa_new_from_file.
The names of the elements in the vector calls
are the names of the routines, and these are more interesting
to the human. However, we do often want to be able to follow those
routines and so having the node identifiers is convenient.
So
names(calls)
[1] "str_new" "msa_read_fasta" "la_to_msa"
[4] "la_read_lav" "ss_read" "fscanf"
[7] "die" "msa_new" "smalloc"
[10] "smalloc" "msa_alph_has_lowercase" "smalloc"
[13] "smalloc" "str_readline" "str_trim"
[16] "strcpy" "fscanf" "fgets"
[19] "isspace" "strcpy" "fgets"
[22] "isspace" "toupper" "isalpha"
[25] "die" "die" "str_free"
gives the names of all the routines that were called.
Note that there are duplicates, e.g. die and smalloc.
And the order is often suggestive of the order in which the routines
are mentioned in the code, but of course execution branches down
different if and while statements, so the order is not necessarily
meaningful.
Rather than looking at the raw names, we can see how often each
routine is called using
sort(table(names(calls)), decreasing = TRUE)
smalloc die fgets
4 3 2
fscanf isspace strcpy
2 2 2
isalpha la_read_lav la_to_msa
1 1 1
msa_alph_has_lowercase msa_new msa_read_fasta
1 1 1
ss_read str_free str_new
1 1 1
str_readline str_trim toupper
1 1 1
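Following the idea above of computing the graph for numerous routines, a sketch using only the calls shown in this document is the following; the edge-matrix construction is an assumption about how one might feed the result to a graph package.

```r
# Static call graph for every routine in msa.c.
# Each element is a named integer vector identifying the callees.
graphs = lapply(routines, function(r) getCallGraph(p, r))
# Caller -> callee edges as a two-column character matrix.
edges = do.call(rbind,
                mapply(function(caller, calls) cbind(caller, names(calls)),
                       names(graphs), graphs, SIMPLIFY = FALSE))
```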
Determining output parameters
As we mentioned above
(not yet!), a routine can have parameters that are meant
to return information and not act simply as inputs. These parameters
will be pointers or possibly references in C++. But of course, not
all pointers or references are actually used to return information.
The paramStyle parameter of
createMethodBinding allows us to indicate
which parameters are to be treated as inout- or out-style parameters,
and the values are included in the return value to R. Unfortunately,
a human has to specify which parameters are inout and out. There is
no convenient, portable way for the author of the code to indicate
this directly in the code although she might document it and then it
becomes easy for a human to find this information. In the absence of
that, however, one can read the code and follow the logic to find out
which parameters are modified and so are presumably out values. If we
can do this by eye, we can come up with at least ad hoc approaches to
doing this programmatically from the detailed description of the body.
The function getInOutArgs
is an initial attempt to do this.
We start by resolving the routine of interest, e.g.
p = parseTU("inout.c.tu")
r = getRoutines(p, "inout")
foo = resolveType(r$foo, p)
Then, we pass this routine object to getInOutArgs,
along with the parser nodes p.
getInOutArgs recursively processes all the code and
(attempts to) determine if the parameter or any of its fields was
assigned a value in the code. That would make it an out argument. It
is an inout argument if a value was accessed in the parameter not as
the left hand side of an assignment operation.
This can become complicated, so the code needs to be refined
and made more robust. But it does work on some trivial
and some real example code.
The result from the call
getInOutArgs(foo, p)
is a list containing the parameter
descriptions from the routine of those
mutable parameters that appear to be
modified in the code of the routine.
This currently does not follow calls to other routines
to determine whether the parameters are modified within them.
Global Variables
In addition to data structures and
routines, we also want access to global variables defined within the
native library. Global variables are bad, but constant ones are okay.
They act as symbolic constants. These const values are relatively
easy to deal with in R. For such variables that correspond to basic
types in R, e.g. integer, logical, character, numeric, it is obvious
that we can merely create parallel versions in R bound to an R
variable with the same name as in the native code. Since the
variables in the native code are constant, they will not change and
there is no need to synchronize the value of the corresponding R
variable. All that we need to do is compute the values and assign
them to R. We can do this when we load the package, or preferably
when we are installing the package as this computation only needs to
be done once, not each time the package is used.
Of course, if the original native code changes between R sessions,
we will have to recompute these values, but we will have to regenerate
the entire set of bindings, so our claim is still true.
We identify all the constant primitive global variables in the files of interest
and then create C/C++ code that is compiled into an executable.
This executable can be run when the R package is installed and it
generates var <- value R expressions
which can be put in a file as part of the R source code for the package.
We'll use the wxWidgets library (http://www.wxWidgets.org)
as an example.
We have the translation unit nodes in tu
and the names of the files containing the code in targetFiles
so that we can filter only those global variables and not the
ones from the system files.
We call the generator function with these inputs to produce
the C++ source file wxConstants.cpp.
Next, we compile this code
gcc -o wxConstants `wx-config --cflags --libs` wxConstants.cpp
adding the additional compilation and linker flags
we need.
Next, we run the executable, capturing its output:
./wxConstants > wxConstants.R
This creates the file wxConstants.R
containing a sequence of var <- value assignments.
And so we can load this into R with,
source("wxConstants.R")
or simply add it to the R directory
of the R package you are creating.
We run the executable in the configuration script
of the package to put the generated wxConstants.R file into the package's R directory.
Non-constant Global Variables
For non constant variables, we need to be able to ensure that we get
the current value of that variable. In the phast code, we have only
two global variables, both of which relate to regular expressions.
These are re_syntax_options and re_max_failures. The former has type
re_syntax_t and the latter is an int. Neither is const
and they can change within the code. (It appears
that re_max_failures is never modified within the code,
but it may be modified by code outside of this library that links to it.)
The "simplest" thing we do is to create an R object that is a
reference to the native variable, i.e. contains the address of that
variable. We can ask for the value of that object using the valueOf()
function in R and that dereferences the address and returns the current
value (either as a regular R object or as an external pointer). This
allows us to get and use the current value in arbitrary calls.
For example, we can get the value, call a routine
which will have a side effect of updating the value
and then get the new value with code something like:
valueOf(re_max_failures)
foo() # which changes the value of re_max_failures
valueOf(re_max_failures)
We could also use
as(re_max_failures, "integer")
or
as(re_max_failures, "numeric")
and this will call valueOf() and then coerce to the final target type.
Unfortunately, we are not provided the target type in the coercion
method, i.e. the to value.
So one has to use
as(valueOf(re_max_failures), "numeric")
We could introduce a hideous dereferencing syntax such
as x$'$' that is trivial to implement, but...
We should also be able to assign a value to it.
This is syntactically slightly awkward.
One might think that
re_max_failures = 100
should do the job. But of course, that will overwrite the current
value of the R object with the R value 100. It will not assign the
value to the native variable. We could arrange for this to work using
the RObjectTables package and having assignments to these mirrored
variables be assigned to the native variable. This is a nice
application of the RObjectTable mechanism and we may pursue it in the
future as time permits.
In the meantime,
valueOf(re_max_failures) = 100
could be used. We can also use RObjectTables to
dereference the value of one of these VariableReference objects when
it is accessed. The RObjectTable instance would then act as a broker
for accessing these values and pass the requests to access or set the
native variable as if it were an R variable.
If we want to make use of the value of the variable
in a call to a function, we can use valueOf,
e.g.
bar(valueOf(re_max_failures))
However, it would be more convenient and natural to write
bar(re_max_failures)
And when bar is a routine for which we have generated bindings, it is
more efficient and simpler to pass the R value of re_max_failures
directly as a reference and have it be dereferenced in the C code
rather than copy it from C to R and back. This is somewhat tricky to
handle portably with variables that are not themselves pointers but
rather actual literal data types, e.g. int, structs, ... So in the
meantime, we will arrange that there is a coercion method for the
different types via a call to valueOf. This converts a
VariableReference to its value. Unfortunately, it is currently
difficult to then ensure that that is the correct type.
So, we arrange for the dereferencing of a VariableReference object to
attempt to perform the dereferencing appropriately in the C code since
we know what the target type is in the general dereferencing, i.e. via
R_GET_REF_TYPE (macro) calls in C.
That's the general strategy and interface. There are some details that
need to be handled (e.g. how to set global variables,
and the case of a parameter being a struct
object rather than a reference, but these don't pertain specifically
to global variables.) So let's see how we can use the functions in
the RGCCTranslationUnit package to generate the
interface to the global variables.
We'll work with the globals example in the package
which has 3 global variables (a, aref and i),
one static variable (dummy), and a constant
double named x.
The tu file is in globals.c.tu.
p = parseTU("globals.c.tu")
gg = generateGlobalVariableCode(p, "globals.c")
writeCode(gg$consts, "c", "globalConstants.c", includes = '"globals.h"')
Note the escaped quotes within the value given for includes.
writeCode(gg, "r", "globals.R")
writeCode(gg, "c", "Rglobals.c", includes = c('"globals.h"', '"Rglobals.h"'))
Other "constants" - enums, defines
In addition to
regular top-level/global variables, we also have other types of
variables which are enumerations and preprocessor defines.
We can find all enumeration definitions within the code
using getEnumerations.
Customizing the Code Generation: Type Maps
For any
particular piece of code, you may know some specific information about
how to convert a C-level structure to R or vice versa. The generic
treatment as a reference may not be sufficient. So you will want to be
able to influence how the code is generated and the data types
marshalled to and from R. The concept of a "type map" is used for
this. This is merely a table, i.e. a list, in R with elements that
identify the particular type to which it refers and provide
information about how the different aspects of the marshalling process
are done. At present, there are three different
steps in marshalling.
The first is in the R function that is called by the R user
and involves coercing the input value to the appropriate
type. For example, if the C routine requires an integer
for the parameter x, there is a line in the R wrapper
function
of the form
x = as(x, "integer")
This ensures that the value has the appropriate representation when
passed to the native routine. Importantly, it also allows the user to
pass a value which is not an integer to the function without worrying
about the particular type. And it also allows us or the user to
provide methods for coercing objects of different types to an integer.
This makes the mechanism extensible to us and others.
The second step in marshalling is when the R value is converted
to its equivalent C type.
For instance, when we pass the integer object from R, this is
converted to an int in C with the code
int x;
x = INTEGER(r_x)[0];
This is built in to the bindings to access an R object which is expected to be an integer
object and map it to an int in this way.
However, we may want to change how this is done.
And the last step in the marshalling is to convert a value from C back
to R, e.g. the return value of a routine or the value of a field in a
struct. Again there is some C code that does
this and returns the appropriate value or leaves it in a suitable
variable.
The type map provides a mechanism to optionally control any of these
marshalling steps for a given native type. One specifies the target
type either by a simple C-like declaration, or via a more complete and
formal TypeDefinition object. The other elements
named coerceR, convertRValue, and convertValueToR relate to the three
steps respectively. A type map element in R is specified as a list
with a target element and any or all of these three actions. The
target is a string or TypeDefinition object and is
compared to each of the types for which we are trying to find a match in
the type map table. Each of the three action elements can be either a
simple string giving the name of an R function or C routine (as
appropriate), or an R function. If a string is given, this is simply
used in a call to convert the given R or C variable to the appropriate
value. All that the code generation mechanism does is take that
function name and create an invocation string with the given argument.
So if the string were "as.integer", we would see
as.integer(x). However, if we want a more
interesting call such as open(x, "r"), simply
specifying a single string will not suffice. So it is most general to
provide a function which will generate the relevant code string. Such
a function takes three arguments: the name of the variable being
processed, a named list of the parameter types, and the type map
object to be used in recursive processing. The function is expected
to return a string which can be inserted into the generated code. If
it is a simple string, some of the code generation functions will
append additional code such as assignments to local variables. One can
avoid this by returning, e.g. an object of class RCode, to bypass any
default processing.
A function is most general, but often, as with the open(x,
"r") example, we just want the variable name inserted as the
first argument. We can do this by providing not a string, but a character vector.
And there are two ways to do this.
One is to provide a character vector with NA values where one wants the
variable name inserted. For example, if we wanted
the result to be the string foo(x, length(x)),
we could provide as our type map converter a function of the form
function(name, parameters, typeMap)
{
paste0("foo(", name, ", length(", name, "))")
}
Or alternatively, we could provide the slightly simpler
c("foo(", NA, ", length(", NA, "))")
and avoid the function definition.
Since this is simply text substitution and there is no dynamic
computation involved, this is probably simpler.
And if we have a very simple case such as wanting to produce
open(x, "r")
which only involves a single insertion, we can provide
a character vector of the form
c('open(', ', "r")')
This is a character vector with just two elements.
The conversion code that uses the type map element
will insert the variable name between the two elements
and glue them together to form the desired string.
We have given the dry, "formal" description of these
type maps but they are easier to understand with an example.
We will consider the case where a C routine
takes a parameter FILE *.
This is a simple pointer to a FILE instance which
is obtained from opening a file in some format.
There is no analogous type in R that we can use.
Connections are analogous, but there is no API that allows us to access the underlying
FILE instance.
There are two straightforward strategies to map these from R types to C
when going through the R and C wrapper routines.
The two-step process gives us degrees of freedom in how this can be achieved.
The first approach is to convert the argument to the R function to the name of a
valid file. Then in the C code, we call a routine with the R object giving the file name and
return the desired FILE * instance.
We can do this with a type map as
typeMap( "FILE *" = list(target = "FILE *",
# Can be a string, e.g. asFILE, but then couldn't get the mode 'r' in.
coerceRValue = function(name, ...) paste("asFILE(", name, ", 'r')"),
convertRValue = "R_openFile"
))
Note that we specify the target type for which this element matches in its C declaration
format, "FILE *". We can give this as either the name of the type map element or
explicitly as the target entry within the type map element.
Then we specify the mechanism to coerce the R value to a file name
which is a call to a function asFILE which we will have to write.
This checks that the file exists and then expands the file name
(e.g. ~, etc.), which R accepts in filenames but C does not.
The conversion from this R type to the C-level type FILE * is given by
the convertRValue element. This is merely a call to a hand-written
routine that takes the R string and dereferences it and calls the
open routine to create and return the FILE *.
The typemap example contained in the package illustrates
this in the context of a simple routine
getLine. The directory contains all the
relevant code to generate the bindings and the results and is
"runnable".
For the routine declared as
char *getLine(FILE *)
we end up with the R wrapper function
getLine =
function( f ) {
f = asFILE( f , 'r')
.Call('R_getLine', f)
}
and C routine
SEXP
R_getLine(SEXP r_f)
{
SEXP r_ans = R_NilValue;
FILE * f ;
char * ans ;
f = R_openFile ( r_f ) ;
ans = getLine ( f ) ;
r_ans = mkString( ans ? ans : "") ;
return(r_ans);
}
Note the calls to asFILE
and R_openFile
that come from our type map.
So we need to define these two entities (see utils.R
and Rutils.c respectively.)
asFILE =
function(filename)
{
if(!file.exists(filename))
stop("No such file ", filename)
path.expand(filename)
}
SEXP
R_asFILERef(SEXP r_f, SEXP r_mode)
{
FILE *f;
const char *fileName;
fileName = CHAR(STRING_ELT(r_f, 0));
f = fopen(fileName, CHAR(STRING_ELT(r_mode, 0)));
if(!f) {
PROBLEM "cannot open file %s", fileName
ERROR;
}
return(R_MAKE_REF_TYPE(f, FILERef));
}
There are two things to note.
If we didn't want to define a new function
named asFILE, we could inline
its contents, i.e. the call to file.exists
and path.expand.
We would specify this in the typemap
as
coerceRValue = function(name, parameters, typeMap)
RCode(paste("if(!file.exists(", name, "))"),
paste("stop('no such file ',", name, ")"),
paste(name, "= path.expand(", name, ")")
)
Creating an RCode object means that the code that uses it
does not prefix the result with the assignment,
i.e. "f = ".
Alternatively, we could specify this using NAs for the locations to insert
the name.
structure(c("if(!file.exists(", NA, "))\n\tstop('no such file ', ", NA, ")\n", NA, " = path.expand(", NA, ")"),
class = "RCode")
The function form is more flexible.
The other two things to note are
i)
Dependencies Files