Chapter 5 System and foreign language interfaces
5.1 Operating system access
Access to operating system functions is via the R functions system
and system2
. The details will differ by platform (see the on-line help), and about all that can safely be assumed is that the first argument will be a string command
that will be passed for execution (not necessarily by a shell) and the second argument to system
will be internal
which if true will collect the output of the command into an R character vector.
On POSIX-compliant OSes these commands pass a command-line to a shell: Windows is not POSIX-compliant and there is a separate function shell
to do so.
The function system.time
is available for timing. Timing on child processes is only available on Unix-alikes, and may not be reliable there.
5.2 Interface functions .C
and .Fortran
These two functions provide an interface to compiled code that has been linked into R, either at build time or via dyn.load
(see dyn.load and dyn.unload). They are primarily intended for compiled C and FORTRAN 77 code respectively, but the .C
function can be used with other languages which can generate C interfaces, for example C++ (see Interfacing C++ code).
The first argument to each function is a character string specifying the symbol name as known117 to C or FORTRAN, that is the function or subroutine name. (That the symbol is loaded can be tested by, for example, is.loaded(“cg”)
. Use the name you pass to .C
or .Fortran
rather than the translated symbol name.)
There can be up to 65 further arguments giving R objects to be passed to compiled code. Normally these are copied before being passed in, and copied again to an R list object when the compiled code returns. If the arguments are given names, these are used as names for the components in the returned list object (but not passed to the compiled code).
The following table gives the mapping between the modes of R atomic vectors and the types of arguments to a C function or FORTRAN subroutine.
R storage mode C type FORTRAN type logical
int *
INTEGER
integer
int *
INTEGER
double
double *
DOUBLE PRECISION
complex
Rcomplex *
DOUBLE COMPLEX
character
char **
CHARACTER*255
raw
unsigned char *
none
Do please note the first two. On the 64-bit Unix/Linux/macOS platforms, long
is 64-bit whereas int
and INTEGER
are 32-bit. Code ported from S-PLUS (which uses long *
for logical
and integer
) will not work on all 64-bit platforms (although it may appear to work on some, including Windows). Note also that if your compiled code is a mixture of C functions and FORTRAN subprograms the argument types must match as given in the table above.
C type Rcomplex
is a structure with double
members r
and i
defined in the header file R_ext/Complex.h included by R.h. (On most platforms this is stored in a way compatible with the C99 double complex
type: however, it may not be possible to pass Rcomplex
to a C99 function expecting a double complex
argument. Nor need it be compatible with a C++ complex
type. Moreover, the compatibility can depends on the optimization level set for the compiler.)
Only a single character string can be passed to or from FORTRAN, and the success of this is compiler-dependent. Other R objects can be passed to .C
, but it is much better to use one of the other interfaces.
It is possible to pass numeric vectors of storage mode double
to C as float *
or to FORTRAN as REAL
by setting the attribute Csingle
, most conveniently by using the R functions as.single
, single
or mode
. This is intended only to be used to aid interfacing existing C or FORTRAN code.
Logical values are sent as 0
(FALSE
), 1
(TRUE
) or INT_MIN = -2147483648
(NA
, but only if NAOK
is true), and the compiled code should return one of these three values. (Non-zero values other than INT_MIN
are mapped to TRUE
.)
Unless formal argument NAOK
is true, all the other arguments are checked for missing values NA
and for the IEEE special values NaN
, Inf
and -Inf
, and the presence of any of these generates an error. If it is true, these values are passed unchecked.
Argument PACKAGE
confines the search for the symbol name to a specific shared object (or use “base”
for code compiled into R). Its use is highly desirable, as there is no way to avoid two package writers using the same symbol name, and such name clashes are normally sufficient to cause R to crash. (If it is not present and the call is from the body of a function defined in a package namespace, the shared object loaded by the first (if any) useDynLib
directive will be used.
Note that the compiled code should not return anything except through its arguments: C functions should be of type void
and FORTRAN subprograms should be subroutines.
To fix ideas, let us consider a very simple example which convolves two finite sequences. (This is hard to do fast in interpreted R code, but easy in C code.) We could do this using .C
by
void convolve(double *a, int *na, double *b, int *nb, double *ab)
{
int nab = *na + *nb - 1;
for(int i = 0; i < nab; i++)
ab[i] = 0.0;
for(int i = 0; i < *na; i++)
for(int j = 0; j < *nb; j++)
ab[i + j] += a[i] * b[j];
}
called from R by
conv <- function(a, b)
.C("convolve",
as.double(a),
as.integer(length(a)),
as.double(b),
as.integer(length(b)),
ab = double(length(a) + length(b) - 1))$ab
Note that we take care to coerce all the arguments to the correct R storage mode before calling .C
; mistakes in matching the types can lead to wrong results or hard-to-catch errors.
Special care is needed in handling character
vector arguments in C (or C++). On entry the contents of the elements are duplicated and assigned to the elements of a char
array, and on exit the elements of the C array are copied to create new elements of a character vector. This means that the contents of the character strings of the char
array can be changed, including to \0
to shorten the string, but the strings cannot be lengthened. It is possible118 to allocate a new string via R_alloc
and replace an entry in the char **
array by the new string. However, when character vectors are used other than in a read-only way, the .Call
interface is much to be preferred.
Passing character strings to FORTRAN code needs even more care, and should be avoided where possible. Only the first element of the character vector is passed in, as a fixed-length (255) character array. Up to 255 characters are passed back to a length-one character vector. How well this works (or even if it works at all) depends on the C and FORTRAN compilers on each platform (including on their options). Often what is being passed to FORTRAN is one of a small set of possible values (a factor in R terms) which could alternatively be passed as an integer code: similarly FORTRAN code that wants to generate diagnostic messages can pass an integer code to a C or R wrapper which will convert it to a character string.
It is possible to pass some R objects other than atomic vectors via .C
, but this is only supported for historical compatibility: use the .Call
or .External
interfaces for such objects. Any C/C++ code that includes Rinternals.h should be called via .Call
or .External
.
5.3 dyn.load
and dyn.unload
Compiled code to be used with R is loaded as a shared object (Unix-alikes including macOS, see Creating shared objects for more information) or DLL (Windows).
The shared object/DLL is loaded by dyn.load
and unloaded by dyn.unload
. Unloading is not normally necessary, but it is needed to allow the DLL to be re-built on some platforms, including Windows.
The first argument to both functions is a character string giving the path to the object. Programmers should not assume a specific file extension for the object/DLL (such as .so) but use a construction like
file.path(path1, path2, paste0("mylib", .Platform$dynlib.ext))
for platform independence. On Unix-alike systems the path supplied to dyn.load
can be an absolute path, one relative to the current directory or, if it starts with ‘~’, relative to the user’s home directory.
Loading is most often done automatically based on the useDynLib()
declaration in the NAMESPACE file, but may be done explicitly via a call to library.dynam
. This has the form
library.dynam("libname", package, lib.loc)
where libname
is the object/DLL name with the extension omitted. Note that the first argument, chname
, should not be package
since this will not work if the package is installed under another name.
Under some Unix-alike systems there is a choice of how the symbols are resolved when the object is loaded, governed by the arguments local
and now
. Only use these if really necessary: in particular using now=FALSE
and then calling an unresolved symbol will terminate R unceremoniously.
R provides a way of executing some code automatically when a object/DLL is either loaded or unloaded. This can be used, for example, to register native routines with R’s dynamic symbol mechanism, initialize some data in the native code, or initialize a third party library. On loading a DLL, R will look for a routine within that DLL named R_init_lib
where lib is the name of the DLL file with the extension removed. For example, in the command
library.dynam("mylib", package, lib.loc)
R looks for the symbol named R_init_mylib
. Similarly, when unloading the object, R looks for a routine named R_unload_lib
, e.g., R_unload_mylib
. In either case, if the routine is present, R will invoke it and pass it a single argument describing the DLL. This is a value of type DllInfo
which is defined in the Rdynload.h file in the R_ext directory.
Note that there are some implicit restrictions on this mechanism as the basename of the DLL needs to be both a valid file name and valid as part of a C entry point (e.g. it cannot contain ‘.’): for portable code it is best to confine DLL names to be ASCII alphanumeric plus underscore. If entry point R_init_lib
is not found it is also looked for with ‘.’ replaced by ‘_’.
The following example shows templates for the initialization and unload routines for the mylib
DLL.
#include <R_ext/Rdynload.h> void R_init_mylib(DllInfo *info) { /* Register routines, allocate resources. */ } void R_unload_mylib(DllInfo *info) { /* Release resources. */ }
If a shared object/DLL is loaded more than once the most recent version is used.119 More generally, if the same symbol name appears in several shared objects, the most recently loaded occurrence is used. The PACKAGE
argument and registration (see the next section) provide good ways to avoid any ambiguity in which occurrence is meant.
On Unix-alikes the paths used to resolve dynamically linked dependent libraries are fixed (for security reasons) when the process is launched, so dyn.load
will only look for such libraries in the locations set by the R shell script (via etc/ldpaths) and in the OS-specific defaults.
Windows allows more control (and less security) over where dependent DLLs are looked for. On all versions this includes the PATH
environment variable, but with lowest priority: note that it does not include the directory from which the DLL was loaded. It is possible to add a single path with quite high priority via the DLLpath
argument to dyn.load
. This is (by default) used by library.dynam
to include the package’s libs/i386 or libs/x64 directory in the DLL search path.
5.4 Registering native routines
By ‘native’ routine, we mean an entry point in compiled code.
In calls to .C
, .Call
, .Fortran
and .External
, R must locate the specified native routine by looking in the appropriate shared object/DLL. By default, R uses the operating-system-specific dynamic loader to lookup the symbol in all120 loaded DLLs and the R executable or libraries it is linked to. Alternatively, the author of the DLL can explicitly register routines with R and use a single, platform-independent mechanism for finding the routines in the DLL. One can use this registration mechanism to provide additional information about a routine, including the number and type of the arguments, and also make it available to R programmers under a different name.
Registering routines has two main advantages: it provides a faster121 way to find the address of the entry point via tables stored in the DLL at compilation time, and it provides a run-time check that the entry point is called with the right number of arguments and, optionally, the right argument types.
To register routines with R, one calls the C routine R_registerRoutines
. This is typically done when the DLL is first loaded within the initialization routine R_init_dll name
described in dyn.load and dyn.unload. R_registerRoutines
takes 5 arguments. The first is the DllInfo
object passed by R to the initialization routine. This is where R stores the information about the methods. The remaining 4 arguments are arrays describing the routines for each of the 4 different interfaces: .C
, .Call
, .Fortran
and .External
. Each argument is a NULL
-terminated array of the element types given in the following table:
.C
R_CMethodDef
.Call
R_CallMethodDef
.Fortran
R_FortranMethodDef
.External
R_ExternalMethodDef
Currently, the R_ExternalMethodDef
type is the same as R_CallMethodDef
type and contains fields for the name of the routine by which it can be accessed in R, a pointer to the actual native symbol (i.e., the routine itself), and the number of arguments the routine expects to be passed from R. For example, if we had a routine named myCall
defined as
SEXP myCall(SEXP a, SEXP b, SEXP c);
we would describe this as
static const R_CallMethodDef callMethods[] = {
{"myCall", (DL_FUNC) &myCall, 3},
{NULL, NULL, 0}
};
along with any other routines for the .Call
interface. For routines with a variable number of arguments invoked via the .External
interface, one specifies -1
for the number of arguments which tells R not to check the actual number passed.
Routines for use with the .C
and .Fortran
interfaces are described with similar data structures, but which have two additional fields for describing the type and “style” of each argument. Each of these can be omitted. However, if specified, each should be an array with the same number of elements as the number of parameters for the routine. The types array should contain the SEXP
types describing the expected type of the argument. (Technically, the elements of the types array are of type R_NativePrimitiveArgType
which is just an unsigned integer.) The R types and corresponding type identifiers are provided in the following table:
numeric
REALSXP
integer
INTSXP
logical
LGLSXP
single
SINGLESXP
character
STRSXP
list
VECSXP
Consider a C routine, myC
, declared as
void myC(double *x, int *n, char **names, int *status);
We would register it as
static R_NativePrimitiveArgType myC_t[] = {
REALSXP, INTSXP, STRSXP, LGLSXP
};
static const R_CMethodDef cMethods[] = {
{"myC", (DL_FUNC) &myC, 4, myC_t},
{NULL, NULL, 0, NULL}
};
Note that .Fortran
entry points are mapped to lowercase, so registration should use lowercase only.
Having created the arrays describing each routine, the last step is to actually register them with R. We do this by calling R_registerRoutines
. For example, if we have the descriptions above for the routines accessed by the .C
and .Call
we would use the following code:
void
R_init_myLib(DllInfo *info)
{
R_registerRoutines(info, cMethods, callMethods, NULL, NULL);
}
This routine will be invoked when R loads the shared object/DLL named myLib
. The last two arguments in the call to R_registerRoutines
are for the routines accessed by .Fortran
and .External
interfaces. In our example, these are given as NULL
since we have no routines of these types.
When R unloads a shared object/DLL, its registrations are removed. There is no other facility for unregistering a symbol.
Examples of registering routines can be found in the different packages in the R source tree (e.g., stats and graphics). Also, there is a brief, high-level introduction in R News (volume 1/3, September 2001, pages 20–23, https://www.r-project.org/doc/Rnews/Rnews_2001-3.pdf).
Once routines are registered, they can be referred to as R objects if they this is arranged in the useDynLib
call in the package’s NAMESPACE file (see useDynLib). So for example the stats package has
# Refer to all C/Fortran routines by their name prefixed by C_
useDynLib(stats, .registration = TRUE, .fixes = "C_")
in its NAMESPACE file, and then ansari.test
’s default methods can contain
pansari <- function(q, m, n)
.C(C_pansari, as.integer(length(q)), p = as.double(q),
as.integer(m), as.integer(n))$p
This avoids the overhead of looking up an entry point each time it is used, and ensures that the entry point in the package is the one used (without a PACKAGE = “pkg”
argument).
R_init_
routines are often of the form
void attribute_visible R_init_mypkg(DllInfo *dll)
{
R_registerRoutines(dll, CEntries, CallEntries, FortEntries,
ExternalEntries);
R_useDynamicSymbols(dll, FALSE);
R_forceSymbols(dll, TRUE);
...
}
The R_useDynamicSymbols
call says the DLL is not to be searched for entry points specified by character strings so .C
etc calls will only find registered symbols: the R_forceSymbols
call only allows .C
etc calls which specify entry points by R objects such as C_pansari
(and not by character strings). Each provides some protection against accidentally finding your entry points when people supply a character string without a package, and avoids slowing down such searches. Routine R_forceSymbols
is available from R 3.0.0, so packages using it should have a dependency on at least ‘R (>= 3.0.0)’. (For the visibility attribute see Controlling visibility.)
In more detail, if a package mypkg
contains entry points reg
and unreg
and the first is registered as a 0-argument .Call
routine, we could use (from code in the package)
.Call("reg")
.Call("unreg")
Without or with registration, these will both work. If R_init_mypkg
calls R_useDynamicSymbols(dll, FALSE)
, only the first will work. If in addition to registration the NAMESPACE file contains
useDynLib(mypkg, .registration = TRUE, .fixes = "C_")
then we can call .Call(C_reg)
. Finally, if R_init_mypkg
also calls R_forceSymbols(dll, TRUE)
, only .Call(C_reg)
will work (and not .Call(“reg”)
). This is usually what we want: it ensures that all of our own .Call
calls go directly to the intended code in our package and that no one else accidentally finds our entry points. (Should someone need to call our code from outside the package, for example for debugging, they can use .Call(mypkg:::C_reg)
.)
5.4.1 Speed considerations
Sometimes registering native routines or using a PACKAGE
argument can make a large difference. The results can depend quite markedly on the OS (and even if it is 32- or 64-bit), on the version of R and what else is loaded into R at the time.
To fix ideas, first consider x84_64
OS 10.7 and R 2.15.2. A simple .Call
function might be
foo <- function(x) .Call("foo", x)
with C code
#include <Rinternals.h>
SEXP foo(SEXP x)
{
return x;
}
If we compile with by R CMD SHLIB foo.c
, load the code by dyn.load(“foo.so”)
and run foo(pi)
it took around 22 microseconds (us). Specifying the DLL by
foo2 <- function(x) .Call("foo", x, PACKAGE = "foo")
reduced the time to 1.7 us.
Now consider making these functions part of a package whose NAMESPACE file uses useDynlib(foo)
. This immediately reduces the running time as “foo”
will be preferentially looked for foo.dll. Without specifying PACKAGE
it took about 5 us (it needs to fathom out the appropriate DLL each time it is invoked but it does not need to search all DLLs), and with the PACKAGE
argument it is again about 1.7 us.
Next suppose the package has registered the native routine foo
. Then foo()
still has to find the appropriate DLL but can get to the entry point in the DLL faster, in about 4.2 us. And foo2()
now takes about 1 us. If we register the symbols in the NAMESPACE file and use
foo3 <- function(x) .Call(C_foo, x)
then the address for the native routine is looked up just once when the package is loaded, and foo3(pi)
takes about 0.8 us.
Versions using .C()
rather than .Call()
took about 0.2 us longer.
These are all quite small differences, but C routines are not uncommonly invoked millions of times for run times of a few microseconds each, and those doing such things may wish to be aware of the differences.
On Linux and Solaris there is a smaller overhead in looking up symbols.
Symbol lookup on Windows used to be far slower, so R maintains a small cache. If the cache is currently empty enough that the symbol can be stored in the cache then the performance is similar to Linux and Solaris: if not it may be slower. R’s own code always uses registered symbols and so these never contribute to the cache: however many other packages do rely on symbol lookup.
In more recent versions of R all the standard packages register native symbols and do not allow symbol search, so in a new session foo()
can only look in foo.so and may be as fast as foo2()
. This will no longer apply when many contributed packages are loaded, and generally those last loaded are searched first. For example, consider R 3.3.2 on x86_64 Linux. In an empty R session, both foo()
and foo2()
took about 0.75 us; however after packages igraph and spatstat had been loaded (which loaded another 12 DLLs), foo()
took 3.6 us but foo2()
still took about 0.80 us. Using registration in a package reduced this to 0.55 us and foo3()
took 0.40 us, times which were unchanged when further packages were loaded.
5.4.2 Example: converting a package to use registration
The splines package was converted to use symbol registration in 2001, but we can use it as an example122 of what needs to be done for a small package.
-
Find the relevant entry points. This is somewhat OS-specific, but something like the following should be possible at the OS command-line
nm -g /path/to/splines.so | grep " T " 000000000002670 T _spline_basis 000000000001ec0 T _spline_value
This indicates that there are two relevant entry points. (They may or may not have a leading underscore, as here. Fortran entry points will have a trailing underscore.) Check in the R code that they are called by the package and how: in this case they are used by
.Call
.Alternatively, examine the package’s R code for all
.C
,.Fortran
,.Call
and.External
calls. -
Construct the registration table. First write skeleton registration code, conventionally in file src/init.c (or at the end of the only C source file in the package: if included in a C++ file the ‘R_init’ function would need to be declared
extern “C”
):#include <stdlib.h> // for NULL #include <R_ext/Rdynload.h> #define CALLDEF(name, n) {#name, (DL_FUNC) &name, n} static const R_CallMethodDef R_CallDef[] = { CALLDEF(spline_basis, ?), CALLDEF(spline_value, ?), {NULL, NULL, 0} }; void R_init_splines(DllInfo *dll) { R_registerRoutines(dll, NULL, R_CallDef, NULL, NULL); }
and then replace the
?
in the skeleton with the actual numbers of arguments. You will need to add declarations (also known as ‘prototypes’) of the functions unless appending to the only C source file. Some packages will already have these in a header file, or you could create one and include it in init.c, for example splines.h containing#include <Rinternals.h> // for SEXP extern SEXP spline_basis(SEXP knots, SEXP order, SEXP xvals, SEXP derivs); extern SEXP spline_value(SEXP knots, SEXP coeff, SEXP order, SEXP x, SEXP deriv);
Tools are available to extract declarations, at least for C and C++ code: see the help file for
package_native_routine_registration_skeleton
in package tools. Here we could have usedcproto -I/path/to/R/include -e splines.c
For examples of registering other types of calls, see packages graphics and stats. In particular, when registering entry points for
.Fortran
one needs declarations as if called from C, such as#include <R_ext/RS.h> void F77_NAME(supsmu)(int *n, double *x, double *y, double *w, int *iper, double *span, double *alpha, double *smo, double *sc, double *edf);
One can get away with inaccurate argument lists in the declarations: it is easy to specify the arguments for
.Call
(allSEXP
) and.External
(oneSEXP
) and as the arguments for.C
and.Fortran
are all pointers, specifying them asvoid *
suffices. (For most platforms one can omit all the arguments.) -
(Optional but highly recommended.) Restrict
.Call
etc to using the symbols you chose to register by editing src/init.c to containvoid R_init_splines(DllInfo *dll) { R_registerRoutines(dll, NULL, R_CallDef, NULL, NULL); R_useDynamicSymbols(dll, FALSE); }
A skeleton for the steps so far can be made using package_native_routine_registration_skeleton
in package tools. This will optionally create declarations based on the usage in the R code.
The remaining steps are optional but recommended.
-
Edit the NAMESPACE file to create R objects for the registered symbols:
useDynLib(splines, .registration = TRUE, .fixes = "C_")
-
Find all the relevant calls in the R code and edit them to use the R objects. This entailed changing the lines
temp <- .Call("spline_basis", knots, ord, x, derivs, PACKAGE = "splines") y[accept] <- .Call("spline_value", knots, coeff, ord, x[accept], deriv, PACKAGE = "splines") y = .Call("spline_value", knots, coef(object), ord, x, deriv, PACKAGE = "splines")
to
temp <- .Call(C_spline_basis, knots, ord, x, derivs) y[accept] <- .Call(C_spline_value, knots, coeff, ord, x[accept], deriv) y = .Call(C_spline_value, knots, coef(object), ord, x, deriv)
Check that there is no
exportPattern
directive which unintentionally exports the newly created R objects. -
Restrict
.Call
to using the R symbols by editing src/init.c to containvoid R_init_splines(DllInfo *dll) { R_registerRoutines(dll, NULL, R_CallDef, NULL, NULL); R_useDynamicSymbols(dll, FALSE); R_forceSymbols(dll, TRUE); }
-
Consider visibility. On some OSes we can hide entry points from the loader, which precludes any possible name clashes and calling them accidentally (usually with incorrect arguments and crashing the R process). If we repeat the first step we now see
nm -g /path/to/splines.so | grep " T " 000000000002e00 T _R_init_splines 0000000000025e0 T _spline_basis 000000000001e20 T _spline_value
If there were any entry points not intended to be used by the package we should try to avoid exporting them, for example by making them
static
. Now the two relevant entry points are only accessed via the registration table, we can hide them. There are two ways to do so on some Unix-alikes. We can hide individual entry points via#include <R_ext/Visibility.h> SEXP attribute_hidden spline_basis(SEXP knots, SEXP order, SEXP xvals, SEXP derivs) … SEXP attribute_hidden spline_value(SEXP knots, SEXP coeff, SEXP order, SEXP x, SEXP deriv) …
Alternatively, we can change the default visibility for all C symbols by including
PKG_CFLAGS = $(C_VISIBILITY)
in src/Makevars, and then we need to allow registration by declaring
R_init_splines
to be visible:#include <R_ext/Visibility.h> void attribute_visible R_init_splines(DllInfo *dll) …
See Controlling visibility for more details, including using Fortran code and ways to restrict visibility on Windows.
-
We end up with a file src/init.c containing
#include <stdlib.h> #include <R_ext/Rdynload.h> #include <R_ext/Visibility.h> // optional #include "splines.h" #define CALLDEF(name, n) {#name, (DL_FUNC) &name, n} static const R_CallMethodDef R_CallDef[] = { CALLDEF(spline_basis, 4), CALLDEF(spline_value, 5), {NULL, NULL, 0} }; void attribute_visible // optional R_init_splines(DllInfo *dll) { R_registerRoutines(dll, NULL, R_CallDef, NULL, NULL); R_useDynamicSymbols(dll, FALSE); R_forceSymbols(dll, TRUE); }
5.4.3 Linking to native routines in other packages
In addition to registering C routines to be called by R, it can at times be useful for one package to make some of its C routines available to be called by C code in another package. The interface consists of two routines declared in header R_ext/Rdynload.h as
void R_RegisterCCallable(const char *package, const char *name,
DL_FUNC fptr);
DL_FUNC R_GetCCallable(const char *package, const char *name);
A package packA that wants to make a C routine myCfun
available to C code in other packages would include the call
R_RegisterCCallable("packA", "myCfun", myCfun);
in its initialization function R_init_packA
. A package packB that wants to use this routine would retrieve the function pointer with a call of the form
p_myCfun = R_GetCCallable("packA", "myCfun");
The author of packB is responsible for ensuring that p_myCfun
has an appropriate declaration. In the future R may provide some automated tools to simplify exporting larger numbers of routines.
A package that wishes to make use of header files in other packages needs to declare them as a comma-separated list in the field ‘LinkingTo’ in the DESCRIPTION file. This then arranges that the include directories in the installed linked-to packages are added to the include paths for C and C++ code.
It must specify123 ‘Imports’ or ‘Depends’ of those packages, for they have to be loaded124 prior to this one (so the path to their compiled code has been registered).
CRAN examples of the use of this mechanism include coxme linking to bdsmatrix and xts linking to zoo
5.5 Creating shared objects
Shared objects for loading into R can be created using R CMD SHLIB
. This accepts as arguments a list of files which must be object files (with extension .o) or sources for C, C++, FORTRAN 77, Fortran 9x, Objective C or Objective C++ (with extensions .c, .cc or .cpp, .f, .f90 or .f95, .m, and .mm or .M, respectively), or commands to be passed to the linker. See R CMD SHLIB –help (or the R help for SHLIB
) for usage information.
If compiling the source files does not work “out of the box”, you can specify additional flags by setting some of the variables PKG_CPPFLAGS
(for the C preprocessor, typically ‘-I’ flags), PKG_CFLAGS
, PKG_CXXFLAGS
, PKG_FFLAGS
, PKG_FCFLAGS
, PKG_OBJCFLAGS
, and PKG_OBJCXXFLAGS
(for the C, C++, FORTRAN 77, Fortran 9x, Objective C, and Objective C++ compilers, respectively) in the file Makevars in the compilation directory (or, of course, create the object files directly from the command line). Similarly, variable PKG_LIBS
in Makevars can be used for additional ‘-l’ and ‘-L’ flags to be passed to the linker when building the shared object. (Supplying linker commands as arguments to R CMD SHLIB
will take precedence over PKG_LIBS
in Makevars.)
It is possible to arrange to include compiled code from other languages by setting the macro ‘OBJECTS’ in file Makevars, together with suitable rules to make the objects.
Flags which are already set (for example in file etcR_ARCH/Makeconf) can be overridden by the environment variable MAKEFLAGS
(at least for systems using a POSIX-compliant make
), as in (Bourne shell syntax)
MAKEFLAGS="CFLAGS=-O3" R CMD SHLIB *.c
It is also possible to set such variables in personal Makevars files, which are read after the local Makevars and the system makefiles or in a site-wide Makevars.site file.
Note that as R CMD SHLIB
uses Make, it will not remake a shared object just because the flags have changed, and if test.c and test.f both exist in the current directory
R CMD SHLIB test.f
will compile test.c!
If the src subdirectory of an add-on package contains source code with one of the extensions listed above or a file Makevars but not a file Makefile, R CMD INSTALL
creates a shared object (for loading into R through useDynlib
in the NAMESPACE, or in the .onLoad
function of the package) using the R CMD SHLIB
mechanism. If file Makevars exists it is read first, then the system makefile and then any personal Makevars files.
If the src subdirectory of package contains a file Makefile, this is used by R CMD INSTALL
in place of the R CMD SHLIB
mechanism. make
is called with makefiles R_HOME/etcR_ARCH/Makeconf, src/Makefile and any personal Makevars files (in that order). The first target found in src/Makefile is used.
It is better to make use of a Makevars file rather than a Makefile: the latter should be needed only exceptionally.
Under Windows the same commands work, but Makevars.win will be used in preference to Makevars, and only src/Makefile.win will be used by R CMD INSTALL
with src/Makefile being ignored. For past experiences of building DLLs with a variety of compilers, see file ‘README.packages’ and http://www.stats.uwo.ca/faculty/murdoch/software/compilingDLLs/ . Under Windows you can supply an exports definitions file called dllname-win.def: otherwise all entry points in objects (but not libraries) supplied to R CMD SHLIB
will be exported from the DLL. An example is stats-win.def for the stats package: a CRAN example in package fastICA.
If you feel tempted to read the source code and subvert these mechanisms, please resist. Far too much developer time has been wasted in chasing down errors caused by failures to follow this documentation, and even more by package authors demanding explanations as to why their packages no longer work. In particular, undocumented environment or make
variables are not for use by package writers and are subject to change without notice.
5.6 Interfacing C++ code
Suppose we have the following hypothetical C++ library, consisting of the two files X.h and X.cpp, and implementing the two classes X
and Y
which we want to use in R.
// X.h class X { public: X (); ~X (); }; class Y { public: Y (); ~Y (); };
// X.cpp #include <R.h> #include "X.h" static Y y; X::X() { REprintf("constructor X\n"); } X::~X() { REprintf("destructor X\n"); } Y::Y() { REprintf("constructor Y\n"); } Y::~Y() { REprintf("destructor Y\n"); }
To use with R, the only thing we have to do is writing a wrapper function and ensuring that the function is enclosed in
extern "C" {
}
For example,
// X_main.cpp: #include "X.h" extern "C" { void X_main () { X x; } } // extern "C"
Compiling and linking should be done with the C++ compiler-linker (rather than the C compiler-linker or the linker itself); otherwise, the C++ initialization code (and hence the constructor of the static variable Y
) are not called. On a properly configured system, one can simply use
R CMD SHLIB X.cpp X_main.cpp
to create the shared object, typically X.so (the file name extension may be different on your platform). Now starting R yields
R version 2.14.1 Patched (2012-01-16 r58124)
Copyright (C) 2012 The R Foundation for Statistical Computing
...
Type "q()" to quit R.
R> dyn.load(paste("X", .Platform$dynlib.ext, sep = ""))
constructor Y
R> .C("X_main")
constructor X
destructor X
list()
R> q()
Save workspace image? [y/n/c]: y
destructor Y
The R for Windows FAQ (rw-FAQ) contains details of how to compile this example under Windows.
Earlier versions of this example used C++ iostreams: this is best avoided. There is no guarantee that the output will appear in the R console, and indeed it will not on the R for Windows console. Use R code or the C entry points (see Printing) for all I/O if at all possible. Examples have been seen where merely loading a DLL that contained calls to C++ I/O upset R’s own C I/O (for example by resetting buffers on open files).
Most R header files can be included within C++ programs but they should not be included within an extern “C”
block (as they include system headers125). The inclusion of system headers in C++ changed in R 3.3.0126, so if you care about earlier versions of R please check your package there.
Legacy header S.h cannot be used with C++.
5.6.1 External C++ code
Quite a lot of external C++ software is header-only (e.g. most of the Boost ‘libraries’ including all those supplied by package BH, and most of Armadillo as supplied by package RcppArmadillo) and so is compiled when an R package which uses it is installed. This causes few problems.
A small number of external libraries used in R packages have a C++ interface to a library of compiled code, e.g. packages rgdal and rjags. This raises many more problems! The C++ interface uses name-mangling and the ABI127 may depend on the compiler, version and even C++ defines128, so requires the package C++ code to be compiled in exactly the same way as the library (and what that was is often undocumented). Examples include use of g++
vs clang++
or Solaris’ CC
, and the two ABIs available for C++11 in g++
with different defaults for GCC 4.9 and 5.x in some Linux distributions.
Even fewer external libraries use C++ internally but present a C interface, such as rgeos. These require the C++ runtime library to be linked into the package’s shared object/DLL, and this is best done by including a dummy C++ file in the package sources.
There is a recent trend to link to the C++ interfaces offered by C software such as hdf5, pcre and ImageMagick. Their C interfaces are much preferred for portability (and can be used from C++ code). Also, the C++ interfaces are often optional in the software build or packaged separately and so users installing from package sources are far less likely to already have them installed.
5.7 Fortran I/O
We have already warned against the use of C++ iostreams not least because output is not guaranteed to appear on the R console, and this warning applies equally to Fortran (77 or 9x) output to units *
and 6
. See Printing from FORTRAN, which describes workarounds.
In the past most Fortran compilers implemented I/O on top of the C I/O system and so the two interworked successfully. This was true of g77
, but it is less true of gfortran
as used in gcc
4 and later. In particular, any package that makes use of Fortran I/O will when compiled on Windows interfere with C I/O: when the Fortran I/O is initialized (typically when the package is loaded) the C stdout
and stderr
are switched to LF line endings. (Function init
in file src/modules/lapack/init_win.c shows how to mitigate this.)
5.8 Linking to other packages
It is not in general possible to link a DLL in package packA to a DLL provided by package packB (for the security reasons mentioned in dyn.load and dyn.unload, and also because some platforms distinguish between shared objects and dynamic libraries), but it is on Windows.
Note that there can be tricky versioning issues here, as package packB could be re-installed after package packA — it is desirable that the API provided by package packB remains backwards-compatible.
Shipping a static library in package packB for other packages to link to avoids most of the difficulties.
5.8.1 Unix-alikes
It is possible to link a shared object in package packA to a library provided by package packB under limited circumstances on a Unix-alike OS. There are severe portability issues, so this is not recommended for a distributed package.
This is easiest if packB provides a static library packB/lib/libpackB.a. (Note using directory lib rather than libs is conventional, and architecture-specific sub-directories may be needed and are assumed in the sample code below. The code in the static library will need to be compiled with PIC
flags on platforms where it matters.) Then as the code from package packB is incorporated when package packA is installed, we only need to find the static library at install time for package packA. The only issue is to find package packB, and for that we can ask R by something like (long lines broken for display here)
PKGB_PATH=‘echo ’library(packB);
cat(system.file("lib", package="packB", mustWork=TRUE))' \
| "${R_HOME}/bin/R" --vanilla --slave`
PKG_LIBS="$(PKGB_PATH)$(R_ARCH)/libpackB.a"
For a dynamic library packB/lib/libpackB.so (packB/lib/libpackB.dylib on macOS: note that you cannot link to a shared object, .so, on that platform) we could use
PKGB_PATH=‘echo ’library(packB);
cat(system.file("lib", package="packB", mustWork=TRUE))' \
| "${R_HOME}/bin/R" --vanilla --slave`
PKG_LIBS=-L"$(PKGB_PATH)$(R_ARCH)" -lpackB
This will work for installation, but very likely not when package packB
is loaded, as the path to package packB’s lib directory is not in the ld.so
129 search path. You can arrange to put it there before R is launched by setting (on some platforms) LD_RUN_PATH
or LD_LIBRARY_PATH
or adding to the ld.so
cache (see man ldconfig
). On platforms that support it, the path to the directory containing the dynamic library can be hardcoded at install time (which assumes that the location of package packB will not be changed nor the package updated to a changed API). On systems with the gcc
or clang
and the GNU linker (e.g. Linux) and some others this can be done by e.g.
PKGB_PATH=‘echo ’library(packB);
cat(system.file("lib", package="packB", mustWork=TRUE)))' \
| "${R_HOME}/bin/R" --vanilla --slave`
PKG_LIBS=-L"$(PKGB_PATH)$(R_ARCH)" -Wl,-rpath,"$(PKGB_PATH)$(R_ARCH)" -lpackB
Some other systems (e.g. Solaris with its native linker) use -Rdir rather than -rpath,dir (and this is accepted by the compiler as well as the linker).
It may be possible to figure out what is required semi-automatically from the result of R CMD libtool –config
(look for ‘hardcode’).
Making headers provided by package packB available to the code to be compiled in package packA can be done by the LinkingTo
mechanism (see Registering native routines).
5.8.2 Windows
Suppose package packA wants to make use of compiled code provided by packB in DLL packB/libs/exB.dll, possibly the package’s DLL packB/libs/packB.dll. (This can be extended to linking to more than one package in a similar way.) There are three issues to be addressed:
-
Making headers provided by package packB available to the code to be compiled in package packA.
This is done by the
LinkingTo
mechanism (see Registering native routines). -
preparing
packA.dll
to link to packB/libs/exB.dll.This needs an entry in Makevars.win of the form
PKG_LIBS= -L<something> -lexB
and one possibility is that
<something>
is the path to the installed pkgB/libs directory. To find that we need to ask R where it is by something likePKGB_PATH=‘echo ’library(packB); cat(system.file("libs", package="packB", mustWork=TRUE))' \ | rterm --vanilla --slave` PKG_LIBS= -L"$(PKGB_PATH)$(R_ARCH)" -lexB
Another possibility is to use an import library, shipping with package packA an exports file exB.def. Then Makevars.win could contain
PKG_LIBS= -L. -lexB all: $(SHLIB) before before: libexB.dll.a libexB.dll.a: exB.def
and then installing package packA will make and use the import library for exB.dll. (One way to prepare the exports file is to use pexports.exe.)
-
loading packA.dll which depends on exB.dll.
If
exB.dll
was used by package packB (because it is in fact packB.dll or packB.dll depends on it) and packB has been loaded before packA, then nothing more needs to be done as exB.dll will already be loaded into the R executable. (This is the most common scenario.)More generally, we can use the
DLLpath
argument tolibrary.dynam
to ensure thatexB.dll
is found, for example by settinglibrary.dynam("packA", pkg, lib, DLLpath = system.file("libs", package="packB"))
Note that
DLLpath
can only set one path, and so for linking to two or more packages you would need to resort to setting environment variablePATH
.
5.9 Handling R objects in C
Using C code to speed up the execution of an R function is often very fruitful. Traditionally this has been done via the .C
function in R. However, if a user wants to write C code using internal R data structures, then that can be done using the .Call
and .External
functions. The syntax for the calling function in R in each case is similar to that of .C
, but the two functions have different C interfaces. Generally the .Call
interface is simpler to use, but .External
is a little more general.
A call to .Call
is very similar to .C
, for example
.Call("convolve2", a, b)
The first argument should be a character string giving a C symbol name of code that has already been loaded into R. Up to 65 R objects can passed as arguments. The C side of the interface is
#include <R.h>
#include <Rinternals.h>
SEXP convolve2(SEXP a, SEXP b)
...
A call to .External
is almost identical
.External("convolveE", a, b)
but the C side of the interface is different, having only one argument
#include <R.h>
#include <Rinternals.h>
SEXP convolveE(SEXP args)
...
Here args
is a LISTSXP
, a Lisp-style pairlist from which the arguments can be extracted.
In each case the R objects are available for manipulation via a set of functions and macros defined in the header file Rinternals.h or some S-compatibility macros130 defined in Rdefines.h. See Interface functions .Call and .External for details on .Call
and .External
.
Before you decide to use .Call
or .External
, you should look at other alternatives. First, consider working in interpreted R code; if this is fast enough, this is normally the best option. You should also see if using .C
is enough. If the task to be performed in C is simple enough involving only atomic vectors and requiring no call to R, .C
suffices. A great deal of useful code was written using just .C
before .Call
and .External
were available. These interfaces allow much more control, but they also impose much greater responsibilities so need to be used with care. Neither .Call
nor .External
copy their arguments: you should treat arguments you receive through these interfaces as read-only.
To handle R objects from within C code we use the macros and functions that have been used to implement the core parts of R. A public131 subset of these is defined in the header file Rinternals.h in the directory R_INCLUDE_DIR (default R_HOME/include) that should be available on any R installation.
A substantial amount of R, including the standard packages, is implemented using the functions and macros described here, so the R source code provides a rich source of examples and “how to do it”: do make use of the source code for inspirational examples.
It is necessary to know something about how R objects are handled in C code. All the R objects you will deal with will be handled with the type SEXP132, which is a pointer to a structure with typedef SEXPREC
. Think of this structure as a variant type that can handle all the usual types of R objects, that is vectors of various modes, functions, environments, language objects and so on. The details are given later in this section and in R Internal Structures in R Internals, but for most purposes the programmer does not need to know them. Think rather of a model such as that used by Visual Basic, in which R objects are handed around in C code (as they are in interpreted R code) as the variant type, and the appropriate part is extracted for, for example, numerical calculations, only when it is needed. As in interpreted R code, much use is made of coercion to force the variant object to the right type.
5.9.1 Handling the effects of garbage collection
We need to know a little about the way R handles memory allocation. The memory allocated for R objects is not freed by the user; instead, the memory is from time to time garbage collected. That is, some or all of the allocated memory not being used is freed or marked as re-usable.
The R object types are represented by a C structure defined by a typedef SEXPREC
in Rinternals.h. It contains several things among which are pointers to data blocks and to other SEXPREC
s. A SEXP
is simply a pointer to a SEXPREC
.
If you create an R object in your C code, you must tell R that you are using the object by using the PROTECT
macro on a pointer to the object. This tells R that the object is in use so it is not destroyed during garbage collection. Notice that it is the object which is protected, not the pointer variable. It is a common mistake to believe that if you invoked PROTECT(p)
at some point then p is protected from then on, but that is not true once a new object is assigned to p.
Protecting an R object automatically protects all the R objects pointed to in the corresponding SEXPREC
, for example all elements of a protected list are automatically protected.
The programmer is solely responsible for housekeeping the calls to PROTECT
. There is a corresponding macro UNPROTECT
that takes as argument an int
giving the number of objects to unprotect when they are no longer needed. The protection mechanism is stack-based, so UNPROTECT(n)
unprotects the last n objects which were protected. The calls to PROTECT
and UNPROTECT
must balance when the user’s code returns. R will warn about “stack imbalance in .Call”
(or .External
) if the housekeeping is wrong.
Here is a small example of creating an R numeric vector in C code:
#include <R.h>
#include <Rinternals.h>
SEXP ab;
....
ab = PROTECT(allocVector(REALSXP, 2));
REAL(ab)[0] = 123.45;
REAL(ab)[1] = 67.89;
UNPROTECT(1);
Now, the reader may ask how the R object could possibly get removed during those manipulations, as it is just our C code that is running. As it happens, we can do without the protection in this example, but in general we do not know (nor want to know) what is hiding behind the R macros and functions we use, and any of them might cause memory to be allocated, hence garbage collection and hence our object ab
to be removed. It is usually wise to err on the side of caution and assume that any of the R macros and functions might remove the object.
In some cases it is necessary to keep better track of whether protection is really needed. Be particularly aware of situations where a large number of objects are generated. The pointer protection stack has a fixed size (default 10,000) and can become full. It is not a good idea then to just PROTECT
everything in sight and UNPROTECT
several thousand objects at the end. It will almost invariably be possible to either assign the objects as part of another object (which automatically protects them) or unprotect them immediately after use.
Protection is not needed for objects which R already knows are in use. In particular, this applies to function arguments.
There is a less-used macro UNPROTECT_PTR(s)
that unprotects the object pointed to by the SEXP
s, even if it is not the top item on the pointer protection stack. This is rarely needed outside the parser (the R sources currently have three examples, one in src/main/plot3d.c).
Sometimes an object is changed (for example duplicated, coerced or grown) yet the current value needs to be protected. For these cases PROTECT_WITH_INDEX
saves an index of the protection location that can be used to replace the protected value using REPROTECT
. For example (from the internal code for optim
)
PROTECT_INDEX ipx;
....
PROTECT_WITH_INDEX(s = eval(OS->R_fcall, OS->R_env), &ipx);
REPROTECT(s = coerceVector(s, REALSXP), ipx);
Note that it is dangerous to mix UNPROTECT_PTR
with PROTECT_WITH_INDEX
, as the former changes the protection locations of objects that were protected after the one being unprotected.
There is another way to avoid the affects of garbage collection: a call to R_PreserveObject
adds an object to an internal list of objects not to be collects, and a subsequent call to R_ReleaseObject
removes it from that list. This provides a way for objects which are not returned as part of R objects to be protected across calls to compiled code: on the other hand it becomes the user’s responsibility to release them when they are no longer needed (and this often requires the use of a finalizer). It is less efficient that the normal protection mechanism, and should be used sparingly.
5.9.2 Allocating storage
For many purposes it is sufficient to allocate R objects and manipulate those. There are quite a few allocXxx
functions defined in Rinternals.h—you may want to explore them.
One that is commonly used is allocVector
, the C-level equivalent of R-level vector()
and its wrappers such as integer()
and character()
. One distinction is that whereas the R functions always initialize the elements of the vector, allocVector
only does so for lists, expressions and character vectors (the cases where the elements are themselves R objects).
If storage is required for C objects during the calculations this is best allocating by calling R_alloc
; see Memory allocation. All of these memory allocation routines do their own error-checking, so the programmer may assume that they will raise an error and not return if the memory cannot be allocated.
5.9.3 Details of R types
Users of the Rinternals.h macros will need to know how the R types are known internally. The different R data types are represented in C by SEXPTYPE. Some of these are familiar from R and some are internal data types. The usual R object modes are given in the table.
SEXPTYPE R equivalent REALSXP
numeric with storage mode double
INTSXP
integer CPLXSXP
complex LGLSXP
logical STRSXP
character VECSXP
list (generic vector) LISTSXP
pairlist DOTSXP
a ‘…’ object NILSXP
NULL SYMSXP
name/symbol CLOSXP
function or function closure ENVSXP
environment
Among the important internal SEXPTYPE
s are LANGSXP
, CHARSXP
, PROMSXP
, etc. (N.B.: although it is possible to return objects of internal types, it is unsafe to do so as assumptions are made about how they are handled which may be violated at user-level evaluation.) More details are given in R Internal Structures in R Internals.
Unless you are very sure about the type of the arguments, the code should check the data types. Sometimes it may also be necessary to check data types of objects created by evaluating an R expression in the C code. You can use functions like isReal
, isInteger
and isString
to do type checking. See the header file Rinternals.h for definitions of other such functions. All of these take a SEXP
as argument and return 1 or 0 to indicate TRUE or FALSE.
What happens if the SEXP
is not of the correct type? Sometimes you have no other option except to generate an error. You can use the function error
for this. It is usually better to coerce the object to the correct type. For example, if you find that an SEXP
is of the type INTEGER
, but you need a REAL
object, you can change the type by using
newSexp = PROTECT(coerceVector(oldSexp, REALSXP));
Protection is needed as a new object is created; the object formerly pointed to by the SEXP
is still protected but now unused.133
All the coercion functions do their own error-checking, and generate NA
s with a warning or stop with an error as appropriate.
Note that these coercion functions are not the same as calling as.numeric
(and so on) in R code, as they do not dispatch on the class of the object. Thus it is normally preferable to do the coercion in the calling R code.
So far we have only seen how to create and coerce R objects from C code, and how to extract the numeric data from numeric R vectors. These can suffice to take us a long way in interfacing R objects to numerical algorithms, but we may need to know a little more to create useful return objects.
5.9.4 Attributes
Many R objects have attributes: some of the most useful are classes and the dim
and dimnames
that mark objects as matrices or arrays. It can also be helpful to work with the names
attribute of vectors.
To illustrate this, let us write code to take the outer product of two vectors (which outer
and %o%
already do). As usual the R code is simple
out <- function(x, y)
{
storage.mode(x) <- storage.mode(y) <- "double"
.Call("out", x, y)
}
where we expect x
and y
to be numeric vectors (possibly integer), possibly with names. This time we do the coercion in the calling R code.
C code to do the computations is
#include <R.h>
#include <Rinternals.h>
SEXP out(SEXP x, SEXP y)
{
int nx = length(x), ny = length(y);
SEXP ans = PROTECT(allocMatrix(REALSXP, nx, ny));
double *rx = REAL(x), *ry = REAL(y), *rans = REAL(ans);
for(int i = 0; i < nx; i++) {
double tmp = rx[i];
for(int j = 0; j < ny; j++)
rans[i + nx*j] = tmp * ry[j];
}
UNPROTECT(1);
return ans;
}
Note the way REAL
is used: as it is a function call it can be considerably faster to store the result and index that.
However, we would like to set the dimnames
of the result. We can use
#include <R.h>
#include <Rinternals.h>
SEXP out(SEXP x, SEXP y)
{
int nx = length(x), ny = length(y);
SEXP ans = PROTECT(allocMatrix(REALSXP, nx, ny));
double *rx = REAL(x), *ry = REAL(y), *rans = REAL(ans);
for(int i = 0; i < nx; i++) {
double tmp = rx[i];
for(int j = 0; j < ny; j++)
rans[i + nx*j] = tmp * ry[j];
}
SEXP dimnames = PROTECT(allocVector(VECSXP, 2));
SET_VECTOR_ELT(dimnames, 0, getAttrib(x, R_NamesSymbol));
SET_VECTOR_ELT(dimnames, 1, getAttrib(y, R_NamesSymbol));
setAttrib(ans, R_DimNamesSymbol, dimnames);
UNPROTECT(2);
return ans;
}
This example introduces several new features. The getAttrib
and setAttrib
functions get and set individual attributes. Their second argument is a SEXP
defining the name in the symbol table of the attribute we want; these and many such symbols are defined in the header file Rinternals.h.
There are shortcuts here too: the functions namesgets
, dimgets
and dimnamesgets
are the internal versions of the default methods of names<-
, dim<-
and dimnames<-
(for vectors and arrays), and there are functions such as GetMatrixDimnames
and GetArrayDimnames
.
What happens if we want to add an attribute that is not pre-defined? We need to add a symbol for it via a call to install
. Suppose for illustration we wanted to add an attribute “version”
with value 3.0
. We could use
SEXP version;
version = PROTECT(allocVector(REALSXP, 1));
REAL(version)[0] = 3.0;
setAttrib(ans, install("version"), version);
UNPROTECT(1);
Using install
when it is not needed is harmless and provides a simple way to retrieve the symbol from the symbol table if it is already installed. However, the lookup takes a non-trivial amount of time, so consider code such as
static SEXP VerSymbol = NULL;
...
if (VerSymbol == NULL) VerSymbol = install("version");
if it is to be done frequently.
This example can be simplified by another convenience function:
SEXP version = PROTECT(ScalarReal(3.0));
setAttrib(ans, install("version"), version);
UNPROTECT(1);
5.9.5 Classes
In R the class is just the attribute named “class”
so it can be handled as such, but there is a shortcut classgets
. Suppose we want to give the return value in our example the class “mat”
. We can use
#include <R.h>
#include <Rinternals.h>
....
SEXP ans, dim, dimnames, class;
....
class = PROTECT(allocVector(STRSXP, 1));
SET_STRING_ELT(class, 0, mkChar("mat"));
classgets(ans, class);
UNPROTECT(4);
return ans;
}
As the value is a character vector, we have to know how to create that from a C character array, which we do using the function mkChar
.
5.9.6 Handling lists
Some care is needed with lists, as R moved early on from using LISP-like lists (now called “pairlists”) to S-like generic vectors. As a result, the appropriate test for an object of mode list
is isNewList
, and we need allocVector(VECSXP, n
) and not allocList(n)
.
List elements can be retrieved or set by direct access to the elements of the generic vector. Suppose we have a list object
a <- list(f = 1, g = 2, h = 3)
Then we can access a$g
as a[[2]]
by
double g;
....
g = REAL(VECTOR_ELT(a, 1))[0];
This can rapidly become tedious, and the following function (based on one in package stats) is very useful:
/* get the list element named str, or return NULL */
SEXP getListElement(SEXP list, const char *str)
{
SEXP elmt = R_NilValue, names = getAttrib(list, R_NamesSymbol);
for (int i = 0; i < length(list); i++)
if(strcmp(CHAR(STRING_ELT(names, i)), str) == 0) {
elmt = VECTOR_ELT(list, i);
break;
}
return elmt;
}
and enables us to say
double g;
g = REAL(getListElement(a, "g"))[0];
5.9.7 Handling character data
R character vectors are stored as STRSXP
s, a vector type like VECSXP
where every element is of type CHARSXP
. The CHARSXP
elements of STRSXP
s are accessed using STRING_ELT
and SET_STRING_ELT
.
CHARSXP
s are read-only objects and must never be modified. In particular, the C-style string contained in a CHARSXP
should be treated as read-only and for this reason the CHAR
function used to access the character data of a CHARSXP
returns (const char *)
(this also allows compilers to issue warnings about improper use). Since CHARSXP
s are immutable, the same CHARSXP
can be shared by any STRSXP
needing an element representing the same string. R maintains a global cache of CHARSXP
s so that there is only ever one CHARSXP
representing a given string in memory.
You can obtain a CHARSXP
by calling mkChar
and providing a nul-terminated C-style string. This function will return a pre-existing CHARSXP
if one with a matching string already exists, otherwise it will create a new one and add it to the cache before returning it to you. The variant mkCharLen
can be used to create a CHARSXP
from part of a buffer and will ensure null-termination.
Note that R character strings are restricted to 2^31 - 1
bytes, and hence so should the input to mkChar
be (C allows longer strings on 64-bit platforms).
5.9.8 Finding and setting variables
It will be usual that all the R objects needed in our C computations are passed as arguments to .Call
or .External
, but it is possible to find the values of R objects from within the C given their names. The following code is the equivalent of get(name, envir = rho)
.
SEXP getvar(SEXP name, SEXP rho)
{
SEXP ans;
if(!isString(name) || length(name) != 1)
error("name is not a single string");
if(!isEnvironment(rho))
error("rho should be an environment");
ans = findVar(installChar(STRING_ELT(name, 0)), rho);
Rprintf("first value is %f\n", REAL(ans)[0]);
return R_NilValue;
}
The main work is done by findVar
, but to use it we need to install name
as a name in the symbol table. As we wanted the value for internal use, we return NULL
.
Similar functions with syntax
void defineVar(SEXP symbol, SEXP value, SEXP rho)
void setVar(SEXP symbol, SEXP value, SEXP rho)
can be used to assign values to R variables. defineVar
creates a new binding or changes the value of an existing binding in the specified environment frame; it is the analogue of assign(symbol, value, envir = rho, inherits = FALSE)
, but unlike assign
, defineVar
does not make a copy of the object value
.134 setVar
searches for an existing binding for symbol
in rho
or its enclosing environments. If a binding is found, its value is changed to value
. Otherwise, a new binding with the specified value is created in the global environment. This corresponds to assign(symbol, value, envir = rho, inherits = TRUE)
.
5.9.9 Some convenience functions
Some operations are done so frequently that there are convenience functions to handle them. (All these are provided via the header file Rinternals.h.)
Suppose we wanted to pass a single logical argument ignore_quotes
: we could use
int ign = asLogical(ignore_quotes);
if(ign == NA_LOGICAL) error("'ignore_quotes' must be TRUE or FALSE");
which will do any coercion needed (at least from a vector argument), and return NA_LOGICAL
if the value passed was NA
or coercion failed. There are also asInteger
, asReal
and asComplex
. The function asChar
returns a CHARSXP
. All of these functions ignore any elements of an input vector after the first.
To return a length-one real vector we can use
double x;
...
return ScalarReal(x);
and there are versions of this for all the atomic vector types (those for a length-one character vector being ScalarString
with argument a CHARSXP
and mkString
with argument const char *
).
Some of the isXXXX
functions differ from their apparent R-level counterparts: for example isVector
is true for any atomic vector type (isVectorAtomic
) and for lists and expressions (isVectorList
) (with no check on attributes). isMatrix
is a test of a length-2 “dim”
attribute.
There are a series of small macros/functions to help construct pairlists and language objects (whose internal structures just differ by SEXPTYPE
). Function CONS(u, v)
is the basic building block: it constructs a pairlist from u
followed by v
(which is a pairlist or R_NilValue
). LCONS
is a variant that constructs a language object. Functions list1
to list6
construct a pairlist from one to six items, and lang1
to lang6
do the same for a language object (a function to call plus zero to five arguments). Functions elt
and lastElt
find the ith element and the last element of a pairlist, and nthcdr
returns a pointer to the nth position in the pairlist (whose CAR
is the nth item).
Functions str2type
and type2str
map R length-one character strings to and from SEXPTYPE
numbers, and type2char
maps numbers to C character strings.
5.9.9.1 Semi-internal convenience functions
There is quite a collection of functions that may be used in your C code if you are willing to adapt to rare “API” changes. These typically contain “workhorses” of their R counterparts.
Functions any_duplicated
and any_duplicated3
are fast versions of R’s any(duplicated(.))
.
Function R_compute_identical
corresponds to R’s identical
function.
5.9.10 Named objects and copying
When assignments are done in R such as
x <- 1:10
y <- x
the named object is not necessarily copied, so after those two assignments y
and x
are bound to the same SEXPREC
(the structure a SEXP
points to). This means that any code which alters one of them has to make a copy before modifying the copy if the usual R semantics are to apply. Note that whereas .C
and .Fortran
do copy their arguments (unless the dangerous dup = FALSE
is used), .Call
and .External
do not. So duplicate
is commonly called on arguments to .Call
before modifying them.
However, at least some of this copying is unneeded. In the first assignment shown, x <- 1:10
, R first creates an object with value 1:10
and then assigns it to x
but if x
is modified no copy is necessary as the temporary object with value 1:10
cannot be referred to again. R distinguishes between named and unnamed objects via a field in a SEXPREC
that can be accessed via the macros NAMED
and SET_NAMED
. This can take values
-
0
-
The object is not bound to any symbol
-
1
-
The object has been bound to exactly one symbol
-
2
-
The object has potentially been bound to two or more symbols, and one should act as if another variable is currently bound to this value.
Note the past tenses: R does not do full reference counting and there may currently be fewer bindings.
It is safe to modify the value of any SEXP
for which NAMED(foo)
is zero, and if NAMED(foo)
is two, the value should be duplicated (via a call to duplicate
) before any modification. Note that it is the responsibility of the author of the code making the modification to do the duplication, even if it is x
whose value is being modified after y <- x
.
The case NAMED(foo) == 1
allows some optimization, but it can be ignored (and duplication done whenever NAMED(foo) > 0
). (This optimization is not currently usable in user code.) It is intended for use within replacement functions. Suppose we used
x <- 1:10
foo(x) <- 3
which is computed as
x <- 1:10
x <- "foo<-"(x, 3)
Then inside “foo<-”
the object pointing to the current value of x
will have NAMED(foo)
as one, and it would be safe to modify it as the only symbol bound to it is x
and that will be rebound immediately. (Provided the remaining code in “foo<-”
make no reference to x
, and no one is going to attempt a direct call such as y <- “foo<-”(x)
.)
This mechanism is likely to be replaced in future versions of R.
5.10 Interface functions .Call
and .External
In this section we consider the details of the R/C interfaces.
These two interfaces have almost the same functionality. .Call
is based on the interface of the same name in S version 4, and .External
is based on R’s .Internal
. .External
is more complex but allows a variable number of arguments.
5.10.1 Calling .Call
Let us convert our finite convolution example to use .Call
. The calling function in R is
conv <- function(a, b) .Call("convolve2", a, b)
which could hardly be simpler, but as we shall see all the type coercion is transferred to the C code, which is
#include <R.h>
#include <Rinternals.h>
SEXP convolve2(SEXP a, SEXP b)
{
int na, nb, nab;
double *xa, *xb, *xab;
SEXP ab;
a = PROTECT(coerceVector(a, REALSXP));
b = PROTECT(coerceVector(b, REALSXP));
na = length(a); nb = length(b); nab = na + nb - 1;
ab = PROTECT(allocVector(REALSXP, nab));
xa = REAL(a); xb = REAL(b); xab = REAL(ab);
for(int i = 0; i < nab; i++) xab[i] = 0.0;
for(int i = 0; i < na; i++)
for(int j = 0; j < nb; j++) xab[i + j] += xa[i] * xb[j];
UNPROTECT(3);
return ab;
}
5.10.2 Calling .External
We can use the same example to illustrate .External
. The R code changes only by replacing .Call
by .External
conv <- function(a, b) .External("convolveE", a, b)
but the main change is how the arguments are passed to the C code, this time as a single SEXP. The only change to the C code is how we handle the arguments.
#include <R.h>
#include <Rinternals.h>
SEXP convolveE(SEXP args)
{
int i, j, na, nb, nab;
double *xa, *xb, *xab;
SEXP a, b, ab;
a = PROTECT(coerceVector(CADR(args), REALSXP));
b = PROTECT(coerceVector(CADDR(args), REALSXP));
...
}
Once again we do not need to protect the arguments, as in the R side of the interface they are objects that are already in use. The macros
first = CADR(args);
second = CADDR(args);
third = CADDDR(args);
fourth = CAD4R(args);
provide convenient ways to access the first four arguments. More generally we can use the CDR
and CAR
macros as in
args = CDR(args); a = CAR(args);
args = CDR(args); b = CAR(args);
which clearly allows us to extract an unlimited number of arguments (whereas .Call
has a limit, albeit at 65 not a small one).
More usefully, the .External
interface provides an easy way to handle calls with a variable number of arguments, as length(args)
will give the number of arguments supplied (of which the first is ignored). We may need to know the names (‘tags’) given to the actual arguments, which we can by using the TAG
macro and using something like the following example, that prints the names and the first value of its arguments if they are vector types.
SEXP showArgs(SEXP args)
{
args = CDR(args); /* skip ‘name’ */
for(int i = 0; args != R_NilValue; i++, args = CDR(args)) {
const char *name =
isNull(TAG(args)) ? "" : CHAR(PRINTNAME(TAG(args)));
SEXP el = CAR(args);
if (length(el) == 0) {
Rprintf("[%d] ‘%s’ R type, length 0\n", i+1, name);
continue;
}
switch(TYPEOF(el)) {
case REALSXP:
Rprintf("[%d] ‘%s’ %f\n", i+1, name, REAL(el)[0]);
break;
case LGLSXP:
case INTSXP:
Rprintf("[%d] ‘%s’ %d\n", i+1, name, INTEGER(el)[0]);
break;
case CPLXSXP:
{
Rcomplex cpl = COMPLEX(el)[0];
Rprintf("[%d] ‘%s’ %f + %fi\n", i+1, name, cpl.r, cpl.i);
}
break;
case STRSXP:
Rprintf("[%d] ‘%s’ %s\n", i+1, name,
CHAR(STRING_ELT(el, 0)));
break;
default:
Rprintf("[%d] ‘%s’ R type\n", i+1, name);
}
}
return R_NilValue;
}
This can be called by the wrapper function
showArgs <- function(...) invisible(.External("showArgs", ...))
Note that this style of programming is convenient but not necessary, as an alternative style is
showArgs1 <- function(...) invisible(.Call("showArgs1", list(...)))
The (very similar) C code is in the scripts.
5.10.3 Missing and special values
One piece of error-checking the .C
call does (unless NAOK
is true) is to check for missing (NA
) and IEEE special values (Inf
, -Inf
and NaN
) and give an error if any are found. With the .Call
interface these will be passed to our code. In this example the special values are no problem, as IEC60559 arithmetic will handle them correctly. In the current implementation this is also true of NA
as it is a type of NaN
, but it is unwise to rely on such details. Thus we will re-write the code to handle NA
s using macros defined in R_ext/Arith.h included by R.h.
The code changes are the same in any of the versions of convolve2
or convolveE
:
...
for(int i = 0; i < na; i++)
for(int j = 0; j < nb; j++)
if(ISNA(xa[i]) || ISNA(xb[j]) || ISNA(xab[i + j]))
xab[i + j] = NA_REAL;
else
xab[i + j] += xa[i] * xb[j];
...
Note that the ISNA
macro, and the similar macros ISNAN
(which checks for NaN
or NA
) and R_FINITE
(which is false for NA
and all the special values), only apply to numeric values of type double
. Missingness of integers, logicals and character strings can be tested by equality to the constants NA_INTEGER
, NA_LOGICAL
and NA_STRING
. These and NA_REAL
can be used to set elements of R vectors to NA
.
The constants R_NaN
, R_PosInf
and R_NegInf
can be used to set double
s to the special values.
5.11 Evaluating R expressions from C
The main function we will use is
SEXP eval(SEXP expr, SEXP rho);
the equivalent of the interpreted R code eval(expr, envir = rho)
(so rho
must be an environment), although we can also make use of findVar
, defineVar
and findFun
(which restricts the search to functions).
To see how this might be applied, here is a simplified internal version of lapply
for expressions, used as
a <- list(a = 1:5, b = rnorm(10), test = runif(100))
.Call("lapply", a, quote(sum(x)), new.env())
with C code
SEXP lapply(SEXP list, SEXP expr, SEXP rho)
{
int n = length(list);
SEXP ans;
if(!isNewList(list)) error("'list' must be a list");
if(!isEnvironment(rho)) error("'rho' should be an environment");
ans = PROTECT(allocVector(VECSXP, n));
for(int i = 0; i < n; i++) {
defineVar(install("x"), VECTOR_ELT(list, i), rho);
SET_VECTOR_ELT(ans, i, eval(expr, rho));
}
setAttrib(ans, R_NamesSymbol, getAttrib(list, R_NamesSymbol));
UNPROTECT(1);
return ans;
}
It would be closer to lapply
if we could pass in a function rather than an expression. One way to do this is via interpreted R code as in the next example, but it is possible (if somewhat obscure) to do this in C code. The following is based on the code in src/main/optimize.c.
SEXP lapply2(SEXP list, SEXP fn, SEXP rho)
{
int n = length(list);
SEXP R_fcall, ans;
if(!isNewList(list)) error("'list' must be a list");
if(!isFunction(fn)) error("'fn' must be a function");
if(!isEnvironment(rho)) error("'rho' should be an environment");
R_fcall = PROTECT(lang2(fn, R_NilValue));
ans = PROTECT(allocVector(VECSXP, n));
for(int i = 0; i < n; i++) {
SETCADR(R_fcall, VECTOR_ELT(list, i));
SET_VECTOR_ELT(ans, i, eval(R_fcall, rho));
}
setAttrib(ans, R_NamesSymbol, getAttrib(list, R_NamesSymbol));
UNPROTECT(2);
return ans;
}
used by
.Call("lapply2", a, sum, new.env())
Function lang2
creates an executable pairlist of two elements, but this will only be clear to those with a knowledge of a LISP-like language.
As a more comprehensive example of constructing an R call in C code and evaluating, consider the following fragment of printAttributes
in src/main/print.c.
/* Need to construct a call to
print(CAR(a), digits=digits)
based on the R_print structure, then eval(call, env).
See do_docall for the template for this sort of thing.
*/
SEXP s, t;
t = s = PROTECT(allocList(3));
SET_TYPEOF(s, LANGSXP);
SETCAR(t, install("print")); t = CDR(t);
SETCAR(t, CAR(a)); t = CDR(t);
SETCAR(t, ScalarInteger(digits));
SET_TAG(t, install("digits"));
eval(s, env);
UNPROTECT(1);
At this point CAR(a)
is the R object to be printed, the current attribute. There are three steps: the call is constructed as a pairlist of length 3, the list is filled in, and the expression represented by the pairlist is evaluated.
A pairlist is quite distinct from a generic vector list, the only user-visible form of list in R. A pairlist is a linked list (with CDR(t)
computing the next entry), with items (accessed by CAR(t)
) and names or tags (set by SET_TAG
). In this call there are to be three items, a symbol (pointing to the function to be called) and two argument values, the first unnamed and the second named. Setting the type to LANGSXP
makes this a call which can be evaluated.
5.11.1 Zero-finding
In this section we re-work the example of Becker, Chambers & Wilks (1988, pp.~205–10) on finding a zero of a univariate function. The R code and an example are
zero <- function(f, guesses, tol = 1e-7) {
f.check <- function(x) {
x <- f(x)
if(!is.numeric(x)) stop("Need a numeric result")
as.double(x)
}
.Call("zero", body(f.check), as.double(guesses), as.double(tol),
new.env())
}
cube1 <- function(x) (x^2 + 1) * (x - 1.5)
zero(cube1, c(0, 5))
where this time we do the coercion and error-checking in the R code. The C code is
SEXP mkans(double x)
{
// no need for PROTECT() here, as REAL(.) does not allocate:
SEXP ans = allocVector(REALSXP, 1);
REAL(ans)[0] = x;
return ans;
}
double feval(double x, SEXP f, SEXP rho)
{
// a version with (too) much PROTECT()ion .. "better safe than sorry"
SEXP symbol, value;
PROTECT(symbol = install("x"));
PROTECT(value = mkans(x));
defineVar(symbol, value, rho);
UNPROTECT(2);
return(REAL(eval(f, rho))[0]);
}
SEXP zero(SEXP f, SEXP guesses, SEXP stol, SEXP rho)
{
double x0 = REAL(guesses)[0], x1 = REAL(guesses)[1],
tol = REAL(stol)[0];
double f0, f1, fc, xc;
if(tol <= 0.0) error("non-positive tol value");
f0 = feval(x0, f, rho); f1 = feval(x1, f, rho);
if(f0 == 0.0) return mkans(x0);
if(f1 == 0.0) return mkans(x1);
if(f0*f1 > 0.0) error("x[0] and x[1] have the same sign");
for(;;) {
xc = 0.5*(x0+x1);
if(fabs(x0-x1) < tol) return mkans(xc);
fc = feval(xc, f, rho);
if(fc == 0) return mkans(xc);
if(f0*fc > 0.0) {
x0 = xc; f0 = fc;
} else {
x1 = xc; f1 = fc;
}
}
}
5.11.2 Calculating numerical derivatives
We will use a longer example (by Saikat DebRoy) to illustrate the use of evaluation and .External
. This calculates numerical derivatives, something that could be done as effectively in interpreted R code but may be needed as part of a larger C calculation.
An interpreted R version and an example are
numeric.deriv <- function(expr, theta, rho=sys.frame(sys.parent()))
{
eps <- sqrt(.Machine$double.eps)
ans <- eval(substitute(expr), rho)
grad <- matrix(, length(ans), length(theta),
dimnames=list(NULL, theta))
for (i in seq_along(theta)) {
old <- get(theta[i], envir=rho)
delta <- eps * max(1, abs(old))
assign(theta[i], old+delta, envir=rho)
ans1 <- eval(substitute(expr), rho)
assign(theta[i], old, envir=rho)
grad[, i] <- (ans1 - ans)/delta
}
attr(ans, "gradient") <- grad
ans
}
omega <- 1:5; x <- 1; y <- 2
numeric.deriv(sin(omega*x*y), c("x", "y"))
where expr
is an expression, theta
a character vector of variable names and rho
the environment to be used.
For the compiled version the call from R will be
.External("numeric_deriv", expr, theta, rho)
with example usage
.External("numeric_deriv", quote(sin(omega*x*y)),
c("x", "y"), .GlobalEnv)
Note the need to quote the expression to stop it being evaluated in the caller.
Here is the complete C code which we will explain section by section.
#include <R.h> /* for DOUBLE_EPS */
#include <Rinternals.h>
SEXP numeric_deriv(SEXP args)
{
SEXP theta, expr, rho, ans, ans1, gradient, par, dimnames;
double tt, xx, delta, eps = sqrt(DOUBLE_EPS), *rgr, *rans;
int i, start;
expr = CADR(args);
if(!isString(theta = CADDR(args)))
error("theta should be of type character");
if(!isEnvironment(rho = CADDDR(args)))
error("rho should be an environment");
ans = PROTECT(coerceVector(eval(expr, rho), REALSXP));
gradient = PROTECT(allocMatrix(REALSXP, LENGTH(ans), LENGTH(theta)));
rgr = REAL(gradient); rans = REAL(ans);
for(i = 0, start = 0; i < LENGTH(theta); i++, start += LENGTH(ans)) {
par = PROTECT(findVar(installChar(STRING_ELT(theta, i)), rho));
tt = REAL(par)[0];
xx = fabs(tt);
delta = (xx < 1) ? eps : xx*eps;
REAL(par)[0] += delta;
ans1 = PROTECT(coerceVector(eval(expr, rho), REALSXP));
for(int j = 0; j < LENGTH(ans); j++)
rgr[j + start] = (REAL(ans1)[j] - rans[j])/delta;
REAL(par)[0] = tt;
UNPROTECT(2); /* par, ans1 */
}
dimnames = PROTECT(allocVector(VECSXP, 2));
SET_VECTOR_ELT(dimnames, 1, theta);
dimnamesgets(gradient, dimnames);
setAttrib(ans, install("gradient"), gradient);
UNPROTECT(3); /* ans gradient dimnames */
return ans;
}
The code to handle the arguments is
expr = CADR(args);
if(!isString(theta = CADDR(args)))
error("theta should be of type character");
if(!isEnvironment(rho = CADDDR(args)))
error("rho should be an environment");
Note that we check for correct types of theta
and rho
but do not check the type of expr
. That is because eval
can handle many types of R objects other than EXPRSXP
. There is no useful coercion we can do, so we stop with an error message if the arguments are not of the correct mode.
The first step in the code is to evaluate the expression in the environment rho
, by
ans = PROTECT(coerceVector(eval(expr, rho), REALSXP));
We then allocate space for the calculated derivative by
gradient = PROTECT(allocMatrix(REALSXP, LENGTH(ans), LENGTH(theta)));
The first argument to allocMatrix
gives the SEXPTYPE
of the matrix: here we want it to be REALSXP
. The other two arguments are the numbers of rows and columns. (Note that LENGTH
is intended to be used for vectors: length
is more generally applicable.)
for(i = 0, start = 0; i < LENGTH(theta); i++, start += LENGTH(ans)) {
par = PROTECT(findVar(installChar(STRING_ELT(theta, i)), rho));
Here, we are entering a for loop. We loop through each of the variables. In the for
loop, we first create a symbol corresponding to the i
’th element of the STRSXP
theta
. Here, STRING_ELT(theta, i)
accesses the i
’th element of the STRSXP
theta
. Macro CHAR()
extracts the actual character representation135 of it: it returns a pointer. We then install the name and use findVar
to find its value.
tt = REAL(par)[0];
xx = fabs(tt);
delta = (xx < 1) ? eps : xx*eps;
REAL(par)[0] += delta;
ans1 = PROTECT(coerceVector(eval(expr, rho), REALSXP));
We first extract the real value of the parameter, then calculate delta
, the increment to be used for approximating the numerical derivative. Then we change the value stored in par
(in environment rho
) by delta
and evaluate expr
in environment rho
again. Because we are directly dealing with original R memory locations here, R does the evaluation for the changed parameter value.
for(int j = 0; j < LENGTH(ans); j++)
rgr[j + start] = (REAL(ans1)[j] - rans[j])/delta;
REAL(par)[0] = tt;
UNPROTECT(2);
}
Now, we compute the i
’th column of the gradient matrix. Note how it is accessed: R stores matrices by column (like FORTRAN).
dimnames = PROTECT(allocVector(VECSXP, 2));
SET_VECTOR_ELT(dimnames, 1, theta);
dimnamesgets(gradient, dimnames);
setAttrib(ans, install("gradient"), gradient);
UNPROTECT(3);
return ans;
}
First we add column names to the gradient matrix. This is done by allocating a list (a VECSXP
) whose first element, the row names, is NULL
(the default) and the second element, the column names, is set as theta
. This list is then assigned as the attribute having the symbol R_DimNamesSymbol
. Finally we set the gradient matrix as the gradient attribute of ans
, unprotect the remaining protected locations and return the answer ans
.
5.12 Parsing R code from C
Suppose an R extension want to accept an R expression from the user and evaluate it. The previous section covered evaluation, but the expression will be entered as text and needs to be parsed first. A small part of R’s parse interface is declared in header file R_ext/Parse.h136.
An example of the usage can be found in the (example) Windows package windlgs included in the R source tree. The essential part is
#include <R.h>
#include <Rinternals.h>
#include <R_ext/Parse.h>
SEXP menu_ttest3()
{
char cmd[256];
SEXP cmdSexp, cmdexpr, ans = R_NilValue;
ParseStatus status;
...
if(done == 1) {
cmdSexp = PROTECT(allocVector(STRSXP, 1));
SET_STRING_ELT(cmdSexp, 0, mkChar(cmd));
cmdexpr = PROTECT(R_ParseVector(cmdSexp, -1, &status, R_NilValue));
if (status != PARSE_OK) {
UNPROTECT(2);
error("invalid call %s", cmd);
}
/* Loop is needed here as EXPSEXP will be of length > 1 */
for(int i = 0; i < length(cmdexpr); i++)
ans = eval(VECTOR_ELT(cmdexpr, i), R_GlobalEnv);
UNPROTECT(2);
}
return ans;
}
Note that a single line of text may give rise to more than one R expression.
R_ParseVector
is essentially the code used to implement parse(text=)
at R level. The first argument is a character vector (corresponding to text
) and the second the maximal number of expressions to parse (corresponding to n
). The third argument is a pointer to a variable of an enumeration type, and it is normal (as parse
does) to regard all values other than PARSE_OK
as an error. Other values which might be returned are PARSE_INCOMPLETE
(an incomplete expression was found) and PARSE_ERROR
(a syntax error), in both cases the value returned being R_NilValue
. The fourth argument is a length one character vector to be used as a filename in error messages, a srcfile
object or the R NULL
object (as in the example above). If a srcfile
object was used, a srcref
attribute would be attached to the result, containing a list of srcref
objects of the same length as the expression, to allow it to be echoed with its original formatting.
5.12.1 Accessing source references
The source references added by the parser are recorded by R’s evaluator as it evaluates code. Two functions make these available to debuggers running C code:
SEXP R_GetCurrentSrcref(int skip);
This function checks R_Srcref
and the current evaluation stack for entries that contain source reference information. The skip
argument tells how many source references to skip before returning the SEXP
of the srcref
object, counting from the top of the stack. If skip < 0
, abs(skip)
locations are counted up from the bottom of the stack. If too few or no source references are found, NULL
is returned.
SEXP R_GetSrcFilename(SEXP srcref);
This function extracts the filename from the source reference for display, returning a length 1 character vector containing the filename. If no name is found, “”
is returned.
5.13 External pointers and weak references
The SEXPTYPE
s EXTPTRSXP
and WEAKREFSXP
can be encountered at R level, but are created in C code.
External pointer SEXP
s are intended to handle references to C structures such as ‘handles’, and are used for this purpose in package RODBC for example. They are unusual in their copying semantics in that when an R object is copied, the external pointer object is not duplicated. (For this reason external pointers should only be used as part of an object with normal semantics, for example an attribute or an element of a list.)
An external pointer is created by
SEXP R_MakeExternalPtr(void *p, SEXP tag, SEXP prot);
where p
is the pointer (and hence this cannot portably be a function pointer), and tag
and prot
are references to ordinary R objects which will remain in existence (be protected from garbage collection) for the lifetime of the external pointer object. A useful convention is to use the tag
field for some form of type identification and the prot
field for protecting the memory that the external pointer represents, if that memory is allocated from the R heap. Both tag
and prot
can be R_NilValue
, and often are.
An alternative way as from R 3.4.0 to create an external pointer from a function pointer is
typedef void * (*R_DL_FUNC)();
SEXP R_MakeExternalPtrFn(R_DL_FUNC p, SEXP tag, SEXP prot);
The elements of an external pointer can be accessed and set via
void *R_ExternalPtrAddr(SEXP s);
DL_FUNC R_ExternalPtrAddrFn(SEXP s);
SEXP R_ExternalPtrTag(SEXP s);
SEXP R_ExternalPtrProtected(SEXP s);
void R_ClearExternalPtr(SEXP s);
void R_SetExternalPtrAddr(SEXP s, void *p);
void R_SetExternalPtrTag(SEXP s, SEXP tag);
void R_SetExternalPtrProtected(SEXP s, SEXP p);
Clearing a pointer sets its value to the C NULL
pointer.
An external pointer object can have a finalizer, a piece of code to be run when the object is garbage collected. This can be R code or C code, and the various interfaces are, respectively.
void R_RegisterFinalizerEx(SEXP s, SEXP fun, Rboolean onexit);
typedef void (*R_CFinalizer_t)(SEXP);
void R_RegisterCFinalizerEx(SEXP s, R_CFinalizer_t fun, Rboolean onexit);
The R function indicated by fun
should be a function of a single argument, the object to be finalized. R does not perform a garbage collection when shutting down, and the onexit
argument of the extended forms can be used to ask that the finalizer be run during a normal shutdown of the R session. It is suggested that it is good practice to clear the pointer on finalization.
The only R level function for interacting with external pointers is reg.finalizer
which can be used to set a finalizer.
It is probably not a good idea to allow an external pointer to be save
d and then reloaded, but if this happens the pointer will be set to the C NULL
pointer.
Finalizers can be run at many places in the code base and much of it, including the R interpreter, is not re-entrant. So great care is needed in choosing the code to be run in a finalizer. Finalizers are marked to be run at garbage collection but only run at a somewhat safe point thereafter.
Weak references are used to allow the programmer to maintain information on entities without preventing the garbage collection of the entities once they become unreachable.
A weak reference contains a key and a value. The value is reachable is if it either reachable directly or via weak references with reachable keys. Once a value is determined to be unreachable during garbage collection, the key and value are set to R_NilValue
and the finalizer will be run later in the garbage collection.
Weak reference objects are created by one of
SEXP R_MakeWeakRef(SEXP key, SEXP val, SEXP fin, Rboolean onexit);
SEXP R_MakeWeakRefC(SEXP key, SEXP val, R_CFinalizer_t fin,
Rboolean onexit);
where the R or C finalizer are specified in exactly the same way as for an external pointer object (whose finalization interface is implemented via weak references).
The parts can be accessed via
SEXP R_WeakRefKey(SEXP w);
SEXP R_WeakRefValue(SEXP w);
void R_RunWeakRefFinalizer(SEXP w);
A toy example of the use of weak references can be found at homepage.stat.uiowa.edu/~luke/R/references/weakfinex.html
, but that is used to add finalizers to external pointers which can now be done more directly. At the time of writing no CRAN or Bioconductor package uses weak references.
5.13.1 An example
Package RODBC uses external pointers to maintain its channels, connections to databases. There can be several connections open at once, and the status information for each is stored in a C structure (pointed to by thisHandle
in the code extract below) that is returned via an external pointer as part of the RODBC ‘channel’ (as the “handle_ptr”
attribute). The external pointer is created by
SEXP ans, ptr;
ans = PROTECT(allocVector(INTSXP, 1));
ptr = R_MakeExternalPtr(thisHandle, install("RODBC_channel"), R_NilValue);
PROTECT(ptr);
R_RegisterCFinalizerEx(ptr, chanFinalizer, TRUE);
...
/* return the channel no */
INTEGER(ans)[0] = nChannels;
/* and the connection string as an attribute */
setAttrib(ans, install("connection.string"), constr);
setAttrib(ans, install("handle_ptr"), ptr);
UNPROTECT(3);
return ans;
Note the symbol given to identify the usage of the external pointer, and the use of the finalizer. Since the final argument when registering the finalizer is TRUE
, the finalizer will be run at the end of the R session (unless it crashes). This is used to close and clean up the connection to the database. The finalizer code is simply
static void chanFinalizer(SEXP ptr)
{
if(!R_ExternalPtrAddr(ptr)) return;
inRODBCClose(R_ExternalPtrAddr(ptr));
R_ClearExternalPtr(ptr); /* not really needed */
}
Clearing the pointer and checking for a NULL
pointer avoids any possibility of attempting to close an already-closed channel.
R’s connections provide another example of using external pointers, in that case purely to be able to use a finalizer to close and destroy the connection if it is no longer is use.
5.14 Vector accessor functions
The vector accessors like REAL
and INTEGER
and VECTOR_ELT
are functions when used in R extensions. (For efficiency they are macros when used in the R source code, apart from SET_STRING_ELT
and SET_VECTOR_ELT
which are always functions.)
The accessor functions check that they are being used on an appropriate type of SEXP
.
If efficiency is essential, the macro versions of the accessors can be obtained by defining ‘USE_RINTERNALS’ before including Rinternals.h. If you find it necessary to do so, please do test that your code compiles without ‘USE_RINTERNALS’ defined, as this provides a stricter test that the accessors have been used correctly. Note too that the use of ‘USE_RINTERNALS’ when the header is included in C++ code is not supported: doing so may use C99 features which are not necessarily supported by the C++ compiler. Nor is use with Rdefines.h supported.
5.15 Character encoding issues
CHARSXP
s can be marked as coming from a known encoding (Latin-1 or UTF-8). This is mainly intended for human-readable output, and most packages can just treat such CHARSXP
s as a whole. However, if they need to be interpreted as characters or output at C level then it would normally be correct to ensure that they are converted to the encoding of the current locale: this can be done by accessing the data in the CHARSXP
by translateChar
rather than by CHAR
. If re-encoding is needed this allocates memory with R_alloc
which thus persists to the end of the .Call
/.External
call unless vmaxset
is used (see Transient storage allocation).
There is a similar function translateCharUTF8
which converts to UTF-8: this has the advantage that a faithful translation is almost always possible (whereas only a few languages can be represented in the encoding of the current locale unless that is UTF-8).
There is a public interface to the encoding marked on CHARXSXPs
via
typedef enum {CE_NATIVE, CE_UTF8, CE_LATIN1, CE_SYMBOL, CE_ANY} cetype_t;
cetype_t getCharCE(SEXP);
SEXP mkCharCE(const char *, cetype_t);
Only CE_UTF8
and CE_LATIN1
are marked on CHARSXPs
(and so Rf_getCharCE
will only return one of the first three), and these should only be used on non-ASCII strings. Value CE_SYMBOL
is used internally to indicate Adobe Symbol encoding. Value CE_ANY
is used to indicate a character string that will not need re-encoding – this is used for character strings known to be in ASCII, and can also be used as an input parameter where the intention is that the string is treated as a series of bytes. (See the comments under mkChar
about the length of input allowed.)
Function
const char *reEnc(const char *x, cetype_t ce_in, cetype_t ce_out,
int subst);
can be used to re-encode character strings: like translateChar
it returns a string allocated by R_alloc
. This can translate from CE_SYMBOL
to CE_UTF8
, but not conversely. Argument subst
controls what to do with untranslatable characters or invalid input: this is done byte-by-byte with 1
indicates to output hex of the form <a0>
, and 2
to replace by .
, with any other value causing the byte to produce no output.
There is also
SEXP mkCharLenCE(const char *, size_t, cetype_t);
to create marked character strings of a given length.