Chapter 12 Current and future directions
This chapter is for notes about possible in-progress and future changes to R: there is no commitment to release such changes, let alone to a timescale.
12.1 Long vectors
Vectors in R 2.x.y were limited to a length of 2^31 - 1 elements (about 2 billion), as the length is stored in the SEXPREC as a C int, and that type is used extensively to record lengths and element numbers, including in packages.
Note that longer vectors are effectively impossible under 32-bit platforms because of their address limit, so this section applies only on 64-bit platforms. The internals are unchanged on a 32-bit build of R.
A single object with 2^31 or more elements will take up at least 8GB of memory if integer or logical and 16GB if numeric or character, so routine use of such objects is still some way off.
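The figures above follow from simple arithmetic on the data area alone, assuming 4 bytes per integer or logical element and 8 bytes per numeric element or per character-vector element (a pointer on a 64-bit platform); the vector header is ignored:

```latex
\begin{align*}
2^{31} \times 4 \text{ bytes} &= 2^{33} \text{ bytes} = 8 \text{ GB} \\
2^{31} \times 8 \text{ bytes} &= 2^{34} \text{ bytes} = 16 \text{ GB}
\end{align*}
```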
There is now some support for long vectors. This applies to raw, logical, integer, numeric and character vectors, and lists and expression vectors. (Elements of character vectors (CHARSXPs) remain limited to 2^31 - 1 bytes.) Some considerations:
-
This has been implemented by recording the length (and true length) as -1 and recording the actual length as a 64-bit field at the beginning of the header. Because a fair amount of code in R uses a signed type for the length, the ‘long length’ is recorded using the signed C99 type ptrdiff_t, which is typedef-ed as R_xlen_t.
-
These can in theory have 63-bit lengths, but note that current 64-bit OSes do not even theoretically offer 64-bit address spaces and there is currently a 52-bit limit (which exceeds the theoretical limit of current OSes and ensures that such lengths can be stored exactly in doubles).
-
The serialization format has been changed to accommodate longer lengths, but vectors of lengths up to 2^31-1 are stored in the same way as before. Longer vectors have their length field set to -1, followed by two 32-bit fields giving the upper and lower 32 bits of the actual length. There is currently a sanity check which limits lengths to 2^48 on unserialization.
-
The type R_xlen_t is made available to packages in the C header Rinternals.h: this should be fine in C code since C99 is required. People do try to use R internals in C++, but C++98 compilers are not required to support these types.
-
Indexing can be done via the use of doubles. The internal indexing code used to work with positive integer indices (and negative, logical and matrix indices were all converted to positive integers): it now works with either INTSXP or REALSXP indices.
-
The R function length was documented as currently returning an integer, possibly NA. A lot of code has been written that assumes that, and even code which calls as.integer(length(x)) before passing to .C/.Fortran rarely checks for an NA result.
There is a new function xlength which works for long vectors and returns a double value if the length exceeds 2^31-1. At present length returns NA for long vectors, but it may be safer to make that an error.
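The serialization rule described in the list above can be sketched in C. This is only an illustration under stated assumptions, not R's serialization code: the names encoded_len, encode_len and decode_len are hypothetical, and treating a length of exactly 2^48 as acceptable is a guess at how the sanity check is bounded.

```c
#include <assert.h>
#include <stdint.h>

#define SHORT_LEN_MAX       INT64_C(2147483647)   /* 2^31 - 1 */
#define UNSERIALIZE_LEN_MAX (UINT64_C(1) << 48)   /* sanity limit (assumed inclusive) */

/* Hypothetical container for the length fields written to the stream. */
typedef struct {
    int32_t  marker;  /* the length itself, or -1 for a long vector */
    uint32_t upper;   /* upper 32 bits of the length (long vectors only) */
    uint32_t lower;   /* lower 32 bits of the length (long vectors only) */
} encoded_len;

/* Short lengths are stored as before; longer ones use -1 plus two fields. */
static encoded_len encode_len(int64_t len)
{
    assert(len >= 0);
    encoded_len e = {0, 0, 0};
    if (len <= SHORT_LEN_MAX) {
        e.marker = (int32_t) len;
    } else {
        e.marker = -1;
        e.upper = (uint32_t) ((uint64_t) len >> 32);
        e.lower = (uint32_t) ((uint64_t) len & UINT64_C(0xFFFFFFFF));
    }
    return e;
}

/* Returns the decoded length, or -1 if a long length fails the sanity check. */
static int64_t decode_len(encoded_len e)
{
    if (e.marker != -1)
        return e.marker;
    uint64_t len = ((uint64_t) e.upper << 32) | (uint64_t) e.lower;
    if (len > UNSERIALIZE_LEN_MAX)
        return -1;
    return (int64_t) len;
}
```

Because lengths up to 2^31-1 take the first branch, streams containing only ordinary vectors are laid out exactly as they were before the change.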
12.2 64-bit types
There is also some desire to be able to store larger integers in R, although the possibility of storing these as double is often overlooked (and e.g. file pointers as returned by seek are already stored as double).
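How far storage as double can go is easy to check in C. The sketch below assumes IEEE 754 doubles, whose 53-bit significand represents every integer up to 2^53 exactly; the helper name fits_exactly_in_double is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Every integer of magnitude up to 2^53 is exact in an IEEE 754 double. */
#define MAX_EXACT_INT INT64_C(9007199254740992)   /* 2^53 */

/* Returns 1 if n survives a round trip through a double unchanged. */
static int fits_exactly_in_double(int64_t n)
{
    /* Keep to a range in which the cast back to int64_t is well defined. */
    assert(n >= -(INT64_C(1) << 62) && n <= (INT64_C(1) << 62));
    double d = (double) n;
    return (int64_t) d == n;
}
```

For example, fits_exactly_in_double(MAX_EXACT_INT) holds, while MAX_EXACT_INT + 1 is rounded to a neighbouring value and the round trip fails.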
Different routes have been proposed:
-
Add a new type to R and use that for lengths and indices; most likely this would be a 64-bit signed type, say longint. R’s usual implicit coercion rules would ensure that supplying an integer vector for indexing or length<- would work.
-
A more radical alternative is to change the existing integer type to be 64-bit on 64-bit platforms (which was the approach taken by S-PLUS for DEC/Compaq Alpha systems), or even on all platforms.
-
Allow either integer or double values for lengths and indices, and return double only when necessary.
The third has the advantages of minimal disruption to existing code and not increasing memory requirements. In the first and third scenarios both R’s own code and user code would have to be adapted for lengths that were not of type integer, and in the third, code branches for long vectors would be tested rarely.
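The third route can be pictured as a length that is an integer whenever it fits and a double otherwise. The following is a minimal sketch of that idea only; the type r_length and the functions make_length and length_as_int64 are hypothetical and do not correspond to anything in R's sources.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { LEN_INTEGER, LEN_DOUBLE } len_kind;

/* Hypothetical tagged length: integer when possible, double otherwise. */
typedef struct {
    len_kind kind;
    union { int32_t i; double d; } value;
} r_length;

static r_length make_length(int64_t n)
{
    /* Stay within the 52-bit range so the double branch is always exact. */
    assert(n >= 0 && n <= (INT64_C(1) << 52));
    r_length out;
    if (n <= INT32_MAX) {
        out.kind = LEN_INTEGER;      /* the common case: unchanged for callers */
        out.value.i = (int32_t) n;
    } else {
        out.kind = LEN_DOUBLE;       /* the rarely exercised long-vector branch */
        out.value.d = (double) n;
    }
    return out;
}

/* Callers that can handle long vectors read the length back uniformly. */
static int64_t length_as_int64(r_length len)
{
    return (len.kind == LEN_INTEGER) ? (int64_t) len.value.i
                                     : (int64_t) len.value.d;
}
```

The second branch is taken only for lengths above 2^31-1, which is why code on that path would be tested rarely in practice.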
Most users of the .C and .Fortran interfaces use as.integer for lengths and element numbers, but a few omit the call in the knowledge that the values were of type integer. It may be reasonable to assume that these are never intended to be used with long vectors.
The remaining interfaces will need to cope with the changed VECTOR_SEXPREC types. It seems likely that in most cases lengths are accessed by the length and LENGTH functions. The current approach is to keep these returning 32-bit lengths and introduce ‘long’ versions xlength and XLENGTH which return R_xlen_t values.
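To make the split between the 32-bit and ‘long’ accessors concrete, here is a self-contained mock in C. It is not R's header layout or API: mock_header, mock_header_for, mock_length and mock_xlength are hypothetical names, xlen_t stands in for R_xlen_t and is assumed to be 64-bit, and returning -1 from the 32-bit accessor for a long vector is a choice made for this sketch only.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

typedef ptrdiff_t xlen_t;   /* stand-in for R_xlen_t; assumed 64-bit here */

/* Hypothetical header: not R's actual VECTOR_SEXPREC layout. */
typedef struct {
    xlen_t long_length;  /* actual length, used only when length == -1 */
    int    length;       /* 32-bit length, or -1 to flag a long vector */
} mock_header;

static mock_header mock_header_for(xlen_t n)
{
    assert(n >= 0);
    mock_header h;
    if (n <= INT_MAX) {
        h.long_length = 0;
        h.length = (int) n;
    } else {
        h.long_length = n;
        h.length = -1;
    }
    return h;
}

/* 'Long' accessor in the spirit of xlength/XLENGTH: always the full length. */
static xlen_t mock_xlength(mock_header h)
{
    return (h.length == -1) ? h.long_length : (xlen_t) h.length;
}

/* 32-bit accessor in the spirit of length/LENGTH; -1 signals a long vector. */
static int mock_length(mock_header h)
{
    return h.length;
}
```

Code that loops over elements would hold the result of the long accessor in an xlen_t variable rather than an int, so that the loop counter cannot overflow.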
See also http://homepage.cs.uiowa.edu/~luke/talks/useR10.pdf.
12.3 Large matrices
Matrices are stored as vectors and so were also limited to 2^31-1 elements. Now that longer vectors are allowed on 64-bit platforms, matrices with more elements are supported provided that each of the dimensions is no more than 2^31-1. However, not all applications can be supported.
The main problem is linear algebra done by FORTRAN code compiled with 32-bit INTEGER. Although not guaranteed, it seems that all the compilers currently used with R on a 64-bit platform allow matrices each of whose dimensions is less than 2^31 but with more than 2^31 elements, and index them correctly, and a substantial part of the support software (such as BLAS and LAPACK) also works.
There are exceptions: for example some complex LAPACK auxiliary routines do use a single INTEGER index and hence overflow silently and segfault or give incorrect results. One example is svd() on a complex matrix.
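The failure mode is easy to see by computing a single linear index for a matrix whose dimensions each fit in 32 bits but whose element count does not. The C sketch below is an illustration, not LAPACK code: it uses zero-based column-major offsets (FORTRAN indexing is one-based), and the helper names linear_index and fits_in_int32 are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Zero-based column-major offset of element (i, j) in a matrix with nrow
   rows, computed in 64-bit arithmetic so that it cannot overflow. */
static int64_t linear_index(int32_t i, int32_t j, int32_t nrow)
{
    assert(i >= 0 && j >= 0 && nrow > 0 && i < nrow);
    return (int64_t) i + (int64_t) j * (int64_t) nrow;
}

/* Returns 1 if the offset could still be held in a signed 32-bit integer. */
static int fits_in_int32(int64_t offset)
{
    return offset <= INT32_MAX;
}
```

For a 50000 x 50000 matrix, the last element has offset linear_index(49999, 49999, 50000), which is 2499999999 and exceeds 2^31-1, so a routine that stores this value in a single 32-bit INTEGER cannot address it, even though both dimensions are far below the limit.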
Since this is implementation-dependent, it is possible that optimized BLAS and LAPACK may have further restrictions, although none have yet been encountered. For matrix algebra on large matrices one almost certainly wants a machine with a lot of RAM (100s of gigabytes), many cores and a multi-threaded BLAS.