Chapter 12 Current and future directions
This chapter is for notes about possible in-progress and future changes to R: there is no commitment to release such changes, let alone to a timescale.
12.1 Long vectors
Vectors in R 2.x.y were limited to a length of 2^31 - 1 elements (about 2 billion), as the length is stored in the SEXPREC as a C int, and that type is used extensively to record lengths and element numbers, including in packages.
Note that longer vectors are effectively impossible under 32-bit platforms because of their address limit, so this section applies only on 64-bit platforms. The internals are unchanged on a 32-bit build of R.
A single object with 2^31 or more elements will take up at least 8GB of memory if integer or logical and 16GB if numeric or character, so routine use of such objects is still some way off.
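The figures above follow from the element sizes on a 64-bit platform: 4 bytes for an integer or logical element, and 8 bytes for a double or for the pointer held in each element of a character vector. A minimal sketch of the arithmetic in C; the function names are hypothetical, and only the data area is counted (the header, and for character vectors the CHARSXPs pointed to, are ignored).

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for the data area of a vector of n elements,
   each occupying elt_size bytes. Hypothetical helper, not part of R. */
uint64_t vector_data_bytes(uint64_t n, uint64_t elt_size)
{
    /* guard against wrap-around in the multiplication */
    assert(elt_size == 0 || n <= UINT64_MAX / elt_size);
    return n * elt_size;
}

/* Whole gibibytes (2^30 bytes) contained in a byte count. */
uint64_t bytes_to_gib(uint64_t bytes)
{
    return bytes >> 30;
}
```

With n = 2^31, an element size of 4 gives 8GB and an element size of 8 gives 16GB, matching the text.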
There is now some support for long vectors. This applies to raw, logical, integer, numeric and character vectors, and lists and expression vectors. (Elements of character vectors (CHARSXPs) remain limited to 2^31 - 1 bytes.) Some considerations:
- This has been implemented by recording the length (and true length) as -1 and recording the actual length as a 64-bit field at the beginning of the header. Because a fair amount of code in R uses a signed type for the length, the ‘long length’ is recorded using the signed C99 type ptrdiff_t, which is typedef-ed to R_xlen_t.
- These can in theory have 63-bit lengths, but note that current 64-bit OSes do not even theoretically offer 64-bit address spaces and there is currently a 52-bit limit (which exceeds the theoretical limit of current OSes and ensures that such lengths can be stored exactly in doubles).
- The serialization format has been changed to accommodate longer lengths, but vectors of lengths up to 2^31-1 are stored in the same way as before. Longer vectors have their length field set to -1 and followed by two 32-bit fields giving the upper and lower 32 bits of the actual length. There is currently a sanity check which limits lengths to 2^48 on unserialization.
- R_xlen_t is made available to packages in C header Rinternals.h: this should be fine in C code since C99 is required. People do try to use R internals in C++, but C++98 compilers are not required to support these types.
- Indexing can be done via the use of doubles. The internal indexing code used to work with positive integer indices (and negative, logical and matrix indices were all converted to positive integers): it now works with either INTSXP or REALSXP indices.
- length was documented to currently return an integer, possibly NA. A lot of code has been written that assumes that, and even code which calls as.integer(length(x)) before passing to .Fortran rarely checks for an NA result.
  There is a new function xlength which works for long vectors and returns a double value if the length exceeds 2^31-1. At present length returns NA for long vectors, but it may be safer to make that an error.
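The serialized length layout described in the considerations above can be sketched as follows. This is an illustration under stated assumptions, not R's serialization code: the struct and function names are hypothetical, and the sanity check is taken to reject any length above 2^48.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed form of the sanity limit applied when reading a length back. */
#define SANITY_LIMIT ((uint64_t) 1 << 48)

/* Hypothetical in-memory picture of the serialized length fields. */
typedef struct {
    int32_t  length_field; /* the ordinary length, or -1 as a marker     */
    uint32_t upper;        /* upper 32 bits, used only with the marker   */
    uint32_t lower;        /* lower 32 bits, used only with the marker   */
} encoded_length;

/* Lengths up to 2^31-1 are stored as before; longer ones use the marker. */
encoded_length encode_length(int64_t len)
{
    encoded_length e = { 0, 0, 0 };
    assert(len >= 0);
    if (len <= INT32_MAX) {
        e.length_field = (int32_t) len;
    } else {
        e.length_field = -1;
        e.upper = (uint32_t) ((uint64_t) len >> 32);
        e.lower = (uint32_t) ((uint64_t) len & 0xFFFFFFFFu);
    }
    return e;
}

/* Returns the decoded length, or -1 if the fields are invalid
   or the length fails the sanity check. */
int64_t decode_length(encoded_length e)
{
    if (e.length_field >= 0)
        return e.length_field;
    if (e.length_field != -1)
        return -1;
    uint64_t len = ((uint64_t) e.upper << 32) | e.lower;
    if (len > SANITY_LIMIT)
        return -1;
    return (int64_t) len;
}
```

A length of 2^31 is the smallest that needs the marker: it is encoded as -1 followed by an upper field of 0 and a lower field of 0x80000000.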
12.2 64-bit types
There is also some desire to be able to store larger integers in R, although the possibility of storing these as double is often overlooked (and e.g. file pointers as returned by seek are already stored as double).
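The reason a double can serve here is that, assuming IEEE 754 double precision, every integer of magnitude up to 2^53 is represented exactly. A short sketch in C of that limit; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* 2^53: integers up to this magnitude are all exactly representable
   in an IEEE 754 double (an assumption about the platform). */
#define EXACT_LIMIT ((int64_t) 1 << 53)

/* Non-zero if v lies in the range where every integer is exact. */
int in_exact_integer_range(int64_t v)
{
    return v >= -EXACT_LIMIT && v <= EXACT_LIMIT;
}

/* Non-zero if v survives conversion to double and back unchanged. */
int round_trips_through_double(int64_t v)
{
    /* stay well inside the int64_t range so that converting back is defined */
    assert(v >= -((int64_t) 1 << 62) && v <= ((int64_t) 1 << 62));
    double d = (double) v;
    return (int64_t) d == v;
}
```

Any value below 2^53, including every length permitted by the 52-bit limit mentioned in the previous section, round-trips; 2^53 + 1 is the first positive integer that does not.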
Different routes have been proposed:
- Add a new type to R and use that for lengths and indices: most likely this would be a 64-bit signed type, say longint. R’s usual implicit coercion rules would ensure that supplying an integer vector for indexing or length<- would work.
- A more radical alternative is to change the existing integer type to be 64-bit on 64-bit platforms (which was the approach taken by S-PLUS for DEC/Compaq Alpha systems). Or even on all platforms.
- Allow either integer or double values for lengths and indices, and return double only when necessary.
The third has the advantages of minimal disruption to existing code and not increasing memory requirements. In the first and third scenarios both R’s own code and user code would have to be adapted for lengths that were not of type integer, and in the third scenario the code branches for long vectors would be tested rarely.
Most users of the .C and .Fortran interfaces use as.integer for lengths and element numbers, but a few omit these in the knowledge that these were of type integer. It may be reasonable to assume that these are never intended to be used with long vectors.
The remaining interfaces will need to cope with the changed VECTOR_SEXPREC types. It seems likely that in most cases lengths are accessed by the length and LENGTH functions [27]. The current approach is to keep these returning 32-bit lengths and introduce ‘long’ versions xlength and XLENGTH which return R_xlen_t values.
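What this means for compiled code can be sketched without reference to R's headers. In the fragment below xlen_t is a hypothetical stand-in for the long length type (taken to be ptrdiff_t on a 64-bit platform), and both functions are illustrative rather than part of any R API: one checks whether a long length can safely be handed to code that expects a C int, the other loops with an index wide enough for a long vector.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for the long length type described above. */
typedef ptrdiff_t xlen_t;

/* Non-zero if n can be passed to legacy code taking a C int length. */
int length_fits_int(xlen_t n)
{
    return n >= 0 && n <= INT_MAX;
}

/* Sum n doubles, using a loop index of the long length type so that
   the loop remains correct for more than 2^31-1 elements. */
double sum_doubles(const double *x, xlen_t n)
{
    assert(n >= 0);
    assert(x != NULL || n == 0);
    double total = 0.0;
    for (xlen_t i = 0; i < n; i++)
        total += x[i];
    return total;
}
```

Code that keeps an int loop index or an int length continues to work for ordinary vectors, which is why the 32-bit accessors are retained; only code intended to handle long vectors needs the wider type.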
12.3 Large matrices
Matrices are stored as vectors and so were also limited to 2^31-1 elements. Now that longer vectors are allowed on 64-bit platforms, matrices with more elements are supported provided that each of the dimensions is no more than 2^31-1. However, not all applications can be supported.
The main problem is linear algebra done by FORTRAN code compiled with 32-bit INTEGER. Although not guaranteed, it seems that all the compilers currently used with R on a 64-bit platform allow matrices each of whose dimensions is less than 2^31 but with more than 2^31 elements, and index them correctly, and a substantial part of the support software (such as BLAS and LAPACK) also works.
There are exceptions: for example some complex LAPACK auxiliary routines do use a single INTEGER index and hence overflow silently and segfault or give incorrect results. One example is svd() on a complex matrix.
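The arithmetic behind such overflows can be illustrated in C. The sketch below does not reproduce any LAPACK routine; the function names are hypothetical. It computes the 1-based column-major position of element (i, j) in 64 bits and reports whether that position would still fit in a 32-bit INTEGER.

```c
#include <assert.h>
#include <stdint.h>

/* 1-based column-major position of element (i, j) in a matrix with
   leading dimension ld, computed in 64 bits so it cannot overflow. */
int64_t linear_index(int32_t i, int32_t j, int32_t ld)
{
    assert(i >= 1 && j >= 1 && i <= ld);
    return (int64_t) i + ((int64_t) j - 1) * (int64_t) ld;
}

/* Non-zero if that position is representable as a 32-bit INTEGER. */
int index_fits_int32(int32_t i, int32_t j, int32_t ld)
{
    return linear_index(i, j, ld) <= INT32_MAX;
}
```

For a 50000 x 50000 matrix each dimension is far below 2^31, yet the last element sits at position 2500000000, which exceeds 2^31-1: code that indexes by row and column is unaffected, whereas code that forms the single index in a 32-bit INTEGER is not.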
Since this is implementation-dependent, it is possible that optimized BLAS and LAPACK may have further restrictions, although none have yet been encountered. For matrix algebra on large matrices one almost certainly wants a machine with a lot of RAM (100s of gigabytes), many cores and a multi-threaded BLAS.