Chapter 7 Reading data from files
Large data objects will usually be read as values from external files rather than entered during an R session at the keyboard. R input facilities are simple and their requirements are fairly strict and even rather inflexible. There is a clear presumption by the designers of R that you will be able to modify your input files using other tools, such as file editors or Perl20 to fit in with the requirements of R. Generally this is very simple.
If variables are to be held mainly in data frames, as we strongly suggest they should be, an entire data frame can be read directly with the
read.table() function. There is also a more primitive input function,
scan(), that can be called directly.
For more details on importing data into R and also exporting data, see the R Data Import/Export manual.
To read an entire data frame directly, the external file will normally have a special form.
- The first line of the file should have a name for each variable in the data frame.
- Each additional line of the file has as its first item a row label and the values for each variable.
If the file has one fewer item in its first line than in its second, this arrangement is presumed to be in force. So the first few lines of a file to be read as a data frame might look as follows.
Input file form with names and row labels: Price Floor Area Rooms Age Cent.heat 1 52.00 111.0 830 5 6.2 no 2 54.75 128.0 710 5 7.5 no 3 57.50 101.0 1000 5 4.2 no 4 57.50 131.0 690 6 8.8 no 5 59.75 93.0 900 5 1.9 yes ...
By default numeric items (except row labels) are read as numeric variables and non-numeric variables, such as
Cent.heat in the example, as factors. This can be changed if necessary.
read.table() can then be used to read the data frame directly
> HousePrice <- read.table("houses.data")
Often you will want to omit including the row labels directly and use the default labels. In this case the file may omit the row label column as in the following.
Input file form without row labels: Price Floor Area Rooms Age Cent.heat 52.00 111.0 830 5 6.2 no 54.75 128.0 710 5 7.5 no 57.50 101.0 1000 5 4.2 no 57.50 131.0 690 6 8.8 no 59.75 93.0 900 5 1.9 yes ...
The data frame may then be read as
> HousePrice <- read.table("houses.data", header=TRUE)
header=TRUE option specifies that the first line is a line of headings, and hence, by implication from the form of the file, that no explicit row labels are given.
Suppose the data vectors are of equal length and are to be read in parallel. Further suppose that there are three vectors, the first of mode character and the remaining two of mode numeric, and the file is input.dat. The first step is to use
scan() to read in the three vectors as a list, as follows
> inp <- scan("input.dat", list("",0,0))
The second argument is a dummy list structure that establishes the mode of the three vectors to be read. The result, held in
inp, is a list whose components are the three vectors read in. To separate the data items into three separate vectors, use assignments like
> label <- inp[]; x <- inp[]; y <- inp[]
More conveniently, the dummy list can have named components, in which case the names can be used to access the vectors read in. For example
> inp <- scan("input.dat", list(id="", x=0, y=0))
If you wish to access the variables separately they may either be re-assigned to variables in the working frame:
> label <- inp$id; x <- inp$x; y <- inp$y
or the list may be attached at position 2 of the search path (see Attaching arbitrary lists).
If the second argument is a single value and not a list, a single vector is read in, all components of which must be of the same mode as the dummy value.
> X <- matrix(scan("light.dat", 0), ncol=5, byrow=TRUE)
There are more elaborate input facilities available and these are detailed in the manuals.
7.3 Accessing builtin datasets
Around 100 datasets are supplied with R (in package datasets), and others are available in packages (including the recommended packages supplied with R). To see the list of datasets currently available use
All the datasets supplied with R are available directly by name. However, many packages still use the obsolete convention in which
data was also used to load datasets into R, for example
and this can still be used with the standard packages (as in this example). In most cases this will load an R object of the same name. However, in a few cases it loads several objects, so see the on-line help for the object to see what to expect.
7.3.1 Loading data from other R packages
To access data from a particular package, use the
package argument, for example
data(package="rpart") data(Puromycin, package="datasets")
If a package has been attached by
library, its datasets are automatically included in the search.
User-contributed packages can be a rich source of datasets.
7.4 Editing data
When invoked on a data frame or matrix,
edit brings up a separate spreadsheet-like environment for editing. This is useful for making small changes once a data set has been read. The command
> xnew <- edit(xold)
will allow you to edit your data set
xold, and on completion the changed object is assigned to
xnew. If you want to alter the original dataset
xold, the simplest way is to use
fix(xold), which is equivalent to
xold <- edit(xold).
> xnew <- edit(data.frame())
to enter new data via the spreadsheet interface.