Suppose we have three comma-separated text files that have information on the same three variables. Here's a look at the first few lines in one of the files:
a,b,x,y
b,B,2.49778634403711,-0.351307445767206
b,A,-0.594683138631719,1.19629936975021
b,D,1.14857619580259,0.653315728121014
c,A,0.957476532595248,0.935608419299617
It's easy to read each one in with
read.csv, and then to call the rbind (combine by
rows) function to make one big data frame. (Remember that rbind
will only work if all the data frames being combined have the same variable
names.)> one = read.csv('1.csv')
> nrow(one)
[1] 21
> two = read.csv('2.csv')
> nrow(two)
[1] 25
> three = read.csv('3.csv')
> nrow(three)
[1] 27
> big = rbind(one,two,three)
> nrow(big)
[1] 73
As I said, that was pretty easy. But now suppose we have
20 csv files that we want to read and combine. We could do what we
did with the three files, but, not only would it get tiring, there's
a chance of making an error when we have to type so many commands.
In the past, when we've had problems like this, sapply was able
to help. Let's try it here, by writing a function that will take a
number, create a filename by pasting .csv at the end, and reading
in the data:> allframes = sapply(1:20,function(x)read.csv(paste(x,'csv',sep='.')))
> head(allframes)
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
a factor,21 factor,25 factor,27 factor,25 factor,27 factor,21 factor,24
b factor,21 factor,25 factor,27 factor,25 factor,27 factor,21 factor,24
x Numeric,21 Numeric,25 Numeric,27 Numeric,25 Numeric,27 Numeric,21 Numeric,24
y Numeric,21 Numeric,25 Numeric,27 Numeric,25 Numeric,27 Numeric,21 Numeric,24
[,8] [,9] [,10] [,11] [,12] [,13] [,14]
a factor,28 factor,23 factor,23 factor,22 factor,26 factor,24 factor,23
b factor,28 factor,23 factor,23 factor,22 factor,26 factor,24 factor,23
x Numeric,28 Numeric,23 Numeric,23 Numeric,22 Numeric,26 Numeric,24 Numeric,23
There were no errors, but the result certainly is strange
looking! While we haven't talked about it before, the "s" in sapply
stands for simplify, and the problem that we've just seen is that
sapply tried too hard to simplify our result. It created an
odd matrix, where each element represents one of the columns of one of
the files we read in. Fortunately, there's a function that's closely related
to sapply called lapply. The difference between sapply and
lapply is that lapply will never try to simplify its
results. It will always return a list, the same length as its first
argument, with each element of the list resulting in the function we
passed to lapply operating on one element of the first argument.
In our case, calling lapply instead of sapply will give
us a list of length 20, where each element is the result of calling
read.csv on one of the 20 files. This is where do.call
comes in. Instead of having to pass 20 data frames to rbind, we
can use do.call to pass all 20 of them to rbind, since they
are in a list, and that's exactly what do.call is looking for.> allframes = lapply(1:20,function(x)read.csv(paste(x,'csv',sep='.')))
> sapply(allframes,nrow)
[1] 21 25 27 25 27 21 24 28 23 23 22 26 24 23 25 29 28 30 27 29
> answer = do.call(rbind,allframes)
> nrow(answer)
[1] 507
We can combine all the data frames without storing them
separately or passing them individually to rbind.
No comments:
Post a Comment