Fast reading and combining several files using data.table (with fread)
Problem description
I have several txt files with the same structure. I want to read them into R using fread and then combine them into one larger dataset.
## First collect all file names in a character vector
library(data.table)
# pattern is a regular expression; full.names = TRUE returns full paths for fread
all.files <- list.files(path = "C:/Users", pattern = "\\.txt$", full.names = TRUE)
## Read data using fread
readdata <- function(fn) {
  dt_temp <- fread(fn, sep = ",")
  keycols <- c("ID", "date")
  setkeyv(dt_temp, keycols) # setkeyv() takes the key columns as a character vector
  return(dt_temp)
}
## Then read every file and combine the results
mylist <- lapply(all.files, readdata)
mydata <- do.call('rbind', mylist)
The code works, but the speed is unsatisfactory. Each txt file has 1M observations and 12 fields.
If I use fread to read a single file, it's fast. But with lapply it is extremely slow, and it clearly takes much longer than reading the files one by one. What is going wrong here, and is there any way to improve the speed?
I tried llply from the plyr package, but there was not much of a speed gain.
Also, is there any syntax in data.table to achieve a vertical join, like rbind, or union in SQL?
Thanks.
Recommended answer
Use rbindlist(), which is designed to rbind a list of data.tables together:
mylist <- lapply(all.files, readdata)
mydata <- rbindlist( mylist )
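As a small illustration of what rbindlist() does, the sketch below stacks two made-up tables (the names dt_a and dt_b and their contents are invented for this example, not taken from the question's files). rbindlist() keeps duplicate rows, so it behaves like SQL's UNION ALL; wrapping the result in unique() gives the deduplicated behaviour of UNION.

```r
library(data.table)

# Two hypothetical tables with the same columns (illustrative data only)
dt_a <- data.table(ID = c(1L, 2L), date = c("2013-01-01", "2013-01-02"))
dt_b <- data.table(ID = c(2L, 3L), date = c("2013-01-02", "2013-01-03"))

# Stack by column name; idcol records which list element each row came from
stacked <- rbindlist(list(a = dt_a, b = dt_b), use.names = TRUE, idcol = "source")

# UNION ALL semantics: every row is kept, including the repeated (2, "2013-01-02")
union_all <- rbindlist(list(dt_a, dt_b))

# UNION semantics: drop duplicate rows after stacking
union_distinct <- unique(union_all)
```

Recent versions of data.table also provide funion() for taking the union of two tables directly, which may be more convenient when only a pair of tables is involved.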
And as @Roland says, do not set the key in each iteration of your function!
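To see why keying inside the reader function is wasted work, the following sketch keys two small made-up tables before stacking them. As far as I can tell, rbindlist() does not carry keys over to its result, so the per-file sort is paid for and then thrown away, whereas keying the combined table sorts only once. The table names and contents are invented for illustration.

```r
library(data.table)

# Hypothetical per-file tables, each keyed separately (as readdata() does in the question)
part1 <- data.table(ID = c(2L, 1L), date = c("2013-01-02", "2013-01-01"))
part2 <- data.table(ID = c(4L, 3L), date = c("2013-01-04", "2013-01-03"))
setkey(part1, ID, date)
setkey(part2, ID, date)

# Stack the keyed pieces and inspect whether any key survived
combined <- rbindlist(list(part1, part2))
print(key(combined))

# Key the combined table once instead
setkey(combined, ID, date)
print(key(combined))
```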
So in summary, this is best:
l <- lapply(all.files, fread, sep=",")
dt <- rbindlist( l )
setkey( dt , ID, date )
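The pattern above can be tried end to end without the original data. This is a minimal sketch, assuming comma-separated files with ID and date columns; the temporary directory, file names, and values are all made up for the demonstration.

```r
library(data.table)

# Write three small comma-separated .txt files into a temporary directory
# (stand-ins for the real files; names and contents are hypothetical)
tmp_dir <- file.path(tempdir(), "fread_demo")
dir.create(tmp_dir, showWarnings = FALSE)
for (i in 1:3) {
  fwrite(
    data.table(ID = (2:1) + 10L * i, date = c("2013-01-02", "2013-01-01"), value = i),
    file.path(tmp_dir, sprintf("part%d.txt", i))
  )
}

# full.names = TRUE returns paths fread can open from any working directory;
# the pattern is a regular expression, so the dot is escaped
demo_files <- list.files(tmp_dir, pattern = "\\.txt$", full.names = TRUE)

# Read every file, stack once, then key once
demo_list <- lapply(demo_files, fread, sep = ",")
demo_dt <- rbindlist(demo_list)
setkey(demo_dt, ID, date)
```

Keying after the stack means the sort runs a single time over the full table rather than once per file.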