R: how to rbind two huge data-frames without running out of memory

Question

I have two data-frames, df1 and df2, that each have around 10 million rows and 4 columns. I read them into R using RODBC/sqlQuery with no problems, but when I try to rbind them, I get that most dreaded of R error messages: cannot allocate memory. There has got to be a more memory-efficient way to do an rbind. Does anyone have favorite tricks they want to share? For instance, I found this example in the doc for sqldf:

# rbind
a7r <- rbind(a5r, a6r)
a7s <- sqldf("select * from a5s union all select * from a6s")

Is that the best/recommended way to do it?

Update: As JD Long ... in his ... to this question ...

Recommended answer

Rather than reading them into R at the beginning and then combining them, you could have SQLite read them and combine them before sending the result to R. That way the two files are never loaded into R individually.

# create two sample files
DF1 <- data.frame(A = 1:2, B = 2:3)
write.table(DF1, "data1.dat", sep = ",", quote = FALSE)
rm(DF1)

DF2 <- data.frame(A = 10:11, B = 12:13)
write.table(DF2, "data2.dat", sep = ",", quote = FALSE)
rm(DF2)

# now we do the real work
library(sqldf)

data1 <- file("data1.dat")
data2 <- file("data2.dat")

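# The statements run in order: the insert appends the rows of data2 to data1,
# and the final select returns the combined table to R.
# dbname = tempfile() makes sqldf use a temporary on-disk SQLite database
# rather than an in-memory one.
sqldf(c("select * from data1",
        "insert into data1 select * from data2",
        "select * from data1"),
      dbname = tempfile())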

This gives:

>  sqldf(c("select * from data1", "insert into data1 select * from data2", "select * from data1"), dbname = tempfile())
   A  B
1  1  2
2  2  3
3 10 12
4 11 13

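This shorter version also works if row order is unimportant. Note that in SQL, union also removes duplicate rows; use union all if duplicates must be kept, as rbind would keep them: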

sqldf("select * from data1 union select * from data2", dbname = tempfile())
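The difference between union and union all is easy to check on a toy case. The following is a minimal sketch, not part of the original answer: the data frames `x` and `y` are made up for illustration, and it assumes the sqldf package is installed.

```r
library(sqldf)

# Two toy data frames that share one identical row (A = 2, B = 3).
x <- data.frame(A = 1:2, B = 2:3)
y <- data.frame(A = 2:3, B = 3:4)

# `union` drops the duplicated row; `union all` keeps every row, as rbind does.
u_distinct <- sqldf("select * from x union select * from y", dbname = tempfile())
u_all <- sqldf("select * from x union all select * from y", dbname = tempfile())

nrow(u_distinct)  # 3
nrow(u_all)       # 4, the same as nrow(rbind(x, y))
```

If the two inputs can never contain identical rows, both forms return the same set of rows, though neither guarantees a particular row order.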

See the sqldf home page http://sqldf.googlecode.com and ?sqldf for more info. Pay particular attention to the file format arguments, since they are close to, but not identical to, those of read.table. Here we have used the defaults, so it was less of an issue.
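When the files do not match the defaults, the format can be stated explicitly. The sketch below is an illustration under assumptions rather than the answerer's code: it assumes the `file.format` argument of sqldf accepts `header` and `sep` components as described in ?sqldf, and the file names and the connection names `f1` and `f2` are hypothetical.

```r
library(sqldf)

# Write two small comma-separated files with a header row and no row names.
f1_path <- tempfile(fileext = ".csv")
f2_path <- tempfile(fileext = ".csv")
write.csv(data.frame(A = 1:2, B = 2:3), f1_path,
          row.names = FALSE, quote = FALSE)
write.csv(data.frame(A = 10:11, B = 12:13), f2_path,
          row.names = FALSE, quote = FALSE)

# sqldf looks up these connection objects by the names used in the SQL.
f1 <- file(f1_path)
f2 <- file(f2_path)

# State the file layout explicitly instead of relying on the defaults.
combined <- sqldf("select * from f1 union all select * from f2",
                  dbname = tempfile(),
                  file.format = list(header = TRUE, sep = ","))
```

Writing the files without row names keeps the header and the data rows the same width, which removes one source of ambiguity when SQLite imports them.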
