Expand two large data files and apply using data.table?

Views: 173 | Posted: 2017/3/12 12:30:25 | Tags: r, memory, data.table


Question

I am trying to apply a function to two data sets, df1 and df2. df1 contains columns (a, b) and can be up to 1 million rows long; df2 contains columns (x, y, z) and can also be large, anywhere from ~100 to more than 10,000 rows. I would like to apply a function foo to every combination of rows from the two data sets and then, for each row of df1, sum the results over df2.

foo <- function(a, b, x, y, z) a + b + x + y + z
df1 <- data.frame(a = 1:10, b = 11:20)
df2 <- data.frame(x = 1:5, y = 21:25, z = 31:35)

The code I am using to apply this function (taken from @jlhoward's answer to "How to avoid multiple loops with multiple variables in R") is:
foo.new <- function(p1, p2) {
  p1 = as.list(p1); p2 = as.list(p2)
  foo(p1$a, p1$b, p2$x, p2$y, p2$z)
}

indx   <- expand.grid(indx2 = seq(nrow(df2)), indx1 = seq(nrow(df1)))
result <- with(indx, foo.new(df1[indx1, ], df2[indx2, ]))
sums   <- aggregate(result, by = list(rep(seq(nrow(df1)), each = nrow(df2))), sum)

However, as df2 gets large (>1000 rows), I quickly run out of memory when computing result above (on a 64-bit PC with 32 GB of RAM).

I have read quite a bit about data.table but cannot tell whether it has a function that would help save memory: either something that replaces with() and produces a smaller object at the result step, or something that replaces expand.grid() at the indx step, which creates by far the largest object.
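
To make the memory problem concrete, the size of the index grid alone can be estimated with a few lines of arithmetic. This is only a rough sketch: grid_index_gb is a hypothetical helper, and it assumes both index columns are stored as 4-byte integers and ignores every other copy R makes along the way.

```r
# Hypothetical helper: approximate size (in GB) of the two-column integer
# index that expand.grid() builds for n1 x n2 row combinations.
grid_index_gb <- function(n1, n2, bytes_per_int = 4) {
  n_rows <- n1 * n2               # one grid row per (df1 row, df2 row) pair
  n_rows * 2 * bytes_per_int / 1e9
}

small  <- grid_index_gb(1e6, 1e3)  # df2 with 1,000 rows
large  <- grid_index_gb(1e6, 1e4)  # df2 with 10,000 rows
print(c(small = small, large = large))
```

Under these assumptions the index by itself needs about 8 GB at 1,000 rows in df2 and about 80 GB at 10,000 rows. The expanded copies df1[indx1, ] and df2[indx2, ] have the same number of rows, which is consistent with a 32 GB machine running out of memory at the result step.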

Recommended answer

Here is a data.table solution, which should be pretty fast:
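library(data.table)
# CJ() is data.table's cross join, the counterpart of expand.grid().
# Its result is sorted, so the last argument (indx2) varies fastest.
indx <- CJ(indx1 = seq(nrow(df1)), indx2 = seq(nrow(df2)))
# Evaluate foo for every row pair, then sum over df2 for each row of df1.
indx[, result := foo.new(df1[indx1, ], df2[indx2, ])][, .(sums = sum(result)), by = .(Group.1 = indx1)]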

    Group.1 sums
 1:       1  355
 2:       2  365
 3:       3  375
 4:       4  385
 5:       5  395
 6:       6  405
 7:       7  415
 8:       8  425
 9:       9  435
10:      10  445
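
The full cross join still materialises nrow(df1) * nrow(df2) rows at once. If that is too large, one possible workaround (not part of the answer above) is to expand only a block of df1 rows at a time and keep just the per-row sums. The sketch below uses base R only; sum_over_df2 and chunk_size are hypothetical names introduced for illustration, foo is assumed to be vectorised as in the question, and df1 is assumed to have at least one row.

```r
foo <- function(a, b, x, y, z) a + b + x + y + z
df1 <- data.frame(a = 1:10, b = 11:20)
df2 <- data.frame(x = 1:5, y = 21:25, z = 31:35)

# Hypothetical helper: for each row of df1, sum foo() over all rows of df2,
# expanding at most chunk_size rows of df1 at any one time.
sum_over_df2 <- function(df1, df2, chunk_size = 1000L) {
  n1 <- nrow(df1)
  n2 <- nrow(df2)
  sums <- numeric(n1)
  for (start in seq(1L, n1, by = chunk_size)) {
    rows <- start:min(start + chunk_size - 1L, n1)
    i1 <- rep(rows, each = n2)                     # df1 row index, varies slowest
    i2 <- rep(seq_len(n2), times = length(rows))   # df2 row index, varies fastest
    res <- foo(df1$a[i1], df1$b[i1], df2$x[i2], df2$y[i2], df2$z[i2])
    # Each consecutive run of n2 values belongs to one df1 row, so filling a
    # matrix with n2 rows puts one df1 row per column.
    sums[rows] <- colSums(matrix(res, nrow = n2))
  }
  sums
}

sums <- sum_over_df2(df1, df2, chunk_size = 4L)
print(sums)
```

Peak memory then scales with chunk_size * nrow(df2) rather than with the full product, at the cost of an explicit loop; the chunk size can be tuned to the available RAM.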

