"Loop through" data.table to calculate conditional averages

Views: 101  Published: 2017/3/12 11:48:21  Tags: r data.table

Question

I want to "loop through" the rows of a data.table and calculate an average for each row. The average should be calculated based on the following mechanism:

Look up the identifier ID in row i (ID(i))
Look up the value of T2 in row i (T2(i))
Calculate the average over the Data1 values in all rows j, which meet these two criteria: ID(j) = ID(i) and T1(j) = T2(i)
Enter the calculated average in the column Data2 of row i

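 library(data.table)
 # T1 and T2 are shorter than ID and are recycled by data.frame()
 DF = data.frame(ID=rep(c("a","b"),each=6),
             T1=rep(1:2,each=3), T2=c(1,2,3), Data1=c(1:12))
 DT = data.table(DF)
 # placeholder column for the conditional averages
 DT[ , Data2:=NA_real_]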
     ID T1 T2  Data1 Data2
[1,]  a  1  1     1    NA
[2,]  a  1  2     2    NA
[3,]  a  1  3     3    NA
[4,]  a  2  1     4    NA
[5,]  a  2  2     5    NA
[6,]  a  2  3     6    NA
[7,]  b  1  1     7    NA
[8,]  b  1  2     8    NA
[9,]  b  1  3     9    NA
[10,] b  2  1    10    NA
[11,] b  2  2    11    NA
[12,] b  2  3    12    NA

For this simple example the result should look like this:

      ID T1 T2  Data1 Data2
[1,]  a  1  1     1    2
[2,]  a  1  2     2    5
[3,]  a  1  3     3    NA
[4,]  a  2  1     4    2
[5,]  a  2  2     5    5
[6,]  a  2  3     6    NA
[7,]  b  1  1     7    8
[8,]  b  1  2     8    11
[9,]  b  1  3     9    NA
[10,] b  2  1    10    8
[11,] b  2  2    11    11
[12,] b  2  3    12    NA

I think one way of doing this would be to loop through the rows, but I think that is inefficient. I've had a look at the apply() function, but I'm not sure whether it would solve my problem. I could also use a data.frame instead of a data.table if that would make it much more efficient or much easier. The real dataset contains approximately 1 million rows.
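
For reference, the row-by-row loop described above could be sketched as follows. This is only an illustration of the four steps in base R, not a recommended approach: it rebuilds the toy data with explicit rep() calls (an assumption made so the block runs on its own) and rescans the whole table on every iteration, which is why it scales poorly to a million rows.

```r
# Toy data from the question, built as a plain data.frame (base R only).
DF <- data.frame(
  ID    = rep(c("a", "b"), each = 6),
  T1    = rep(1:2, each = 3, times = 2),
  T2    = rep(c(1, 2, 3), times = 4),
  Data1 = 1:12
)
DF$Data2 <- NA_real_

# Row-by-row version of the four steps: easy to read, slow on large inputs.
for (i in seq_len(nrow(DF))) {
  # rows j with ID(j) == ID(i) and T1(j) == T2(i)
  hit <- DF$ID == DF$ID[i] & DF$T1 == DF$T2[i]
  # leave NA when no row qualifies (mean() of an empty vector would be NaN)
  if (any(hit)) DF$Data2[i] <- mean(DF$Data1[hit])
}
```

The loop serves mainly as a slow but transparent baseline against which a faster solution can be checked.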

Recommended answer

The rule of thumb is to aggregate first, and then join to that.

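# aggregate first: mean of Data1 per (ID, T1); the result column is named V1
agg = DT[,mean(Data1),by=list(ID,T1)]
# key agg so that J() can join on (ID, T1)
setkey(agg,ID,T1)
# join each row's (ID, T2) to agg and assign the looked-up means to Data2
DT[,Data2:={JT=J(ID,T2);agg[JT,V1][[3]]}]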
      ID T1 T2 Data1 Data2
 [1,]  a  1  1     1     2
 [2,]  a  1  2     2     5
 [3,]  a  1  3     3    NA
 [4,]  a  2  1     4     2
 [5,]  a  2  2     5     5
 [6,]  a  2  3     6    NA
 [7,]  b  1  1     7     8
 [8,]  b  1  2     8    11
 [9,]  b  1  3     9    NA
[10,]  b  2  1    10     8
[11,]  b  2  2    11    11
[12,]  b  2  3    12    NA

As you can see it's a bit ugly in this case (but will be fast). It's planned to add drop which will avoid the [[3]] bit, and maybe we could provide a way to tell [.data.table to evaluate i in calling scope (i.e. no self join) which would avoid the JT= bit which is needed here because ID is in both agg and DT.
[...].8.0 on R-Forge, so that also avoids the need for setkey.
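
In more recent data.table releases the same aggregate-then-join idea can probably be written without the JT= and [[3]] workarounds. The following is a minimal sketch, not the answer's original code: it assumes a data.table version that supports the on= argument (1.9.6 or later), the column name avg is chosen here for illustration, and the toy data is rebuilt with explicit rep() calls so the block runs on its own.

```r
library(data.table)

# Toy data from the question (Data2 is created by the join below).
DT <- data.table(
  ID    = rep(c("a", "b"), each = 6),
  T1    = rep(1:2, each = 3, times = 2),
  T2    = rep(c(1, 2, 3), times = 4),
  Data1 = 1:12
)

# Step 1: aggregate first, one mean per (ID, T1) group.
agg <- DT[, .(avg = mean(Data1)), by = .(ID, T1)]

# Step 2: join back, matching DT's T2 against agg's T1, and assign by
# reference. Rows of DT without a match (T2 == 3 here) keep NA in Data2.
DT[agg, on = .(ID, T2 = T1), Data2 := i.avg]
```

Because on= names the join columns explicitly, this form needs no setkey() call, and the i. prefix makes clear that avg comes from agg rather than from DT.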

