"Loop through" data.table to calculate conditional averages
Question
I want to "loop through" the rows of a data.table and calculate an average for each row. The average should be calculated based on the following mechanism:
- Look up the identifier ID in row i, ID(i).
- Look up the value of T2 in row i, T2(i).
- Calculate the average over the Data1 values in all rows j that meet these two criteria: ID(j) = ID(i) and T1(j) = T2(i).
- Enter the calculated average in the column Data2 of row i.
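library(data.table)

# Example data: T1 and T2 are recycled to 12 rows
DF = data.frame(ID=rep(c("a","b"),each=6),
                T1=rep(1:2,each=3), T2=c(1,2,3), Data1=c(1:12))
DT = data.table(DF)
DT[ , Data2:=NA_real_]   # empty result column, to be filled in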
ID T1 T2 Data1 Data2
[1,] a 1 1 1 NA
[2,] a 1 2 2 NA
[3,] a 1 3 3 NA
[4,] a 2 1 4 NA
[5,] a 2 2 5 NA
[6,] a 2 3 6 NA
[7,] b 1 1 7 NA
[8,] b 1 2 8 NA
[9,] b 1 3 9 NA
[10,] b 2 1 10 NA
[11,] b 2 2 11 NA
[12,] b 2 3 12 NA
For this simple example the result should look like this:
ID T1 T2 Data1 Data2
[1,] a 1 1 1 2
[2,] a 1 2 2 5
[3,] a 1 3 3 NA
[4,] a 2 1 4 2
[5,] a 2 2 5 5
[6,] a 2 3 6 NA
[7,] b 1 1 7 8
[8,] b 1 2 8 11
[9,] b 1 3 9 NA
[10,] b 2 1 10 8
[11,] b 2 2 11 11
[12,] b 2 3 12 NA
I think one way of doing this would be to loop through the rows, but I think that is inefficient. I've had a look at the apply() function, but I'm not sure whether it would solve my problem. I could also use a data.frame instead of a data.table if that would make it much more efficient or much easier. The real dataset contains approximately 1 million rows.
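For reference, the mechanism described in the question can be written literally as a row loop. This is only a baseline sketch on the 12-row example data, not a recommended approach: it rescans the whole table for every row, which is what makes it slow at around 1 million rows. The variable name hits is an illustrative choice, and T2 is built as integer here for simplicity.

```r
library(data.table)

# Same values as the example data in the question
DT <- data.table(
  ID    = rep(c("a", "b"), each = 6),
  T1    = rep(rep(1:2, each = 3), times = 2),
  T2    = rep(1:3, times = 4),
  Data1 = 1:12,
  Data2 = NA_real_
)

for (i in seq_len(nrow(DT))) {
  # Data1 values of all rows j with ID(j) == ID(i) and T1(j) == T2(i)
  hits <- DT$Data1[DT$ID == DT$ID[i] & DT$T1 == DT$T2[i]]
  # Leave Data2 as NA when nothing matches (mean of an empty vector is NaN)
  if (length(hits) > 0) set(DT, i = i, j = "Data2", value = mean(hits))
}

print(DT)
```

Each iteration does a full scan of ID and T1, so the total work grows with the square of the row count.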
Recommended answer
The rule of thumb is to aggregate first, and then join to that.
agg = DT[,mean(Data1),by=list(ID,T1)]
setkey(agg,ID,T1)
DT[,Data2:={JT=J(ID,T2);agg[JT,V1][[3]]}]
ID T1 T2 Data1 Data2
[1,] a 1 1 1 2
[2,] a 1 2 2 5
[3,] a 1 3 3 NA
[4,] a 2 1 4 2
[5,] a 2 2 5 5
[6,] a 2 3 6 NA
[7,] b 1 1 7 8
[8,] b 1 2 8 11
[9,] b 1 3 9 NA
[10,] b 2 1 10 8
[11,] b 2 2 11 11
[12,] b 2 3 12 NA
As you can see, it's a bit ugly in this case (but will be fast). It's planned to add drop, which will avoid the [[3]] bit, and maybe we could provide a way to tell [.data.table to evaluate i in calling scope (i.e. no self join), which would avoid the JT= bit that is needed here because ID is in both agg and DT.
[...].8.0 on R-Forge, which also avoids the need for setkey.
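The same aggregate-then-join idea can be sketched with the on= argument and an update join, which more recent data.table releases provide. This assumes a data.table version that supports on= (1.9.6 or later, to my knowledge); it is not the code from the answer above. In such versions, agg[JT, V1] should return a plain vector, so the [[3]] step would likely no longer apply. The column name avg is an illustrative choice, and T2 is built as integer here to keep the join columns the same type.

```r
library(data.table)

# Same values as the example data in the question
DT <- data.table(
  ID    = rep(c("a", "b"), each = 6),
  T1    = rep(rep(1:2, each = 3), times = 2),
  T2    = rep(1:3, times = 4),
  Data1 = 1:12
)

# Step 1: aggregate once per (ID, T1) group
agg <- DT[, .(avg = mean(Data1)), by = .(ID, T1)]

# Step 2: update join. Match DT's (ID, T2) against agg's (ID, T1) and
# copy the group mean into Data2; rows without a match are left as NA.
DT[agg, on = .(ID, T2 = T1), Data2 := i.avg]

print(DT)
```

Because the join is expressed through on=, no setkey call and no JT= workaround are needed in this variant.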