Calculate mean difference per row and per group
Problem description
I have a data.frame with many rows and columns, and I want to calculate the mean difference of each value to each of the other values within its group.
Here is an example:
ID value
1 4
1 5
1 7
2 8
2 6
2 5
2 6
This is what I want to calculate:
ID value value_mean_diff
1 4 ((4-5)^2 + (4-7)^2) / 3   (group size = 3)
1 5 ((5-4)^2 + (5-7)^2) / 3
1 7 ((7-4)^2 + (7-5)^2) / 3
2 8 ((8-6)^2 + (8-5)^2 + (8-6)^2) / 4
2 6 ((6-8)^2 + (6-5)^2 + (6-6)^2) / 4
2 5 ((5-8)^2 + (5-6)^2 + (5-6)^2) / 4
2 6 ((6-8)^2 + (6-6)^2 + (6-5)^2) / 4
I tried using aggregate() but could not make it work.
Recommended answer
A solution using a cross join (CJ) from the data.table library. Its drawback is that it removes the duplicated row from the original data frame:
> dt <- setDT(df)[,setNames(CJ(value, value), c("value", "value1")), .(ID)][,.(value_mean_diff = sum((value-value1)^2)/.N),.(ID, value)]
> dt
ID value value_mean_diff
1: 1 4 3.333333
2: 1 5 1.666667
3: 1 7 4.333333
4: 2 5 2.750000
5: 2 6 1.250000
6: 2 8 4.250000
Since duplicated rows always have the same value_mean_diff, you can merge the result with the original data to get all the duplicated rows back.
> merge(dt, df, by = c("ID", "value"))
ID value value_mean_diff
1: 1 4 3.333333
2: 1 5 1.666667
3: 1 7 4.333333
4: 2 5 2.750000
5: 2 6 1.250000
6: 2 6 1.250000
7: 2 8 4.250000
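For comparison, the same quantity can be sketched in base R without the cross join and merge. This is only an illustration, not part of the original answer: the helper name mean_sq_diff is hypothetical, and the data frame is rebuilt from the question's example. Because ave() returns one result per input row, duplicated rows are kept automatically. Note that outer() still builds an n-by-n matrix per group, so it is no lighter on memory for large groups.

```r
# Example data from the question
df <- data.frame(ID = c(1, 1, 1, 2, 2, 2, 2),
                 value = c(4, 5, 7, 8, 6, 5, 6))

# Hypothetical helper (not from the original answer): for each element of x,
# sum the squared differences to every element of x, then divide by length(x).
# The difference of an element to itself is 0, so it does not affect the sum.
mean_sq_diff <- function(x) rowSums(outer(x, x, "-")^2) / length(x)

# ave() applies the helper within each ID and returns values in row order
df$value_mean_diff <- ave(df$value, df$ID, FUN = mean_sq_diff)
df
```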
Update: Since the above method is memory intensive, you can take advantage of the fact that value_mean_diff = (value - value_mean)^2 + variance(value), where variance(value) is the population variance of the group. You can prove this by expanding the variance from its definition. With this, you can calculate it as follows:
> setDT(df)[, value_mean_diff := (value - mean(value))^2 + var(value) * (.N - 1) / .N, .(ID)]
> df
ID value value_mean_diff
1: 1 4 3.333333
2: 1 5 1.666667
3: 1 7 4.333333
4: 2 8 4.250000
5: 2 6 1.250000
6: 2 5 2.750000
7: 2 6 1.250000
Keep in mind that the var() function in R calculates the sample variance, so you need to convert it to the population variance by multiplying by the factor (n-1)/n.
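The identity used in the update can be checked with a short derivation. The notation here is introduced only for illustration: x_1, ..., x_n are the values in one group, \bar{x} is their mean, \sigma^2 is the population variance, and s^2 is the sample variance that var() returns.

```latex
\begin{align*}
\frac{1}{n}\sum_{j=1}^{n}(x_i - x_j)^2
  &= \frac{1}{n}\sum_{j=1}^{n}\bigl[(x_i - \bar{x}) - (x_j - \bar{x})\bigr]^2 \\
  &= (x_i - \bar{x})^2
     - 2\,(x_i - \bar{x})\,\frac{1}{n}\sum_{j=1}^{n}(x_j - \bar{x})
     + \frac{1}{n}\sum_{j=1}^{n}(x_j - \bar{x})^2 \\
  &= (x_i - \bar{x})^2 + \sigma^2,
  \qquad\text{since } \sum_{j=1}^{n}(x_j - \bar{x}) = 0, \\
\sigma^2 &= \frac{n-1}{n}\,s^2 .
\end{align*}
```

The term with j = i contributes zero to the sum, so including it changes nothing, which is why dividing by the full group size n matches the definition in the question.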