Calculate mean difference per row and per group

Problem description

I have a data.frame with many rows and columns, and I want to calculate, for each value, the mean squared difference between that value and every other value within its group.
Here is an example:

 ID  value 
 1    4 
 1    5 
 1    7 
 2    8 
 2    6 
 2    5 
 2    6

This is what I want to calculate:

ID  value  value_mean_diff
 1    4     ((4-5)^2 + (4-7)^2) / 3               (group size = 3)
 1    5     ((5-4)^2 + (5-7)^2) / 3
 1    7     ((7-4)^2 + (7-5)^2) / 3
 2    8     ((8-6)^2 + (8-5)^2 + (8-6)^2) / 4     (group size = 4)
 2    6     ((6-8)^2 + (6-5)^2 + (6-6)^2) / 4
 2    5     ((5-8)^2 + (5-6)^2 + (5-6)^2) / 4
 2    6     ((6-8)^2 + (6-6)^2 + (6-5)^2) / 4

I tried using aggregate() but could not get it to work.

Recommended answer

A solution using a cross join (CJ()) from the data.table package. Its drawback is that duplicated rows of the original data frame are collapsed into one:

> library(data.table)
> dt <- setDT(df)[, setNames(CJ(value, value), c("value", "value1")), .(ID)][
+     , .(value_mean_diff = sum((value - value1)^2) / .N), .(ID, value)]
> dt
   ID value value_mean_diff
1:  1     4        3.333333
2:  1     5        1.666667
3:  1     7        4.333333
4:  2     5        2.750000
5:  2     6        1.250000
6:  2     8        4.250000

Since duplicated rows always have the same value_mean_diff, you can merge the result back onto the original data to recover all the duplicated rows.

> merge(dt, df, by = c("ID", "value"))
   ID value value_mean_diff
1:  1     4        3.333333
2:  1     5        1.666667
3:  1     7        4.333333
4:  2     5        2.750000
5:  2     6        1.250000
6:  2     6        1.250000
7:  2     8        4.250000

Update: Since the above method is memory intensive (the cross join builds every pair of values within a group), you can instead use the fact that value_mean_diff = (value - value_mean)^2 + variance(value), where variance(value) is the population variance of the group. You can prove this by expanding the variance from its definition. With this identity, the calculation becomes:

> setDT(df)[, value_mean_diff := (value - mean(value))^2 + var(value) * (.N - 1) / .N, .(ID)]
> df
   ID value value_mean_diff
1:  1     4        3.333333
2:  1     5        1.666667
3:  1     7        4.333333
4:  2     8        4.250000
5:  2     6        1.250000
6:  2     5        2.750000
7:  2     6        1.250000
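
The identity used in the update can be checked with a short derivation. The notation here is introduced only for this sketch and does not come from the original answer: x_1, ..., x_n are the values in one group, \bar{x} is their mean, \sigma^2 is their population variance, and d_i is the value_mean_diff of x_i (the self-difference term is zero, so including j = i in the sum changes nothing).

```latex
\begin{align*}
d_i &= \frac{1}{n}\sum_{j=1}^{n}(x_i - x_j)^2
     = \frac{1}{n}\sum_{j=1}^{n}\bigl[(x_i - \bar{x}) - (x_j - \bar{x})\bigr]^2 \\
    &= (x_i - \bar{x})^2
       - \frac{2\,(x_i - \bar{x})}{n}\sum_{j=1}^{n}(x_j - \bar{x})
       + \frac{1}{n}\sum_{j=1}^{n}(x_j - \bar{x})^2 \\
    &= (x_i - \bar{x})^2 + \sigma^2,
\end{align*}
% The middle term vanishes because \sum_j (x_j - \bar{x}) = 0, and
% \sigma^2 = \frac{1}{n}\sum_j (x_j - \bar{x})^2 is the population variance.
```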

Keep in mind that the var() function in R calculates the sample variance, so you need to convert it to the population variance by multiplying by the factor (n-1)/n.
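
As a sanity check, the closed form can be compared against a brute-force computation in base R. This is only a minimal sketch on the example data from the question. The helper names brute_force and closed_form are made up for illustration and are not part of the original answer.

```r
# Example data from the question
df <- data.frame(ID    = c(1, 1, 1, 2, 2, 2, 2),
                 value = c(4, 5, 7, 8, 6, 5, 6))

# Brute force (hypothetical helper): for each value, sum the squared
# differences to every value in the group (the self-difference is 0)
# and divide by the group size.
brute_force <- function(x) rowSums(outer(x, x, "-")^2) / length(x)

# Closed form (hypothetical helper): squared deviation from the group mean
# plus the population variance. var() returns the sample variance, hence
# the (n - 1) / n factor.
closed_form <- function(x) {
  n <- length(x)
  (x - mean(x))^2 + var(x) * (n - 1) / n
}

# ave() applies the function within each ID group and keeps the row order
df$brute  <- ave(df$value, df$ID, FUN = brute_force)
df$closed <- ave(df$value, df$ID, FUN = closed_form)

all.equal(df$brute, df$closed)
```

Both columns should agree with the value_mean_diff column shown above. Note that for a group with a single row, var() returns NA, so the closed form would need special handling in that case.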
