估算缺失的数据,同时迫使相关系数保持不变 [英] Impute missing data, while forcing correlation coefficient to remain the same

查看:50
本文介绍了估算缺失的数据,同时迫使相关系数保持不变的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

考虑以下(excel)数据集:

Consider the following (excel) dataset:

m   |   r
----|------
2.0 | 3.3
0.8 |   
    | 4.0
1.3 |   
2.1 | 5.2
    | 2.3
    | 1.9
2.5 | 
1.2 | 3.0
2.0 | 2.6

我的目标是使用以下条件填充缺少的值:

将以上两列之间的成对相关性表示为R(约0.68).填写空单元格后,在 之后将相关性表示为R *.在表格中进行填充,使(R-R *)^ 2 = 0 .这是我要保持数据的相关结构完整.

Denote as R the pairwise correlation between the above two columns (around 0.68). Denote as R* the correlation after the empty cells have been filled in. Fill in the table so that (R - R*)^2 = 0. This is, I want to keep the correlation structure of the data intact.

到目前为止,我已经使用Matlab完成了

So far I have done it using Matlab:

clear all;

m = xlsread('data.xlsx','A2:A11') ;
r = xlsread('data.xlsx','B2:B11') ;

rho = corr(m,r,'rows','pairwise');

x0 = [1,1,1,1,1,1];
lb = [0,0,0,0,0,0];
f = @(x)my_correl(x,rho);

SOL = fmincon(f,x0,[],[],[],[],lb)

其中函数my_correl是:

function X = my_correl(x,rho)

sum_m = (11.9 + x(1) + x(2) + x(3));
sum_r = (22.3 + x(1) + x(2) + x(3));
avg_m = (11.9 + x(1) + x(2) + x(3))/8;
avg_r = (22.3 + x(4) + x(5) + x(6))/8;
rho_num = 8*(26.32 + 4*x(1) + 2.3*x(2) + 1.9*x(3) + 0.8*x(4) + 1.3*x(5) + 2.5*x(6)) - sum_m*sum_r;
rho_den = sqrt(8*(22.43 + (4*x(1))^2 + (2.3*x(2))^2 + (1.9*x(3))^2) - sum_m^2)*sqrt(8*(78.6 + (0.8*x(4))^2 + (1.3*x(5))^ + (2.5*x(6))^2) - sum_r^2);

X = (rho - rho_num/rho_den)^2;

end

此函数手动计算相关性,其中每个丢失的数据都是变量x(i).

This function computes the correlation manually, where every missing data is a variable x(i).

问题:我的实际数据集中有超过20,000个观察值.我无法手动创建该rho公式.

The problem: my actual dataset has more than 20,000 observations. There is no way I can create that rho formula manually.

如何填写数据集?

注1:我愿意使用Python,Julia或R等替代语言.这只是我的默认语言.

Note 1: I am open to use alternative languages like Python, Julia, or R. Matlab it's just my default one.

注2:答案将得到100分的赏金.从现在开始应许.

Note 2: a 100 points bounty will be awarded to the answer. Promise from now.

推荐答案

这是我将采用的方法,并提供了R中的实现:

This is how I would approach it, with an implementation in R provided:

没有一个唯一的解决方案来估算丢失的数据点,以使完整(估算)数据的成对相关性等于不完整数据的成对相关性.因此,要找到一个好的"解决方案,而不仅仅是任何"解决方案,我们可以引入一个附加的标准,即完整的估算数据还应与原始数据共享相同的线性回归.这使我们找到了一种相当简单的方法.

There is not a unique solution for imputing the missing data points, such that the pairwise correlation of the complete (imputed) data is equal to the pairwise correlation of the incomplete data. So to find a 'good' solution rather than just 'any' solution, we can introduce an additional criteria that the complete imputed data should also share the same linear regression with the original data. This leads us to a fairly simple approach.

  1. 为原始数据计算线性回归模型.
  2. 找到归因值以寻找恰好位于此回归线上的缺失值.
  3. 为该回归线附近的估算值生成随机残差散布
  4. 缩放推算残差,以强制完整推算数据的相关性等于原始数据的相关性

R中的解决方案:

library(data.table)
set.seed(123)

rho = cor(dt$m,dt$r,'pairwise')

# calculate linear regression of original data
fit1 = lm(r ~ m, data=dt)
fit2 = lm(m ~ r, data=dt)
# extract the standard errors of regression intercept (in each m & r direction)
# and multiply s.e. by sqrt(n) to get standard deviation 
sd1 = summary(fit1)$coefficients[1,2] * sqrt(dt[!is.na(r), .N])
sd2 = summary(fit2)$coefficients[1,2] * sqrt(dt[!is.na(m), .N])

# find where data points with missing values lie on the regression line
dt[is.na(r), r.imp := coefficients(fit1)[1] + coefficients(fit1)[2] * m] 
dt[is.na(m), m.imp := coefficients(fit2)[1] + coefficients(fit2)[2] * r]

# generate randomised residuals for the missing data, using the s.d. calculated above
dt[is.na(r), r.ran := rnorm(.N, sd=sd1)] 
dt[is.na(m), m.ran := rnorm(.N, sd=sd2)] 

# function that scales the residuals by a factor x, then calculates how close correlation of imputed data is to that of original data
obj = function(x, dt, rho) {
  dt[, r.comp := r][, m.comp := m]
  dt[is.na(r), r.comp := r.imp + r.ran*x] 
  dt[is.na(m), m.comp := m.imp + m.ran*x] 
  rho2 = cor(dt$m.comp, dt$r.comp,'pairwise')
  (rho-rho2)^2
}

# find the value of x that minimises the discrepencay of imputed versus original correlation
fit = optimize(obj, c(-5,5), dt, rho)

x=fit$minimum
dt[, r.comp := r][, m.comp := m]
dt[is.na(r), r.comp := r.imp + r.ran*x] 
dt[is.na(m), m.comp := m.imp + m.ran*x] 
rho2 = cor(dt$m.comp, dt$r.comp,'pairwise')
(rho-rho2)^2  # check that rho2 is approximately equal to rho

作为最终检查,计算完整估算数据的线性回归 并绘图以显示回归线与原始数据相同.请注意,下面的图适用于下面显示的大数据集,以演示此方法在大数据上的使用.

As a final check, calculate linear regression of the complete imputed data and plot to show that regression line is same as for original data. Note that the plot below was for the large data set shown below, to demonstrate use of this method on large data.

fit.comp = lm(r.comp ~ m.comp, data=dt)
plot(dt$m.comp, dt$r.comp)
points(dt$m, dt$r, col="red")
abline(fit1, col="green")
abline(fit.comp, col="blue")
mtext(paste(" Rho =", round(rho,5)), at=-1)
mtext(paste(" Rho2 =", round(rho2, 5)), at=6)

数据

OP示例中的原始玩具数据:

original toy data from OP example:

dt=structure(list(m = c(2, 0.8, NA, 1.3, 2.1, NA, NA, 2.5, 1.2, 2), 
                  r = c(3.3, NA, 4, NA, 5.2, 2.3, 1.9, NA, 3, 2.6)), 
             .Names = c("m", "r"), row.names = c(NA, -10L), 
             class = c("data.table", "data.frame"))

更大的数据集可以在大数据上展示

A larger data set to demonstrate on big data

dt = data.table(m=rnorm(1e5, 3, 2))[, r:=1.5 + 1.1*m + rnorm(1e5,0,2)]
dt[sample(.N, 3e4), m:=NA]
dt[sample(which(!is.na(m)), 3e4), r := NA]

这篇关于估算缺失的数据,同时迫使相关系数保持不变的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆