Reducing correlation of datasets with NA


Question

Consider the following sample data:

a=c(NA,1,NA)
b=c(1,2,4)
c=c(0,1,0)
d=c(1,2,4)
df=data.frame(a,b,c,d)

The objective is to find the correlation between two columns in such a way that NA reduces the correlation. Here NA means that an event did not take place.

Is there a way to use NA in the correlation such that it pulls down the value of the correlation?

> cor(df$a, df$b)
[1] NA 

Or should I be looking at some other mathematical function?

Recommended answer

The question doesn't make mathematical sense as posed: there is no correlation between events that didn't happen, so an event not taking place cannot reduce a correlation. There is no function that does this; the only option is to transform the data.

You may replace the NA values with something like what @Ujjwal Kumar has suggested, but this is just data manipulation and not a predefined function.
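A minimal sketch of that kind of transformation, assuming NA is recoded as 0 to stand for "no event". The value 0 and the name a_filled are assumptions made here for illustration and are not necessarily what @Ujjwal Kumar proposed.

```r
a <- c(NA, 1, NA)
b <- c(1, 2, 4)

# Assumption: NA means "no event", recoded here as 0.
a_filled <- ifelse(is.na(a), 0, a)

# With no NA left, cor() returns a number instead of NA.
r_filled <- cor(a_filled, b)
r_filled
```

After recoding, a is identical to column c in the sample data, so the result matches cor(df$c, df$b). Whether the correlation goes up or down depends entirely on the number chosen to stand in for NA, which is why this is data manipulation rather than a property of cor.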

Look at the help file for cor (?cor). With a call such as cor(df$a, df$b, use = "pairwise.complete.obs") you can see how NA values are usually treated: they are simply removed and have no impact on the correlation itself.
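To illustrate that removal, here is a small sketch with hypothetical vectors x and y. They are not from the question, whose column a has only one non-NA value and therefore leaves too few pairs to estimate a correlation.

```r
# Hypothetical data: one missing value in x.
x <- c(1, 2, NA, 4, 6)
y <- c(2, 3, 9, 4, 8)

# Pairs that contain an NA are dropped before computing the correlation.
r_pairwise <- cor(x, y, use = "pairwise.complete.obs")

# The same result, computed by removing the incomplete pairs by hand.
ok <- !is.na(x) & !is.na(y)
r_manual <- cor(x[ok], y[ok])

all.equal(r_pairwise, r_manual)
```

The y value paired with the NA (9 here) plays no role in the result, so the NA neither raises nor lowers the correlation.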


If use is "everything", NAs will propagate conceptually, i.e., a resulting value will be NA whenever one of its contributing observations is NA.

If use is "all.obs", then the presence of missing observations will produce an error. If use is "complete.obs" then missing values are handled by casewise deletion (and if there are no complete cases, that gives an error).

"na.or.complete" is the same unless there are no complete cases, that gives NA. Finally, if use has the value "pairwise.complete.obs" then the correlation or covariance between each pair of variables is computed using all complete pairs of observations on those variables. This can result in covariance or correlation matrices which are not positive semi-definite, as well as NA entries if there are no complete pairs for that pair of variables. For cov and var, "pairwise.complete.obs" only works with the "pearson" method. Note that (the equivalent of) var(double(0), use = *) gives NA for use = "everything" and "na.or.complete", and gives an error in the other cases.
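The options quoted above can be compared side by side. The sketch below uses a hypothetical data frame m and vectors u and v that are not part of the question; errors are captured with tryCatch so the script runs to the end.

```r
# Hypothetical data: x and z each have one missing value, in different rows.
m <- data.frame(
  x = c(1, 2, 3, 4, NA),
  y = c(2, 1, 4, 3, 5),
  z = c(NA, 1, 2, 4, 3)
)

# "everything" (the default): the NA in x propagates to the result.
r_everything <- cor(m$x, m$y, use = "everything")

# "all.obs": missing observations raise an error, captured here.
r_all_obs <- tryCatch(cor(m$x, m$y, use = "all.obs"),
                      error = function(e) e)

# "complete.obs": casewise deletion, only rows 2 to 4 are complete
# across all three columns.
r_complete <- cor(m, use = "complete.obs")

# "pairwise.complete.obs": each pair of columns uses its own complete
# rows, so x and y are compared on rows 1 to 4.
r_pairwise <- cor(m, use = "pairwise.complete.obs")

# With no complete cases at all, "na.or.complete" gives NA while
# "complete.obs" gives an error.
u <- c(NA, 1, 2)
v <- c(1, NA, NA)
r_na_or_complete <- cor(u, v, use = "na.or.complete")
r_complete_none <- tryCatch(cor(u, v, use = "complete.obs"),
                            error = function(e) e)

r_complete["x", "y"]
r_pairwise["x", "y"]
```

The two printed entries differ because casewise and pairwise deletion keep different rows for the same pair of columns. In none of these modes does an NA act as a penalty on the correlation: it is either propagated, rejected, or dropped.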
