Group of highly correlated variables

Views: 108 | Published: 2020/10/10 1:42:28 | Tags: r, grouping, correlation


Question

I have a data frame and I want to find which groups of variables share the highest correlations. For example:

mydata <- structure(list(V1 = c(1L, 2L, 5L, 4L, 366L, 65L, 43L, 456L, 876L, 78L, 687L, 378L, 378L, 34L, 53L, 43L), 
                         V2 = c(2L, 2L, 5L, 4L, 366L, 65L, 43L, 456L, 876L, 78L, 687L, 378L, 378L, 34L, 53L, 41L), 
                         V3 = c(10L, 20L, 10L, 20L, 10L, 20L, 1L, 0L, 1L, 2010L,20L, 10L, 10L, 10L, 10L, 10L), 
                         V4 = c(2L, 10L, 31L, 2L, 2L, 5L, 2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L, 2L, 1L), 
                         V5 = c(4L, 10L, 31L, 2L, 2L, 5L, 2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L, 2L, 3L)), 
                    .Names = c("V1", "V2", "V3", "V4", "V5"), 
                    class = "data.frame", row.names = c(NA,-16L))

I can calculate the correlations and find each pair whose correlation is above a threshold:

var.corelation <- cor(as.matrix(mydata), method="pearson")

fin.corr = as.data.frame( as.table( var.corelation ) )
combinations_1 = combn( colnames( var.corelation ) , 2 , FUN = function( x )  paste( x , collapse = "_" ) )
fin.corr = fin.corr[ fin.corr$Var1 != fin.corr$Var2 , ]

fin.corr = fin.corr [order(fin.corr$Freq, decreasing = TRUE) , ,drop = FALSE]

fin.corr = fin.corr[ paste( fin.corr$Var1 , fin.corr$Var2 , sep = "_" ) %in% combinations_1 , ]

fin.corr <- fin.corr[fin.corr$Freq > 0.62, ]

fin.corr <- fin.corr[order(fin.corr$Var1, fin.corr$Var2), ]
fin.corr

The output so far is:

Var1 Var2      Freq
V1   V2      0.9999978
V3   V4      0.6212136
V3   V5      0.6220380
V4   V5      0.9992690

Here V1 and V2 form one group, while V3, V4, and V5 form another, and within each group every pair of variables has a correlation above the threshold. I want to get these two groups of variables as a list. For example:

list(c("V1", "V2"), c("V3", "V4", "V5"))
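As a side note, the pair-filtering step in the question can be written more compactly in base R by masking the correlation matrix with `upper.tri()`. This is only a sketch of an alternative, not the asker's code; the names `cor_mat`, `threshold`, and `high_pairs` are introduced here for illustration.

```r
# Same data as in the question
mydata <- data.frame(
  V1 = c(1L, 2L, 5L, 4L, 366L, 65L, 43L, 456L, 876L, 78L, 687L, 378L, 378L, 34L, 53L, 43L),
  V2 = c(2L, 2L, 5L, 4L, 366L, 65L, 43L, 456L, 876L, 78L, 687L, 378L, 378L, 34L, 53L, 41L),
  V3 = c(10L, 20L, 10L, 20L, 10L, 20L, 1L, 0L, 1L, 2010L, 20L, 10L, 10L, 10L, 10L, 10L),
  V4 = c(2L, 10L, 31L, 2L, 2L, 5L, 2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L, 2L, 1L),
  V5 = c(4L, 10L, 31L, 2L, 2L, 5L, 2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L, 2L, 3L)
)

cor_mat <- cor(mydata, method = "pearson")
threshold <- 0.62  # same cutoff as in the question

# Keep only the upper triangle so each pair appears once and the
# diagonal (self-correlation) is excluded
idx <- which(upper.tri(cor_mat) & cor_mat > threshold, arr.ind = TRUE)

high_pairs <- data.frame(
  Var1 = rownames(cor_mat)[idx[, "row"]],
  Var2 = colnames(cor_mat)[idx[, "col"]],
  Freq = cor_mat[idx],
  stringsAsFactors = FALSE
)
high_pairs <- high_pairs[order(high_pairs$Var1, high_pairs$Var2), ]
high_pairs
```

The result has the same columns as `fin.corr` in the question, so it can feed the same downstream steps.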

Recommended answer

Use graph theory and the igraph package.

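library(igraph)

var.corelation <- cor(as.matrix(mydata), method = "pearson")

# Zero out the upper triangle and the diagonal so each pair is counted only once
var.corelation <- var.corelation * lower.tri(var.corelation)
# Row/column indices of the pairs whose correlation exceeds the threshold
check.corelation <- which(var.corelation > 0.62, arr.ind = TRUE)

# Build an undirected graph with one edge per highly correlated pair.
# graph_from_data_frame() and components() are the current igraph names
# for graph.data.frame() and clusters().
graph.cor <- graph_from_data_frame(check.corelation, directed = FALSE)
# Each connected component of the graph is one group of variables
groups.cor <- split(unique(as.vector(check.corelation)), components(graph.cor)$membership)
lapply(groups.cor, FUN = function(list.cor) rownames(var.corelation)[list.cor])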

which returns:

$`1`
[1] "V1" "V2"

$`2`
[1] "V3" "V4" "V5"
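Note that connected components only guarantee that each variable is linked to the rest of its group through a chain of highly correlated pairs, not that every pair inside a group exceeds the threshold. A minimal base R sketch of that check follows, assuming the two groups shown above; the helper `all_pairs_above()` is a hypothetical name introduced for illustration.

```r
# Same data as in the question
mydata <- data.frame(
  V1 = c(1L, 2L, 5L, 4L, 366L, 65L, 43L, 456L, 876L, 78L, 687L, 378L, 378L, 34L, 53L, 43L),
  V2 = c(2L, 2L, 5L, 4L, 366L, 65L, 43L, 456L, 876L, 78L, 687L, 378L, 378L, 34L, 53L, 41L),
  V3 = c(10L, 20L, 10L, 20L, 10L, 20L, 1L, 0L, 1L, 2010L, 20L, 10L, 10L, 10L, 10L, 10L),
  V4 = c(2L, 10L, 31L, 2L, 2L, 5L, 2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L, 2L, 1L),
  V5 = c(4L, 10L, 31L, 2L, 2L, 5L, 2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L, 2L, 3L)
)
cor_mat <- cor(mydata, method = "pearson")

# Groups as returned by the connected-components approach (copied by hand)
groups <- list(c("V1", "V2"), c("V3", "V4", "V5"))

# Hypothetical helper: TRUE if every pair within `vars` exceeds `threshold`
all_pairs_above <- function(vars, cor_mat, threshold) {
  sub <- cor_mat[vars, vars, drop = FALSE]
  all(sub[upper.tri(sub)] > threshold)
}

fully_linked <- vapply(groups, all_pairs_above, logical(1),
                       cor_mat = cor_mat, threshold = 0.62)
fully_linked
```

A group that fails this check is a chain rather than a fully connected set, in which case the stricter requirement from the question would call for a different grouping than connected components.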

I would also look at my comment on the question. For me that approach gives better insight, because a variable may have correlations below your (arbitrary) cutoff and still be genuinely associated with a cluster.
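The comment referred to is not reproduced on this page. As one possible way to group variables without a fixed correlation cutoff (an assumption here, not necessarily what the comment proposed), the variables can be clustered hierarchically on a correlation-based distance using base R's `hclust()`. The distance `1 - cor`, complete linkage, and `k = 2` groups are illustrative choices.

```r
# Same data as in the question
mydata <- data.frame(
  V1 = c(1L, 2L, 5L, 4L, 366L, 65L, 43L, 456L, 876L, 78L, 687L, 378L, 378L, 34L, 53L, 43L),
  V2 = c(2L, 2L, 5L, 4L, 366L, 65L, 43L, 456L, 876L, 78L, 687L, 378L, 378L, 34L, 53L, 41L),
  V3 = c(10L, 20L, 10L, 20L, 10L, 20L, 1L, 0L, 1L, 2010L, 20L, 10L, 10L, 10L, 10L, 10L),
  V4 = c(2L, 10L, 31L, 2L, 2L, 5L, 2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L, 2L, 1L),
  V5 = c(4L, 10L, 31L, 2L, 2L, 5L, 2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L, 2L, 3L)
)
cor_mat <- cor(mydata, method = "pearson")

# Turn correlations into distances: highly correlated variables are close
cor_dist <- as.dist(1 - cor_mat)

# Agglomerative clustering of the variables (not the observations)
tree <- hclust(cor_dist, method = "complete")

# Cut the tree into k groups instead of applying a correlation cutoff
membership <- cutree(tree, k = 2)
groups <- unname(split(names(membership), membership))
groups
```

Calling `plot(tree)` draws the dendrogram, which shows how strongly each variable is attached to its group and can help in choosing `k`.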
