多个分类变量之间的相关性(Pandas) [英] Correlation among multiple categorical variables (Pandas)

查看:521
本文介绍了多个分类变量之间的相关性(Pandas)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个由 22 个分类变量(无序)组成的数据集.我想在一个漂亮的热图中可视化它们的相关性.由于 Pandas 内置函数

I have a data set made of 22 categorical variables (non-ordered). I would like to visualize their correlation in a nice heatmap. Since the Pandas built-in function

DataFrame.corr(method='pearson', min_periods=1)

只实现数值变量(Pearson、Kendall、Spearman)的相关系数,我必须自己聚合它来执行卡方或类似的东西,我不太确定在 中使用哪个函数来做它一个优雅的步骤(而不是遍历所有 cat1*cat2 对).明确地说,这就是我想要的结果(数据框):

only implement correlation coefficients for numerical variables (Pearson, Kendall, Spearman), I have to aggregate it myself to perform a chi-square or something like it and I am not quite sure which function use to do it in one elegant step (rather than iterating through all the cat1*cat2 pairs). To be clear, this is what I would like to end up with (a dataframe):

         cat1  cat2  cat3  
  cat1|  coef  coef  coef  
  cat2|  coef  coef  coef
  cat3|  coef  coef  coef

任何关于 pd.pivot_table 或类似内容的想法?

Any ideas with pd.pivot_table or something in the same vein?

提前致谢D.

推荐答案

您可以使用 pd.factorize

df.apply(lambda x : pd.factorize(x)[0]).corr(method='pearson', min_periods=1)
Out[32]: 
     a    c    d
a  1.0  1.0  1.0
c  1.0  1.0  1.0
d  1.0  1.0  1.0

数据输入

df=pd.DataFrame({'a':['a','b','c'],'c':['a','b','c'],'d':['a','b','c']})

更新

from scipy.stats import chisquare

df=df.apply(lambda x : pd.factorize(x)[0])+1

pd.DataFrame([chisquare(df[x].values,f_exp=df.values.T,axis=1)[0] for x in df])

Out[123]: 
     0    1    2    3
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
3  0.0  0.0  0.0  0.0

df=pd.DataFrame({'a':['a','d','c'],'c':['a','b','c'],'d':['a','b','c'],'e':['a','b','c']})

这篇关于多个分类变量之间的相关性(Pandas)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆