Calculate chi-square between pairs of columns
Question
I want to calculate a chi-squared test statistic between pairs of columns in a pandas DataFrame. It seems like there must be a way to do this in a similar fashion to pandas.corr.
If I have the following dataframe:
df = pd.DataFrame([['a', 'x', 'a'],
                   ['b', 'z', 'a'],
                   ['a', 'x', 'a']],
                  columns=['ll', 'kk', 'jj'],
                  index=['nn', 'oo', 'pp'])
I would like to be able to do:
df.corr('chisquare')
Though this will obviously fail. If the dataframe were numeric rather than categorical, I could simply do df.corr() and pass either spearman or pearson. There must be a way of calculating chi-squared between all of the columns as well.
So the output (using scipy.stats.chi2_contingency) would be:
ll kk jj
ll 0.0000 0.1875 0.0
kk 0.1875 0.0000 0.0
jj 0.0000 0.0000 0.0
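For a single pair of columns, the entries in that matrix can be reproduced directly with pd.crosstab plus scipy.stats.chi2_contingency. A minimal sketch using the example df above:

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame([['a', 'x', 'a'],
                   ['b', 'z', 'a'],
                   ['a', 'x', 'a']],
                  columns=['ll', 'kk', 'jj'],
                  index=['nn', 'oo', 'pp'])

# Build the contingency table for one pair of columns...
table = pd.crosstab(df['ll'], df['kk'])

# ...and run the chi-squared test on it; the statistic is the
# first value returned (Yates' continuity correction is applied
# by default for 2x2 tables, which is where 0.1875 comes from)
chi2_stat, p_value, dof, expected = chi2_contingency(table)
print(chi2_stat)  # ~0.1875, matching the ll/kk entry above
```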
Am I just missing something, or is this not possible without coding each step of the process individually? I am looking for something like pd.corr but with categorical data.
To clear up any confusion about what I'm currently doing to get the resulting matrix:
import pandas as pd
from itertools import combinations
from scipy.stats import chi2_contingency

def get_corr_mat(df, f=chi2_contingency):
    columns = df.columns
    dm = pd.DataFrame(index=columns, columns=columns)
    for var1, var2 in combinations(columns, 2):
        cont_table = pd.crosstab(df[var1], df[var2], margins=False)
        chi2_stat = f(cont_table)[0]  # the test statistic is the first value returned
        dm.loc[var2, var1] = chi2_stat
        dm.loc[var1, var2] = chi2_stat
    dm.fillna(0, inplace=True)  # the untouched diagonal defaults to 0
    return dm

get_corr_mat(df)
As I've stated previously, this does work, though it can get slow and is untested. A pandas method would be much preferable.
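There is a partial pandas shortcut worth noting: since pandas 0.24, DataFrame.corr accepts a callable as its method argument. It only operates on numeric columns, so the categorical columns have to be factorised into integer codes first, and pandas documents that it forces the diagonal to 1.0 regardless of the callable, so this is a sketch rather than an exact replacement for the loop above:

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame([['a', 'x', 'a'],
                   ['b', 'z', 'a'],
                   ['a', 'x', 'a']],
                  columns=['ll', 'kk', 'jj'],
                  index=['nn', 'oo', 'pp'])

# Encode each categorical column as integer codes so corr() will accept it
codes = df.apply(lambda s: pd.factorize(s)[0])

# The callable receives two 1-D arrays; rebuild the contingency table
# for each pair and return the chi-squared statistic
chi2_mat = codes.corr(method=lambda a, b: chi2_contingency(pd.crosstab(a, b))[0])
print(chi2_mat)
```

The result is symmetric by construction, but the diagonal will read 1.0 rather than 0.0, so postprocess it if that matters.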
Answer
Alternate Method 1
Another way to find the chi-squared test statistic between pairs of columns, along with a heatmap visualisation:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency

def ch_calculate(df):
    factors_paired = [(i, j) for i in df.columns.values for j in df.columns.values]
    chi2, p_values = [], []
    for f in factors_paired:
        if f[0] != f[1]:
            chitest = chi2_contingency(pd.crosstab(df[f[0]], df[f[1]]))
            chi2.append(chitest[0])
            p_values.append(chitest[1])
        else:  # a factor paired with itself
            chi2.append(0)
            p_values.append(0)
    chi2 = np.array(chi2).reshape((len(df.columns), len(df.columns)))  # shape it as a matrix
    chi2 = pd.DataFrame(chi2, index=df.columns.values, columns=df.columns.values)  # then a df for convenience
    fig, ax = plt.subplots(figsize=(30, 30))
    sns.heatmap(chi2, annot=True)
    plt.show()

ch_calculate(df_categorical)
Where df_categorical is a dataframe with all the nominal input variables of a dataset. For ordinal categorical variables I think it is better to use .corr(method='spearman') (the Spearman rank correlation coefficient).
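A hedged sketch of that ordinal case (the column names and data here are made up for illustration): declare the category order explicitly, encode the categories as their ordered codes, then run .corr(method='spearman') on the codes:

```python
import pandas as pd

# Hypothetical ordinal data: the category order must be stated explicitly
df_ord = pd.DataFrame({
    'size': pd.Categorical(['S', 'M', 'L', 'M', 'L'],
                           categories=['S', 'M', 'L'], ordered=True),
    'rating': [1, 2, 3, 2, 3],
})

# .cat.codes maps S->0, M->1, L->2, respecting the declared order
df_ord['size_code'] = df_ord['size'].cat.codes
rho = df_ord[['size_code', 'rating']].corr(method='spearman')
print(rho.loc['size_code', 'rating'])  # 1.0: the two rankings agree perfectly
```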
I also came across this Cramér's V implementation for finding the degree of association between categorical variables: Categorical features correlation. Using it, I created another function that builds a heatmap visualisation to find correlated categorical columns (with Cramér's V the heatmap values run from 0 to 1, where 0 means no association and 1 means high association):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as ss
from itertools import combinations

def cramers_v(confusion_matrix):
    """Calculate Cramér's V statistic for categorical-categorical association.

    Uses the bias correction from Bergsma and Wicher,
    Journal of the Korean Statistical Society 42 (2013): 323-328.
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))

def get_corr_mat(df):
    columns = df.columns
    dm = pd.DataFrame(index=columns, columns=columns)
    for var1, var2 in combinations(columns, 2):
        cont_table = pd.crosstab(df[var1], df[var2], margins=False)
        v = cramers_v(cont_table.values)
        dm.loc[var2, var1] = v
        dm.loc[var1, var2] = v
    dm.fillna(1, inplace=True)  # each variable is perfectly associated with itself
    return dm

cat_corr = get_corr_mat(df_categorical)
fig, ax = plt.subplots(figsize=(30, 30))
sns.heatmap(cat_corr, annot=True)
plt.show()
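As a cross-check on the cramers_v implementation above: recent SciPy (1.7+) ships a plain, bias-uncorrected Cramér's V directly as scipy.stats.contingency.association. A minimal sketch on two extreme tables:

```python
import numpy as np
from scipy.stats.contingency import association

# A perfectly associated 2x2 table and an independent one
perfect = np.array([[10, 0], [0, 10]])
independent = np.array([[5, 5], [5, 5]])

# method='cramer' gives the uncorrected Cramér's V;
# correction=False disables the Yates continuity correction
v_perfect = association(perfect, method='cramer', correction=False)
v_independent = association(independent, method='cramer', correction=False)
print(v_perfect)      # 1.0
print(v_independent)  # 0.0
```

Because the Bergsma-Wicher version above is bias-corrected (and the chi-squared behind it uses the continuity correction by default), its values will be slightly below SciPy's on small tables.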