Calculate chi-square between pairs of columns


Problem description

I want to calculate a chi-squared test statistic between pairs of columns in a pandas dataframe. It seems like there must be a way to do this in a similar fashion to DataFrame.corr.

If I have the following dataframe:

import pandas as pd

df = pd.DataFrame([['a', 'x', 'a'],
                   ['b', 'z', 'a'],
                   ['a', 'x', 'a']],
                  columns=['ll', 'kk', 'jj'],
                  index=['nn', 'oo', 'pp'])

I would like to be able to do:

df.corr('chisquare')

This will obviously fail, though. If the dataframe were numeric rather than categorical, I could simply call df.corr() and pass either spearman or pearson. There must be a way of calculating chi-squared between all of the columns as well.

So the output (using scipy.stats.chi2_contingency) would be:

    ll      kk      jj
ll  0.0000  0.1875  0.0
kk  0.1875  0.0000  0.0
jj  0.0000  0.0000  0.0
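
To make one cell of that matrix concrete, here is a minimal sketch of the computation for the single pair ll and kk. It only assumes the example dataframe from the question; note that scipy.stats.chi2_contingency applies Yates' continuity correction to 2x2 tables by default, which is why the statistic is small.

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame([['a', 'x', 'a'],
                   ['b', 'z', 'a'],
                   ['a', 'x', 'a']],
                  columns=['ll', 'kk', 'jj'],
                  index=['nn', 'oo', 'pp'])

# Contingency table of observed counts for one pair of columns
table = pd.crosstab(df['ll'], df['kk'])

# The result unpacks as (statistic, p-value, degrees of freedom, expected counts)
chi2_stat, p_value, dof, expected = chi2_contingency(table)
print(round(chi2_stat, 4))  # 0.1875, the ll/kk entry of the matrix above
```

The jj column has a single category, so any table involving it has zero degrees of freedom and the statistic is 0.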

Am I just missing something, or is this not possible without coding each step of the process individually? I am looking for something like DataFrame.corr, but for categorical data.

To clear up any confusion, this is what I am currently doing to get the resulting matrix:

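from itertools import combinations

import pandas as pd
from scipy.stats import chi2_contingency

def get_corr_mat(df, f=chi2_contingency):
    columns = df.columns
    # Start from zeros so the diagonal (a column against itself) stays 0
    dm = pd.DataFrame(0.0, index=columns, columns=columns)
    for var1, var2 in combinations(columns, 2):
        # Contingency table of observed counts for this pair of columns
        cont_table = pd.crosstab(df[var1], df[var2], margins=False)
        chi2_stat = f(cont_table)[0]  # first element is the test statistic
        # The statistic is symmetric, so fill both triangles
        dm.loc[var2, var1] = chi2_stat
        dm.loc[var1, var2] = chi2_stat
    return dm

get_corr_mat(df)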

As I stated above, this does work, though it can get slow and it is not tested. A pandas method would be much preferable.
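
One possible way to stay close to the df.corr interface is to pass a callable as the method argument, which DataFrame.corr accepts for numeric data. The sketch below is an illustration under assumptions, not a built-in chi-squared mode: the helper name chi2_stat is hypothetical, each category is first replaced by an integer code, and the diagonal (which corr sets to 1.0 for callables) is reset to 0 to match the desired output. Missing values would need separate handling, since category codes map them to -1.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_stat(x, y):
    """Chi-squared statistic for two 1-D arrays of category codes (hypothetical helper)."""
    return chi2_contingency(pd.crosstab(x, y))[0]

df = pd.DataFrame([['a', 'x', 'a'],
                   ['b', 'z', 'a'],
                   ['a', 'x', 'a']],
                  columns=['ll', 'kk', 'jj'],
                  index=['nn', 'oo', 'pp'])

# corr() only works on numeric data, so replace each category with an integer code
codes = df.apply(lambda col: col.astype('category').cat.codes)

raw = codes.corr(method=chi2_stat)

# corr() fixes the diagonal at 1.0 for callables; reset it on a copy of the values
values = raw.to_numpy(copy=True)
np.fill_diagonal(values, 0.0)
chi2_matrix = pd.DataFrame(values, index=raw.index, columns=raw.columns)
print(chi2_matrix)
```

This still builds one contingency table per pair of columns, so it is mainly a convenience and is unlikely to be faster than the explicit loop.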

Recommended answer

Alternate Method 1

Another way to find the chi-squared test statistic between pairs of columns, along with a heatmap visualisation:

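import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.stats import chi2_contingency

def ch_calculate(df):
    cols = df.columns.values
    # Every ordered pair of columns, including each column with itself
    factors_paired = [(i, j) for i in cols for j in cols]

    chi2, p_values = [], []

    for f in factors_paired:
        if f[0] != f[1]:
            chitest = chi2_contingency(pd.crosstab(df[f[0]], df[f[1]]))
            chi2.append(chitest[0])      # test statistic
            p_values.append(chitest[1])  # p-value
        else:      # same factor pair: placeholder values, no test is run
            chi2.append(0)
            p_values.append(0)

    # Reshape the flat lists into square matrices, then DataFrames for convenience
    shape = (len(cols), len(cols))
    chi2 = pd.DataFrame(np.array(chi2).reshape(shape), index=cols, columns=cols)
    p_values = pd.DataFrame(np.array(p_values).reshape(shape), index=cols, columns=cols)

    fig, ax = plt.subplots(figsize=(30, 30))
    sns.heatmap(chi2, annot=True, ax=ax)
    plt.show()
    return chi2, p_values

chi2, p_values = ch_calculate(df_categorical)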

Here df_categorical is a dataframe containing all nominal input variables of a dataset. For ordinal categorical variables, I think it is better to use .corr(method='spearman') (the Spearman rank correlation coefficient).
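
As a rough sketch of the ordinal case: Spearman correlation needs numbers that respect the category order, so each column is first converted to ordered integer codes. The dataframe df_ordinal, its column names, and the category orderings below are all hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical ordinal data; the orderings are assumptions made for this example
df_ordinal = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'medium', 'small'],
    'rating': ['low', 'mid', 'high', 'mid', 'low'],
})
orderings = {
    'size': ['small', 'medium', 'large'],
    'rating': ['low', 'mid', 'high'],
}

# Replace each label with its position in the declared order (0, 1, 2, ...)
ranks = pd.DataFrame(
    {col: pd.Categorical(df_ordinal[col], categories=cats, ordered=True).codes
     for col, cats in orderings.items()},
    index=df_ordinal.index,
)

spearman = ranks.corr(method='spearman')
print(spearman)
```

Without an explicit ordering, category codes follow alphabetical order, which would usually not match the intended ranking.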

I also came across this Cramér's V implementation for measuring the degree of association between categorical variables: Categorical features correlation. Using it, I wrote another function that draws a heatmap to find associated categorical columns. (Cramér's V ranges from 0 to 1 in the heatmap, where 0 means no association and 1 means strong association.)
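
For reference, the bias-corrected statistic computed by the cramers_v function in the code that follows can be written as below, where \chi^2 is the chi-squared statistic of an r-by-k contingency table with n observations. This is transcribed from that code rather than from the cited paper.

```latex
\tilde{\varphi}^2 = \max\!\left(0,\ \frac{\chi^2}{n} - \frac{(k-1)(r-1)}{n-1}\right), \qquad
\tilde{r} = r - \frac{(r-1)^2}{n-1}, \qquad
\tilde{k} = k - \frac{(k-1)^2}{n-1}, \qquad
\tilde{V} = \sqrt{\frac{\tilde{\varphi}^2}{\min(\tilde{k}-1,\ \tilde{r}-1)}}
```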

from itertools import combinations

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as ss
import seaborn as sns

def cramers_v(confusion_matrix):
    """ calculate Cramers V statistic for categorial-categorial association.
        uses correction from Bergsma, Wicher,
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    # Note: chi2_contingency applies Yates' correction to 2x2 tables by default
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    denom = min((kcorr-1), (rcorr-1))
    if denom <= 0:
        # Undefined when either variable has a single category
        return np.nan
    return np.sqrt(phi2corr / denom)

def get_corr_mat(df):
    columns = df.columns
    # Identity start: each column is treated as fully associated with itself
    dm = pd.DataFrame(np.eye(len(columns)), index=columns, columns=columns)
    for var1, var2 in combinations(columns, 2):
        cont_table = pd.crosstab(df[var1], df[var2], margins=False)
        v = cramers_v(cont_table.values)
        # Cramér's V is symmetric, so fill both triangles
        dm.loc[var2, var1] = v
        dm.loc[var1, var2] = v
    return dm

cat_corr = get_corr_mat(df_categorical)
fig, ax = plt.subplots(figsize=(30, 30))
sns.heatmap(cat_corr, annot=True, ax=ax)
plt.show()
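
As a cross-check, recent SciPy versions also ship scipy.stats.contingency.association, which computes Cramér's V directly from a contingency table. As far as I know it returns the plain statistic without the bias correction used above, so the two values can differ on small samples. The count tables below are made up for illustration.

```python
import numpy as np
from scipy.stats.contingency import association

# Made-up table where each row category maps to exactly one column category
perfect = np.array([[10, 0, 0],
                    [0, 10, 0],
                    [0, 0, 10]])

# Made-up table where rows and columns are independent
independent = np.array([[10, 10],
                        [10, 10]])

v_perfect = association(perfect, method='cramer')
v_independent = association(independent, method='cramer')
print(v_perfect, v_independent)
```

For a real dataframe, the table would come from pd.crosstab(df[col_a], df[col_b]).values, as in the functions above.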

    
    
