Using pandas, calculate Cramér's coefficient matrix


Problem Description

I have a dataframe in pandas which contains metrics calculated on Wikipedia articles. There are two categorical variables: nation, which nation the article is about, and lang, which language Wikipedia it was taken from. For a single metric, I would like to see how closely the nation and lang variables correlate; I believe this is done using Cramér's statistic.

index   qid     subj    nation  lang    metric          value
5   Q3488399    economy     cdi     fr  informativeness 0.787117
6   Q3488399    economy     cdi     fr  referencerate   0.000945
7   Q3488399    economy     cdi     fr  completeness    43.200000
8   Q3488399    economy     cdi     fr  numheadings     11.000000
9   Q3488399    economy     cdi     fr  articlelength   3176.000000
10  Q7195441    economy     cdi     en  informativeness 0.626570
11  Q7195441    economy     cdi     en  referencerate   0.008610
12  Q7195441    economy     cdi     en  completeness    6.400000
13  Q7195441    economy     cdi     en  numheadings     7.000000
14  Q7195441    economy     cdi     en  articlelength   2323.000000

I would like to generate a matrix that displays Cramér's coefficient between all combinations of nation (france, usa, cote d'ivoire, and uganda) ['fra','usa','cdi','uga'] and three languages ['fr','en','sw']. So there would be a resulting 4 by 3 matrix like:

       en         fr          sw
usa    Cramer11   Cramer12    ... 
fra    Cramer21   Cramer22    ... 
cdi    ...
uga    ...

Eventually I will do this over all the different metrics I am tracking.

for subject in list_of_subjects:
    for metric in list_of_metrics:
        cramer_matrix(subject, metric, df)

Then I can test my hypothesis that metrics will be higher for articles whose language matches the language of the Wikipedia. Thanks

Recommended Answer

Cramér's V seems overly optimistic in a few tests that I did. Wikipedia recommends a corrected version.

import numpy as np
import scipy.stats as ss

def cramers_corrected_stat(confusion_matrix):
    """ Calculate Cramér's V statistic for categorical-categorical association.
        Uses the bias correction from Wicher Bergsma,
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()  # grand total of the table (works for ndarray or DataFrame)
    phi2 = chi2/n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
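
For example, on a small 2 by 3 contingency table (the counts below are purely illustrative, not taken from the question's data):

import numpy as np

# illustrative counts only, not from the question's data
cm = np.array([[10,  5,  3],
               [ 4, 12,  6]])

print(cramers_corrected_stat(cm))  # value in [0, 1]; 0 means no association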

Also note that the confusion matrix for two categorical columns can be calculated via a built-in pandas method:

import pandas as pd
confusion_matrix = pd.crosstab(df[column1], df[column2])
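
Putting the two together, here is a minimal sketch of the per-metric loop from the question (assuming df is the long-format frame shown above, with nation, lang, and metric columns):

import pandas as pd

# assumption: df is the long-format frame from the question
for metric in df['metric'].unique():
    sub = df[df['metric'] == metric]
    confusion_matrix = pd.crosstab(sub['nation'], sub['lang'])
    print(metric, cramers_corrected_stat(confusion_matrix))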
