Using pandas, calculate Cramér's coefficient matrix
Question
I have a dataframe in pandas which contains metrics calculated on Wikipedia articles, with two categorical variables: nation, which nation the article is about, and lang, which language Wikipedia it was taken from. For a single metric, I would like to see how closely the nation and language variables correlate; I believe this is done using Cramér's statistic.
index qid subj nation lang metric value
5 Q3488399 economy cdi fr informativeness 0.787117
6 Q3488399 economy cdi fr referencerate 0.000945
7 Q3488399 economy cdi fr completeness 43.200000
8 Q3488399 economy cdi fr numheadings 11.000000
9 Q3488399 economy cdi fr articlelength 3176.000000
10 Q7195441 economy cdi en informativeness 0.626570
11 Q7195441 economy cdi en referencerate 0.008610
12 Q7195441 economy cdi en completeness 6.400000
13 Q7195441 economy cdi en numheadings 7.000000
14 Q7195441 economy cdi en articlelength 2323.000000
I would like to generate a matrix that displays Cramér's coefficient between all combinations of nation (france, usa, côte d'ivoire, and uganda) ['fra','usa','cdi','uga'] and three languages ['fr','en','sw']. So there would be a resulting 4 by 3 matrix like:
en fr sw
usa Cramer11 Cramer12 ...
fra Cramer21 Cramer22 ...
cdi ...
uga ...
Eventually then I will do this over all the different metrics I am tracking.
for subject in list_of_subjects:
    for metric in list_of_metrics:
        cramer_matrix(metric, df)
Then I can test my hypothesis that metrics will be higher for articles whose language is the language of the Wikipedia. Thanks
Answer
Cramér's V seems pretty over-optimistic in a few tests that I did. Wikipedia recommends a corrected version.
import numpy as np
import scipy.stats as ss

def cramers_corrected_stat(confusion_matrix):
    """Calculate Cramér's V statistic for categorical-categorical association,
    using the bias correction from Wicher Bergsma,
    Journal of the Korean Statistical Society 42 (2013): 323-328.
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()  # works for both DataFrames and numpy arrays
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))
Also note that the confusion matrix for a pair of categorical columns can be calculated with a built-in pandas method:
import pandas as pd
confusion_matrix = pd.crosstab(df[column1], df[column2])
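Putting the two pieces together, here is a minimal end-to-end sketch. The nation/lang rows below are toy data made up for illustration (they loosely mimic the question's example, not the asker's real dataset):

```python
import numpy as np
import pandas as pd
import scipy.stats as ss

def cramers_corrected_stat(confusion_matrix):
    """Bias-corrected Cramér's V, as defined above."""
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))

# Toy data: one row per article, with its nation and source-language Wikipedia
df = pd.DataFrame({
    "nation": ["usa", "usa", "fra", "fra", "cdi", "cdi", "uga", "uga"],
    "lang":   ["en",  "en",  "fr",  "fr",  "fr",  "en",  "sw",  "en"],
})

confusion_matrix = pd.crosstab(df["nation"], df["lang"])
v = cramers_corrected_stat(confusion_matrix)
print(confusion_matrix)
print(f"corrected Cramér's V: {v:.3f}")
```

Note that this gives a single association strength between nation and lang as whole variables (a value in [0, 1]), rather than one coefficient per (nation, lang) cell; to compare metrics, you would recompute it on the subset of rows for each metric.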