How to understand the chi-square contingency table


Question

I have a few categorical features:

['Gender',
 'Married',
 'Dependents',
 'Education',
 'Self_Employed',
 'Property_Area']

import pandas as pd
from scipy.stats import chi2_contingency

# Cross-tabulate the two categorical columns, then test the resulting counts
chi2, p, dof, expected = chi2_contingency(pd.crosstab(df.Gender, df.Married).values)
print(f'Chi-square Statistic : {chi2} ,p-value: {p}')

output:

Chi-square Statistic : 79.63562874824729 ,p-value: 4.502328957824834e-19

How can I tell from these statistics whether the features are independent of each other?

I am trying to build a classification model, so I just want to know whether these categorical columns are useful for predicting my target variable.

Solution

Contingency tables are used in statistics to summarize the relationship between several categorical variables.

In your example, the contingency table for the two variables Gender and Married is a frequency table of the two variables taken together: each cell counts the rows that have one particular combination of Gender and Married.
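To make the idea of a joint frequency table concrete, here is a minimal sketch. The eight rows are made up for illustration and stand in for the question's df; only the column names Gender and Married come from the question.

```python
import pandas as pd

# Hypothetical toy data standing in for the question's df (not the real dataset).
df = pd.DataFrame({
    "Gender":  ["Male", "Male", "Male", "Female", "Female", "Male", "Female", "Male"],
    "Married": ["Yes",  "Yes",  "No",   "No",     "No",     "Yes",  "Yes",    "No"],
})

# Each cell counts the rows that share one Gender/Married combination.
table = pd.crosstab(df["Gender"], df["Married"])
print(table)
```

The row and column totals of this table are what the chi-squared test uses to work out the counts it would expect if the two variables were unrelated.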

A chi-squared test conducted on a contingency table tests whether a relationship exists between the row variable and the column variable.


scipy.stats.chi2_contingency computes Pearson's chi-squared statistic by default. For a 2x2 table it also applies Yates' continuity correction unless you pass correction=False.
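As a rough illustration of what Pearson's statistic is, the sketch below computes it by hand and compares it with SciPy's result. The 2x2 counts are invented for the example, not taken from the question's data.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 counts (rows: Gender, columns: Married); not the question's data.
observed = np.array([[30, 70],
                     [80, 20]])

# Expected counts under independence: row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Pearson's statistic: sum over cells of (observed - expected)^2 / expected.
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# correction=False switches off the Yates continuity correction that SciPy
# otherwise applies to 2x2 tables, so the two numbers are directly comparable.
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2_manual, chi2, dof)
```

With the default correction=True the statistic for a 2x2 table comes out slightly smaller, which is why the hand computation here is compared against the uncorrected call.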

The number to look at is the significance value (often labelled "Sig." in statistics packages), which is the p-value p in your example.

The p-value measures the evidence against a null hypothesis. The smaller the p-value, the stronger the evidence that you should reject the null hypothesis.

The null hypothesis in your case is that the two variables are independent, i.e. that the observed frequencies in the contingency table match what you would expect if Gender and Married were unrelated.


Choosing a significance level (alpha) of 5%, your p-value of 4.502328957824834e-19 is far below .05, so you reject the null hypothesis: the rows and columns of the contingency table are not independent. Generally this means that it is worthwhile to interpret the cells in the contingency table.
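The decision rule can be sketched as a small function. independence_verdict is a hypothetical helper written for this illustration, and both tables are made-up counts rather than the question's data.

```python
from scipy.stats import chi2_contingency


def independence_verdict(observed, alpha=0.05):
    """Hypothetical helper: run the chi-squared test and apply the alpha rule."""
    chi2, p, dof, expected = chi2_contingency(observed)
    # Null hypothesis: the row variable and the column variable are independent.
    if p < alpha:
        return p, "reject independence"
    return p, "fail to reject independence"


# Made-up counts: the first table is clearly lopsided, the second is perfectly balanced.
p_skewed, verdict_skewed = independence_verdict([[30, 70], [80, 20]])
p_even, verdict_even = independence_verdict([[50, 50], [50, 50]])

print(verdict_skewed, p_skewed)
print(verdict_even, p_even)
```

Note that failing to reject independence is not proof that the variables are independent; it only means the data give no strong evidence of an association.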

In this particular case it means that being Male or Female (i.e. Gender) is not distributed similarly across the different levels of Marital Status (i.e. Married, Not-Married).

So being married may be more common for one gender than for the other!
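One way to see which gender is married more often is to turn the counts into row proportions. This is a small sketch on made-up rows (only the column names come from the question), and the proportions are descriptive only; they do not replace the test.

```python
import pandas as pd

# Hypothetical toy data standing in for the question's df.
df = pd.DataFrame({
    "Gender":  ["Male", "Male", "Male", "Female", "Female", "Male", "Female", "Male"],
    "Married": ["Yes",  "Yes",  "No",   "No",     "No",     "Yes",  "Yes",    "No"],
})

# normalize="index" divides every row by its row total, so each row sums to 1
# and shows the share of married / not married within that gender.
shares = pd.crosstab(df["Gender"], df["Married"], normalize="index")
print(shares)
```

If the two variables were independent, the rows of this table would look roughly the same; large differences between rows are what a small p-value is picking up.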


Update

According to your comment, I see you have some doubts about this test.

This test basically tells you whether the relationship between the variables is significant (i.e. likely to hold in the population) or could have arisen by chance!

So if the result is highly significant (a low p-value), that means there's a significant dependency between the variables!

Now, if Gender and Married are both features in your model, their association may lead to overfitting and feature redundancy. In that case, you may want to keep only one of them.

But if Gender or Married is the dependent variable (like y), then it's good that they have a significant relationship.
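Following that logic, one way to screen the categorical columns is to test each of them against the target rather than against each other. The sketch below assumes a target column named "target" and uses invented rows; both are placeholders, since the question does not show its target variable.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: "target" is a placeholder name for the label column.
# Married is built to track the target closely, Gender to be unrelated to it.
df = pd.DataFrame({
    "Gender":  ["Male", "Female"] * 20,
    "Married": ["Yes"] * 20 + ["No"] * 20,
    "target":  ["Y"] * 18 + ["N"] * 2 + ["N"] * 18 + ["Y"] * 2,
})

features = ["Gender", "Married"]
p_values = {}
for col in features:
    # Contingency table of one feature against the target.
    table = pd.crosstab(df[col], df["target"]).values
    chi2, p, dof, expected = chi2_contingency(table)
    p_values[col] = p

for col, p in p_values.items():
    print(f"{col}: p-value = {p:.4g}")
```

A small p-value here only says the feature is associated with the target; how much it helps a particular classifier still has to be checked with the model itself.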

Extra bonus: sometimes one of the features temporarily becomes a dependent variable during data imputation (when you have missing values).
