分类特征相关 [英] Categorical features correlation
问题描述
我的数据中有一些连续的分类特征.对类别特征进行热编码以使其与其他连续生物一起与标签相关联,这是一个好主意还是绝对坏主意?
I have some categorical features in my data along with continuous ones. Is it a good or absolutely bad idea to hot encode category features to find correlation of it to labels along with other continuous creatures?
推荐答案
有一种无需对类别变量进行一次热编码就可以计算相关系数的方法. Cramers V统计量是一种用于计算分类变量的相关性的方法.可以如下计算.以下链接很有帮助. 使用熊猫,计算Cramér系数矩阵对于其他连续变量值,可以使用pandas
的cut
进行分类.
There is a way to calculate the correlation coefficient without one-hot encoding the category variable. Cramers V statistic is one method for calculating the correlation of categorical variables. It can be calculated as follows. The following link is helpful. Using pandas, calculate Cramér's coefficient matrix For variables with other continuous values, you can categorize by using cut
of pandas
.
import pandas as pd
import numpy as np
import scipy.stats as ss
import seaborn as sns
tips = sns.load_dataset("tips")
tips["total_bill_cut"] = pd.cut(tips["total_bill"],
np.arange(0, 55, 5),
include_lowest=True,
right=False)
def cramers_v(confusion_matrix):
""" calculate Cramers V statistic for categorial-categorial association.
uses correction from Bergsma and Wicher,
Journal of the Korean Statistical Society 42 (2013): 323-328
"""
chi2 = ss.chi2_contingency(confusion_matrix)[0]
n = confusion_matrix.sum()
phi2 = chi2 / n
r, k = confusion_matrix.shape
phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
rcorr = r - ((r-1)**2)/(n-1)
kcorr = k - ((k-1)**2)/(n-1)
return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
confusion_matrix = pd.crosstab(tips["day"], tips["time"]).as_matrix()
cramers_v(confusion_matrix)
# Out[10]: 0.93866193407222209
confusion_matrix = pd.crosstab(tips["total_bill_cut"], tips["time"]).as_matrix()
cramers_v(confusion_matrix)
# Out[24]: 0.16498707494988371
这篇关于分类特征相关的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!