Remove strongly correlated columns from DataFrame

Problem description

I have a DataFrame like this:

import numpy as np
import pandas as pd

dict_ = {
    'Date': ['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05'],
    'Col1': [1, 2, 3, 4, 5],
    'Col2': [1.1, 1.2, 1.3, 1.4, 1.5],
    'Col3': [0.33, 0.98, 1.54, 0.01, 0.99],
}
df = pd.DataFrame(dict_, columns=list(dict_))

I then calculate the Pearson correlation between the columns and filter out the columns that are correlated above my threshold of 0.95:

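def trimm_correlated(df_in, threshold):
    # Pearson correlation of the numeric columns only (skips the string 'Date' column)
    df_corr = df_in.select_dtypes(include='number').corr(method='pearson', min_periods=1)
    # Mask the diagonal (self-correlation), then flag columns with no |r| above the threshold
    df_not_correlated = ~(df_corr.mask(np.eye(len(df_corr), dtype=bool)).abs() > threshold).any()
    un_corr_idx = df_not_correlated[df_not_correlated].index
    df_out = df_in[un_corr_idx]
    return df_out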

This produces:

uncorrelated_factors = trimm_correlated(df, 0.95)
print(uncorrelated_factors)

    Col3
0   0.33
1   0.98
2   1.54
3   0.01
4   0.99

So far I am happy with the result, but I would like to keep one column from each correlated pair, so in the above example I would like to include either Col1 or Col2, to get something like this:

    Col1   Col3
0    1     0.33
1    2     0.98
2    3     1.54
3    4     0.01
4    5     0.99

Also, on a side note, is there any further evaluation I can do to determine which of the correlated columns to keep?

Thanks

Recommended answer

You can get the following output:

    Col1    Col3
0   1       0.33
1   2       0.98
2   3       1.54
3   4       0.01
4   5       0.99
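
The code that produced this output is not included above. One way to arrive at it, offered only as a sketch and not necessarily the answerer's method, is to look at the strict upper triangle of the correlation matrix instead of masking just the diagonal. Each correlated pair is then counted once, and only the later column of the pair is dropped. The function name `trim_correlated_keep_one` is hypothetical, and the example assumes the same data as in the question.

```python
import numpy as np
import pandas as pd


def trim_correlated_keep_one(df_in, threshold):
    """Drop the later column of every pair whose |Pearson r| exceeds threshold."""
    # Correlate numeric columns only; non-numeric columns such as 'Date' are ignored.
    df_corr = df_in.select_dtypes(include="number").corr(method="pearson", min_periods=1)
    # Keep entries strictly above the diagonal, so each pair (i, j) with i < j
    # appears exactly once; everything else becomes NaN.
    upper_mask = np.triu(np.ones(df_corr.shape, dtype=bool), k=1)
    upper = df_corr.where(upper_mask)
    # A column is dropped if it is too strongly correlated with any earlier column.
    to_drop = [col for col in upper.columns if (upper[col].abs() > threshold).any()]
    keep = [col for col in df_corr.columns if col not in to_drop]
    return df_in[keep]


# Example data taken from the question.
df = pd.DataFrame({
    "Date": ["2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04", "2018-01-05"],
    "Col1": [1, 2, 3, 4, 5],
    "Col2": [1.1, 1.2, 1.3, 1.4, 1.5],
    "Col3": [0.33, 0.98, 1.54, 0.01, 0.99],
})

result = trim_correlated_keep_one(df, 0.95)
print(result)
```

With this approach the column that appears first in the DataFrame is the one that survives, so reordering the columns changes which member of a pair is kept. Note also that a column is dropped when it correlates with any earlier column, even one that is itself dropped, which can remove more columns than strictly necessary in longer correlation chains.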
