How to calculate correlation between all columns and remove highly correlated ones using Python or pandas
Problem description
I have a huge data set, and before machine learning modeling it is usually suggested to first remove highly correlated descriptors (columns). How can I calculate the column-wise correlation and remove columns above a threshold value, say all columns or descriptors with a correlation > 0.8? The column headers should also be retained in the reduced data.
Example data set
GA PN PC MBP GR AP
0.033 6.652 6.681 0.194 0.874 3.177
0.034 9.039 6.224 0.194 1.137 3.4
0.035 10.936 10.304 1.015 0.911 4.9
0.022 10.11 9.603 1.374 0.848 4.566
0.035 2.963 17.156 0.599 0.823 9.406
0.033 10.872 10.244 1.015 0.574 4.871
0.035 21.694 22.389 1.015 0.859 9.259
0.035 10.936 10.304 1.015 0.911 4.5
Please help.
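For reference, the sample above can be loaded into pandas and its correlation matrix inspected before deciding what to drop. This is only a minimal sketch: the `SAMPLE` string and the names `df` and `corr_matrix` are illustrative, and it assumes the data is whitespace-separated exactly as shown.

```python
import io

import pandas as pd

# The sample rows from the question, as whitespace-separated text.
SAMPLE = """\
GA PN PC MBP GR AP
0.033 6.652 6.681 0.194 0.874 3.177
0.034 9.039 6.224 0.194 1.137 3.4
0.035 10.936 10.304 1.015 0.911 4.9
0.022 10.11 9.603 1.374 0.848 4.566
0.035 2.963 17.156 0.599 0.823 9.406
0.033 10.872 10.244 1.015 0.574 4.871
0.035 21.694 22.389 1.015 0.859 9.259
0.035 10.936 10.304 1.015 0.911 4.5
"""

# The first line becomes the column headers.
df = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+")

# Pairwise Pearson correlation between all numeric columns.
corr_matrix = df.corr()
print(corr_matrix.round(3))
```

The result is a square DataFrame indexed by column name on both axes, so any filtering based on it can keep the original headers.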
Recommended answer
This is the approach I use:
def correlation(dataset, threshold):
    """Delete, in place, each column correlated >= threshold with an earlier kept column."""
    col_corr = set()  # Set of all the names of deleted columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            # Only the signed correlation is compared here; wrap the value in abs()
            # if strongly negative correlations should also count.
            if (corr_matrix.iloc[i, j] >= threshold) and (corr_matrix.columns[j] not in col_corr):
                colname = corr_matrix.columns[i]  # getting the name of column
                col_corr.add(colname)
                if colname in dataset.columns:
                    del dataset[colname]  # deleting the column from the dataset

    print(dataset)
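A possible variant, sketched here under a few assumptions rather than taken from the answer: it returns a reduced copy instead of deleting columns in place, compares absolute correlations, and uses the strict upper triangle of the matrix so each pair is checked once. The function name `drop_correlated` and the toy columns `a`, `b`, `c` are hypothetical. Unlike the loop above, it drops a column when it correlates with any earlier column, even one that is itself dropped.

```python
import numpy as np
import pandas as pd


def drop_correlated(df: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Return a copy of df without columns highly correlated with an earlier column."""
    corr = df.corr().abs()
    # Mask everything except the strict upper triangle (masked cells become NaN),
    # so each column is only compared with the columns before it.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


# Toy data: b is almost a multiple of a, c is only weakly related to either.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
reduced = drop_correlated(toy, threshold=0.8)
print(list(reduced.columns))  # → ['a', 'c']
```

Because `drop` returns a new DataFrame, the original frame and its headers stay untouched, which may be preferable when the full data is still needed later.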
Hope this helps!