Some of my columns go missing when I use df.corr in Pandas

Problem description

Here is my code:


import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('death_regression2.csv')
data3 = data.replace(r'\s+', np.nan, regex = True)  


plt.figure(figsize=(90,90)) 
corr = data3.corr()

print(np.shape(list(corr)))
print(np.shape(data3))


Output:

(135,)
(4909, 204)

So before I use the correlation function, the total number of parameters is 204 (the number of columns), but after calling data3.corr() some parameters go missing and only 135 remain.

How do I check the correlation between all columns in the data?

Solution

Without seeing any additional data to understand why you are missing columns, we will have to inspect what pd.DataFrame.corr does.

As the documentation outlines, it computes the pairwise correlations of columns. Because you specified no arguments, it uses the default method and calculates Pearson's r, which measures the linear correlation between two variables (X, Y). It takes values between -1 and 1: -1 is an exact negative linear correlation, 1 is an exact positive linear correlation, and 0 means no linear correlation (i.e., the plot of X against Y is a random scatter and a linear regression would fit a flat slope).
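To make the definition concrete, here is a minimal sketch that computes Pearson's r directly from its formula and checks the boundary cases. The helper name pearson_r and the seeded random data are illustrative assumptions, not part of the original question or of pandas.

```python
import numpy as np

# Seeded random data so the run is repeatable (illustrative values only)
rng = np.random.default_rng(0)
x = rng.random(10)
y = rng.random(10)

def pearson_r(a, b):
    """Pearson's r: covariance of a and b divided by the product of their spreads."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

r_xy = pearson_r(x, y)          # two unrelated variables: somewhere in [-1, 1]
r_scaled = pearson_r(x, 6 * x)  # exact positive linear relation: 1
r_neg = pearson_r(x, -x)        # exact negative linear relation: -1

print(r_xy, r_scaled, r_neg)
```

Scaling a variable by a positive constant leaves r unchanged, which is why a column and its rescaled copy correlate at exactly 1.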

For non-numerical variables, there is no concept of correlation (at least within the context of Pearson's r and this answer), so pd.DataFrame.corr simply ignores non-numerical columns (i.e., columns whose values are not floats or integers) and drops them, which explains why you have fewer columns.
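One way to see which columns are affected is to list the non-numeric ones by dtype. This is a minimal sketch on a small made-up frame standing in for data3; the column names are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the asker's data3: two numeric and two text columns
df = pd.DataFrame({
    "age": [30, 41, 57],
    "rate": [0.1, 0.4, 0.3],
    "region": ["north", "south", "east"],
    "count_as_text": ["12", "7", "19"],
})

# Columns whose dtype is not numeric; these are the ones left out of the matrix
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Selecting the numeric columns first makes the dropping explicit
corr = df.select_dtypes(include="number").corr()
print(corr.shape)
```

Applied to the real data, the length of non_numeric should account for the gap between 204 columns and 135.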

If your dropped columns are in fact numerical but stored (for example) as strings, you probably need to convert them before calling .corr().
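A possible conversion, sketched under the assumption that the affected columns hold numbers as text with occasional blanks: pd.to_numeric with errors="coerce" turns unparseable entries into NaN. The frame and column names below are made up for illustration.

```python
import pandas as pd

# Made-up data: one numeric column, one numeric-as-text column, one real category
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b_text": ["2", "4", " ", "8"],  # numbers stored as strings, one blank
    "label": ["w", "x", "y", "z"],   # genuinely categorical
})

converted = df.copy()
for col in converted.columns:
    if not pd.api.types.is_numeric_dtype(converted[col]):
        # Unparseable entries (blanks, words) become NaN instead of raising
        as_num = pd.to_numeric(converted[col], errors="coerce")
        # Keep the conversion only if at least one value parsed as a number
        if as_num.notna().any():
            converted[col] = as_num

# corr skips NaN pairwise, so the blank row is left out of the a/b_text pair
corr = converted.select_dtypes(include="number").corr()
print(corr)
```

Note that coercion silently discards any entry that does not parse, so it is worth counting the NaN values per column before trusting the result.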

As an example:

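import numpy as np
import pandas as pd

x = np.random.rand(10)
y = np.random.rand(10)
x_scaled = x*6   # exact linear function of x
cat = ['one', 'two', 'three', 'four', 'five', 
       'six','seven', 'eight', 'nine', 'ten']

df = pd.DataFrame({'x':x, 'y':y, 'x_s':x_scaled, 'cat':cat})

# On pandas 2.0 and later the default no longer drops non-numeric columns;
# there, call df.corr(numeric_only=True) to get the same result.
df.corr()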

returns:

        x            y          x_s
 x   1.000000    -0.470699    1.000000
 y  -0.470699     1.000000   -0.470699
x_s  1.000000    -0.470699    1.000000

which is our correlation matrix but our non-numerical column (cat) has been dropped.

If you plot the different numerical variables against each other, the resulting plot helps highlight the different correlations: by chance there is a negative linear correlation between x and y.
