Pandas corr() returning NaN too often


Question

I'm trying to run what I think should be a simple correlation function on a DataFrame, but it returns NaN in places where I don't believe it should.

Code:

# setup
import pandas as pd
import io

csv = io.StringIO(u'''
id  date    num
A   2018-08-01  99
A   2018-08-02  50
A   2018-08-03  100
A   2018-08-04  100
A   2018-08-05  100
B   2018-07-31  500
B   2018-08-01  100
B   2018-08-02  100
B   2018-08-03  0
B   2018-08-05  100
B   2018-08-06  500
B   2018-08-07  500
B   2018-08-08  100
C   2018-08-01  100
C   2018-08-02  50
C   2018-08-03  100
C   2018-08-06  300
''')

# Split on runs of whitespace so the sample parses whether it uses tabs or spaces
df = pd.read_csv(csv, sep=r'\s+')

# Format manipulation
df = df[df['num'] > 50]
df = df.pivot(index = 'date', columns = 'id', values = 'num')
df = pd.DataFrame(df.to_records())

# Main correlation calculations
print(df.iloc[:, 1:].corr())

Subject DataFrame:

       A      B      C
0    NaN  500.0    NaN
1   99.0  100.0  100.0
2    NaN  100.0    NaN
3  100.0    NaN  100.0
4  100.0    NaN    NaN
5  100.0  100.0    NaN
6    NaN  500.0  300.0
7    NaN  500.0    NaN
8    NaN  100.0    NaN

corr() results:

    A    B    C
A  1.0  NaN  NaN
B  NaN  1.0  1.0
C  NaN  1.0  1.0

According to the (limited) documentation on the function, it should exclude "NA/null values". Since every pair of columns has overlapping values, shouldn't all of the results be non-NaN?

There are good discussions here and here, but neither answered my question. I've tried the float64 idea discussed here, but that failed as well.

@hellpanderr's comment brought up a good point: I'm using pandas 0.22.0.

Bonus question: I'm no mathematician, but how can there be a 1:1 correlation between B and C in this result?

Recommended answer

The result seems to be an artefact of the data you are working with. As you write, NAs are ignored, so for each pair of columns only the rows where both values are present are used. For B and C that boils down to:

df[['B', 'C']].dropna()

       B      C
1  100.0  100.0
6  500.0  300.0

So only two values per column are left for the calculation. Two points always lie exactly on a straight line, and here both values increase together, so the correlation coefficient is 1:

df[['B', 'C']].dropna().corr()

     B    C
B  1.0  1.0
C  1.0  1.0
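
A coefficient of 1 computed from just two rows says very little about the data. The sketch below rebuilds the two overlapping B/C rows by hand and, as one possible safeguard, uses the `min_periods` argument of `corr()` to blank out pairs with too few shared observations. The threshold of 3 is an arbitrary choice for illustration, not a recommendation from the original answer.

```python
import numpy as np
import pandas as pd

# The two rows in which both B and C are present (copied from the output above).
pair = pd.DataFrame({"B": [100.0, 500.0], "C": [100.0, 300.0]})

# With only two points the Pearson coefficient is forced to +1 or -1.
r = pair["B"].corr(pair["C"])
print(r)

# Require at least 3 shared observations per pair (arbitrary threshold);
# pairs with fewer are reported as NaN instead of a misleading 1.0.
masked = pair.corr(min_periods=3)
print(masked)
```

Applied to the full DataFrame from the question, the same `min_periods` argument would turn the B/C entry into NaN as well, because that pair shares only two rows.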

So where do the NaNs for the remaining combinations come from?

df[['A', 'B']].dropna()

       A      B
1   99.0  100.0
5  100.0  100.0


df[['A', 'C']].dropna()

       A      C
1   99.0  100.0
3  100.0  100.0
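
Rather than calling `dropna()` on each pair separately, the number of shared observations can be counted for all pairs in one step. This is a minimal sketch that retypes the DataFrame from the question by hand; the variable names `present` and `overlap` are made up for the example.

```python
import numpy as np
import pandas as pd

nan = np.nan
# The pivoted data from the question, retyped by hand.
df = pd.DataFrame({
    "A": [nan, 99.0, nan, 100.0, 100.0, 100.0, nan, nan, nan],
    "B": [500.0, 100.0, 100.0, nan, nan, 100.0, 500.0, 500.0, 100.0],
    "C": [nan, 100.0, nan, 100.0, nan, nan, 300.0, nan, nan],
})

# 1 where a value is present, 0 where it is NaN.
present = df.notna().astype(int)

# Entry (i, j) counts the rows in which columns i and j are both present.
overlap = present.T @ present
print(overlap)
```

Every off-diagonal entry comes out as 2, which matches the two-row frames shown above.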

Here, too, you end up with only two values per column. The difference is that in these pairs one column is constant: B (paired with A) and C (paired with A) each contain only the single value 100, which gives a standard deviation of 0:

df[['A', 'C']].dropna().std()

A    0.707107
C    0.000000

When the correlation coefficient is calculated, the covariance is divided by the product of the two standard deviations. With a standard deviation of 0 this becomes 0/0, which yields NaN.
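
To make the division explicit, the Pearson coefficient can be written as r = cov(X, Y) / (std(X) * std(Y)). The sketch below computes those pieces by hand for the A/C pair and compares the result with `Series.corr`. It only illustrates the formula and is not a copy of how pandas computes the value internally.

```python
import numpy as np
import pandas as pd

# The two rows in which both A and C are present (copied from the output above).
pair = pd.DataFrame({"A": [99.0, 100.0], "C": [100.0, 100.0]})

cov = pair["A"].cov(pair["C"])  # 0.0, because C never deviates from its mean
std_a = pair["A"].std()
std_c = pair["C"].std()         # 0.0, because C is constant

# 0 / 0 in floating point yields NaN; silence the NumPy warning for the demo.
with np.errstate(invalid="ignore", divide="ignore"):
    r_manual = np.float64(cov) / (std_a * std_c)

print(r_manual)
print(pair["A"].corr(pair["C"]))
```

Both values come out as NaN, which is what appears in the A row and A column of the `corr()` result in the question.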
