The result of dataframe.mean() is incorrect

Problem description

I am working in Python 2.7 with a data frame, and I want the mean of the column called 'c', restricted to the rows where another column equals a given value. When I run the code, the mean is not what I expect, yet the median computed over the same rows is correct.

Why is the output of the mean incorrect?

The code is the following:

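import numpy as np
import pandas as pd

# Build the frame from a NumPy array of mixed values
df = pd.DataFrame(
    np.array([['A', 1, 2, 3], ['A', 4, 5, np.nan], ['A', 7, 8, 9],
              ['B', 3, 2, np.nan], ['B', 5, 6, np.nan], ['B', 5, 6, np.nan]]),
    columns=['a', 'b', 'c', 'd']
)

df  # display the frame

# Mean of column 'c' over the rows where column 'a' matches
mean1 = df[df.a == 'A'].c.mean()
mean2 = df[df.a == 'B'].c.mean()

# Median of column 'c' over the same rows
median1 = df[df.a == 'A'].c.median()
median2 = df[df.a == 'B'].c.median()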

The output:

df
Out[1]: 
   a  b  c    d
0  A  1  2    3
1  A  4  5  nan
2  A  7  8    9
3  B  3  2  nan
4  B  5  6  nan
5  B  5  6  nan

mean1
Out[2]: 86.0

mean2
Out[3]: 88.66666666666667

median1
Out[4]: 5.0

median2
Out[5]: 6.0

It is obvious that the output of the mean is incorrect.

Thanks.

Solution

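When it calculates the mean, pandas is doing string concatenation for the "sum", which is plain to see from your example frame. For the 'B' rows, the strings '2', '6', and '6' are joined into '266', which is then treated as the number 266 and divided by the count of 3.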


>>> df[df.a == 'B'].c
3    2
4    6
5    6
Name: c, dtype: object
>>> 266 / 3
88.66666666666667

If you look at the dtypes of your DataFrame, you'll notice that all of them are object, even though no single Series contains mixed types. This is due to how the NumPy array is declared. Arrays are not meant to hold heterogeneous types, so np.array converts every element, the numbers included, to a string, and the DataFrame constructor then stores those strings in object columns. You can avoid this behavior by passing the constructor a list instead, which can hold values of differing types with no issues.
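
To check what each construction produces on your own installation, a small side-by-side comparison like the following may help. The names `rows`, `from_array`, and `from_list` are illustrative and not from the original post, and the exact dtype labels printed can differ between NumPy and pandas versions.

```python
import numpy as np
import pandas as pd

rows = [['A', 1, 2, 3], ['A', 4, 5, np.nan], ['A', 7, 8, 9],
        ['B', 3, 2, np.nan], ['B', 5, 6, np.nan], ['B', 5, 6, np.nan]]
columns = ['a', 'b', 'c', 'd']

# A NumPy array has one dtype for all of its elements.
arr = np.array(rows)
print(arr.dtype)
print(repr(arr[0, 2]))  # shows what the number 2 became inside the array

# Same data, two ways of building the frame.
from_array = pd.DataFrame(arr, columns=columns)
from_list = pd.DataFrame(rows, columns=columns)

print(from_array.dtypes)  # no numeric columns expected here
print(from_list.dtypes)   # per-column dtypes inferred from the list

# Programmatic check of column 'c' in each frame.
c_is_numeric_from_array = pd.api.types.is_numeric_dtype(from_array['c'])
c_is_numeric_from_list = pd.api.types.is_numeric_dtype(from_list['c'])
print(c_is_numeric_from_array, c_is_numeric_from_list)
```

Checking `dtypes` right after construction is a cheap way to catch this kind of problem before any aggregation is run.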


df = pd.DataFrame(
    [['A', 1, 2, 3], ['A', 4, 5, np.nan], ['A', 7, 8, 9], ['B', 3, 2, np.nan], ['B', 5, 6, np.nan], ['B',5, 6, np.nan]],
    columns=['a', 'b', 'c', 'd']
)

df[df.a == 'B'].c.mean()

4.666666666666667


In [17]: df.dtypes
Out[17]:
a     object
b      int64
c      int64
d    float64
dtype: object


I still can't imagine that this behavior is intended, so I believe it's worth opening an issue report on the pandas development page. In general, though, you shouldn't be using object dtype Series for numeric calculations.
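
If a frame has already been built from such an array, one possible alternative (not part of the original answer) is to convert the affected columns with `pd.to_numeric` before doing any arithmetic. This is a minimal sketch, assuming that columns 'b', 'c', and 'd' are the ones meant to be numeric and that unparseable values should become NaN; `numeric_cols`, `mean_a`, and `mean_b` are illustrative names.

```python
import numpy as np
import pandas as pd

# Rebuild the problematic frame from the question.
df = pd.DataFrame(
    np.array([['A', 1, 2, 3], ['A', 4, 5, np.nan], ['A', 7, 8, 9],
              ['B', 3, 2, np.nan], ['B', 5, 6, np.nan], ['B', 5, 6, np.nan]]),
    columns=['a', 'b', 'c', 'd']
)

# Assumption: these are the columns that should hold numbers.
numeric_cols = ['b', 'c', 'd']
for col in numeric_cols:
    # errors='coerce' turns anything that cannot be parsed into NaN
    df[col] = pd.to_numeric(df[col], errors='coerce')

print(df.dtypes)

mean_a = df.loc[df['a'] == 'A', 'c'].mean()
mean_b = df.loc[df['a'] == 'B', 'c'].mean()
print(mean_a)  # 5.0
print(mean_b)  # 4.666666666666667
```

Building the frame from a list, as shown in the answer, avoids the conversion step entirely; this route is only relevant when the array already exists.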
