Specifying "skip NA" when calculating the mean of a column in a data frame created by Pandas

Question

I am learning the Pandas package by replicating the output from some of the R vignettes. For now I am using the dplyr package from R as an example:

http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

planes <- group_by(hflights_df, TailNum)
delay <- summarise(planes,
  count = n(),
  dist = mean(Distance, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)

Python script

planes = hflights.groupby('TailNum')
planes['Distance'].agg({'count' : 'count',
                        'dist' : 'mean'})

How can I state explicitly in Python that NA values need to be skipped?

Recommended answer

That's a trick question, since you don't need to. Pandas automatically excludes NaN values from aggregation functions. Consider my df:

    b   c   d  e
a               
2   2   6   1  3
2   4   8 NaN  7
2   4   4   6  3
3   5 NaN   2  6
4 NaN NaN   4  1
5   6   2   1  8
7   3   2   4  7
9   6   1 NaN  1
9 NaN NaN   9  3
9   3   4   6  1

The built-in count() function ignores NaN values, and so does mean(). The only place we get NaN is where every value in a group is NaN. In that case we are taking the mean of an empty set, which is NaN:

In[335]: df.groupby('a').mean()
Out[333]: 
          b    c    d         e
a                              
2  3.333333  6.0  3.5  4.333333
3  5.000000  NaN  2.0  6.000000
4       NaN  NaN  4.0  1.000000
5  6.000000  2.0  1.0  8.000000
7  3.000000  2.0  4.0  7.000000
9  4.500000  2.5  7.5  1.666667
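
The result above can be checked by rebuilding the frame. This is a minimal sketch, assuming a reasonably recent pandas and NumPy; the values are copied from the table in this answer, and 'a' is kept as an ordinary column rather than as the index, which is an assumption made only for brevity.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Values copied from the example table; 'a' is the grouping key.
df = pd.DataFrame(
    {
        "a": [2, 2, 2, 3, 4, 5, 7, 9, 9, 9],
        "b": [2, 4, 4, 5, nan, 6, 3, 6, nan, 3],
        "c": [6, 8, 4, nan, nan, 2, 2, 1, nan, 4],
        "d": [1, nan, 6, 2, 4, 1, 4, nan, 9, 6],
        "e": [3, 7, 3, 6, 1, 8, 7, 1, 3, 1],
    }
)

grouped = df.groupby("a")
counts = grouped.count()  # number of non-NaN values per group and column
means = grouped.mean()    # NaN values are skipped when averaging

print(counts)
print(means)
```

Group 4 has a count of 0 in columns b and c, which is exactly where the mean comes out as NaN.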

Aggregate functions work in the same way:

In[340]: df.groupby('a')['b'].agg({'foo': np.mean})
Out[338]: 
        foo
a          
2  3.333333
3  5.000000
4       NaN
5  6.000000
7  3.000000
9  4.500000
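
The dict-of-names form of agg used above was deprecated and later removed from pandas, so on recent versions the renaming is usually written as named aggregation instead (assuming pandas 0.25 or later). The sketch below applies that to the pipeline from the question. The small hflights frame is made-up sample data, not the real data set, and the count threshold is lowered to suit it.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the hflights data (illustration only).
hflights = pd.DataFrame(
    {
        "TailNum": ["N1", "N1", "N1", "N2", "N2", "N3"],
        "Distance": [100.0, np.nan, 300.0, 2500.0, 2700.0, np.nan],
    }
)

delay = hflights.groupby("TailNum").agg(
    count=("Distance", "size"),    # rows per group, NaN included
    non_na=("Distance", "count"),  # non-NaN values only
    dist=("Distance", "mean"),     # NaN skipped by default
)

# Count threshold lowered from the question's 20 to fit the toy data.
delay = delay[(delay["count"] > 1) & (delay["dist"] < 2000)]
print(delay)
```

One difference is worth keeping in mind: dplyr's n() counts rows, while the pandas 'count' aggregation counts only non-NaN values, so 'size' is the closer match for n().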

Addendum: Notice that the standard DataFrame.mean API lets you control whether NaN values are included, through its skipna parameter. The default is to exclude them.
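
As a small illustration of that parameter, here is a sketch on a made-up four-row frame (the data and variable names are hypothetical). Whether the groupby mean() itself accepts skipna depends on the pandas version, so the per-group case goes through Series.mean inside a lambda.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group 1 contains a NaN, group 2 does not.
df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [1.0, np.nan, 3.0, 5.0]})

default_mean = df["b"].mean()             # NaN skipped: (1 + 3 + 5) / 3
strict_mean = df["b"].mean(skipna=False)  # NaN propagates into the result

# Per-group version: call Series.mean with skipna=False on each group.
strict_by_group = df.groupby("a")["b"].agg(lambda s: s.mean(skipna=False))

print(default_mean, strict_mean)
print(strict_by_group)
```

With skipna=False, any group containing a missing value reports NaN rather than the mean of the remaining values.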
