Why does scipy.stats.nanmean give different result from numpy.nansum?


Problem Description



>>> import numpy as np
>>> from scipy import stats
>>> a = np.r_[1., 2., np.nan, 4., 5.]
>>> stats.nanmean(a)
2.9999999999999996
>>> np.nansum(a)/np.sum(~np.isnan(a))
3.0

I'm aware of the limitations of floating point representation. I'm just curious why the more clumsy expression seems to give a "better" result.

Solution

First of all, here is scipy.nanmean() so that we know what we're comparing to:

def nanmean(x, axis=0):
    x, axis = _chk_asarray(x,axis)
    x = x.copy()
    Norig = x.shape[axis]
    factor = 1.0-np.sum(np.isnan(x),axis)*1.0/Norig

    x[np.isnan(x)] = 0
    return np.mean(x,axis)/factor

Mathematically, the two methods are equivalent. Numerically, they are different.
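In symbols: with S the sum of the non-NaN entries, n their count and N the total length, the one-liner computes S/n, while nanmean computes (S/N)/(n/N). In exact arithmetic these are the same number; in floating point the rounding happens at different places.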

Your method involves a single division, and it so happens that:

  • the numerator (1. + 2. + 4. + 5.) can be represented exactly as a float; and
  • the denominator (4.) is a power of two.

This means that the result of the division is exact, 3..
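A quick check with the numbers from the question (num and den are just illustrative names):

>>> num = 1. + 2. + 4. + 5.   # 12.0; every partial sum is exactly representable
>>> den = 4.                  # a power of two, so dividing by it is exact
>>> num / den
3.0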

stats.nanmean() involves first computing the mean of [1., 2., 0., 4., 5.], and then adjusting it to account for NaNs. As it happens, this mean (2.4) cannot be represented exactly as a float, so from this point on the computation is inexact.
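Retracing those intermediate steps by hand makes the rounding visible (m and factor are just illustrative names mirroring the source above):

>>> m = np.mean([1., 2., 0., 4., 5.])   # 12.0 / 5 -> the float closest to 2.4
>>> factor = 1.0 - 1.0/5                # 0.8, also not exactly representable
>>> m / factor                          # dividing two rounded values
2.9999999999999996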

I haven't given it a lot of thought, but it may be possible to construct an example where the roles would be reversed, and stats.nanmean() would give a more accurate result than the other method.

What surprises me is that stats.nanmean() doesn't simply do something like:

In [6]: np.mean(np.ma.MaskedArray(a, np.isnan(a)))
Out[6]: 3.0

This seems to me to be a superior approach to what it does currently.
