mean from pandas and numpy differ

Problem description

I have a MEMS IMU on which I've been collecting data, and I'm using pandas to get some statistics from it. Six 32-bit floats are collected each cycle. The data rate is fixed for a given collection run, but varies between 100 Hz and 1000 Hz across runs, and collection times run as long as 72 hours. The data is saved in a flat binary file. I read the data this way:

>>> import numpy as np
>>> import pandas as pd
>>> dataType = np.dtype([('a', '<f4'), ('b', '<f4'), ('c', '<f4'),
...                      ('d', '<f4'), ('e', '<f4'), ('f', '<f4')])
>>> df = pd.DataFrame(np.fromfile('FILENAME', dataType))
>>> df['c'].mean()
-9.880581855773926
>>> x = df['c'].values
>>> x.mean()
-9.8332081

-9.833 is the correct result. I can create a similar result that anyone should be able to reproduce this way:

>>> import numpy as np
>>> import pandas as pd
>>> x = np.random.normal(-9.8, .05, size=900000)
>>> df = pd.DataFrame(x, dtype='float32', columns=['x'])
>>> df['x'].mean()
-9.859579086303711
>>> x.mean()
-9.8000648778888628

I've repeated this on Linux and Windows, on AMD and Intel processors, in Python 2.7 and 3.5. I'm stumped. What am I doing wrong? And get this:

>>> x = np.random.normal(-9., .005, size=900000)
>>> df = pd.DataFrame(x, dtype='float32', columns=['x'])
>>> df['x'].mean()
-8.999998092651367
>>> x.mean()
-9.0000075889406528

I could accept this difference. It's at the limit of the precision of 32-bit floats.
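How far off a float32 mean can be depends heavily on how the sum is accumulated, not just on float32 precision per value. As a rough illustration on seeded synthetic data (using cumsum to stand in for a naive left-to-right accumulator — an assumption for demonstration, not how pandas itself computes anything):

```python
import numpy as np

rng = np.random.default_rng(0)
x64 = rng.normal(-9.8, 0.05, size=900_000)
x32 = x64.astype(np.float32)

# np.sum / ndarray.mean accumulate pairwise, which limits rounding error
pairwise_mean = x32.mean()

# cumsum is forced into a sequential left-to-right float32 running sum;
# once the total passes 2**23 its spacing (ulp) reaches 1.0, so the
# low-order bits of each ~ -9.8 addend get rounded away
naive_mean = np.cumsum(x32, dtype=np.float32)[-1] / np.float32(len(x32))

print(pairwise_mean, naive_mean, x64.mean())
```

The pairwise result stays within noise of the float64 ground truth, while the sequential float32 sum drifts by orders of magnitude more.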

NEVERMIND. I wrote this on Friday and the solution hit me this morning. It's a floating-point precision problem exacerbated by the large amount of data. I needed to convert the data to 64-bit float when creating the DataFrame, this way:

df=pd.DataFrame(np.fromfile('FILENAME',dataType),dtype='float64')

I'll leave the post up in case anyone else runs into a similar issue.
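Applied to the synthetic repro above (the original FILENAME isn't reproducible here, so seeded random data stands in for it), widening to float64 at DataFrame creation brings the pandas mean back in line with the float64 ground truth:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x32 = rng.normal(-9.8, 0.05, size=900_000).astype(np.float32)

df32 = pd.DataFrame(x32, columns=['x'])                   # column stays float32
df64 = pd.DataFrame(x32, dtype='float64', columns=['x'])  # widened on creation

truth = x32.astype(np.float64).mean()
print(df32['x'].mean() - truth)  # can be far off if bottleneck is installed
print(df64['x'].mean() - truth)  # tiny either way
```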

Recommended answer

Short version:

The reason it's different is because pandas uses bottleneck (if it's installed) when calling the mean operation, as opposed to just relying on numpy. bottleneck is presumably used since it appears to be faster than numpy (at least on my machine), but at the cost of precision. They happen to match for the 64 bit version, but differ in 32 bit land (which is the interesting part).
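You can check directly whether bottleneck is what diverges on your machine. A small sketch (bottleneck is an optional pandas dependency, so the import is guarded; exact values depend on your RNG draw):

```python
import numpy as np

rng = np.random.default_rng(0)
x32 = rng.normal(-9.8, 0.05, size=900_000).astype(np.float32)

print("numpy nanmean:", np.nanmean(x32))

try:
    import bottleneck as bn
    print("bottleneck nanmean:", bn.nanmean(x32))  # may drift noticeably at float32
except ImportError:
    print("bottleneck not installed; pandas falls back to numpy")
```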

It's extremely difficult to tell what's going on just by inspecting the source code of these modules (they're quite complex, even for simple computations like mean; it turns out numerical computing is hard). It's best to use a debugger rather than brain-compiling the source, which invites exactly these kinds of mistakes. The debugger won't make an error in logic; it will tell you exactly what's going on.

Here's some of my stack trace (values differ slightly since no seed for RNG):

Reproducible (on Windows):

>>> import numpy as np; import pandas as pd
>>> x=np.random.normal(-9.,.005,size=900000)
>>> df=pd.DataFrame(x,dtype='float32',columns=['x'])
>>> df['x'].mean()
-9.0
>>> x.mean()
-9.0000037501099754
>>> x.astype(np.float32).mean()
-9.0000029

Nothing extraordinary going on with numpy's version. It's the pandas version that's a little wacky.

Let's have a look inside df['x'].mean():

>>> def test_it_2():
...   import pdb; pdb.set_trace()
...   df['x'].mean()
>>> test_it_2()
... # Some stepping/poking around that isn't important
(Pdb) l
2307
2308            if we have an ndarray as a value, then simply perform the operation,
2309            otherwise delegate to the object
2310
2311            """
2312 ->         delegate = self._values
2313            if isinstance(delegate, np.ndarray):
2314                # Validate that 'axis' is consistent with Series's single axis.
2315                self._get_axis_number(axis)
2316                if numeric_only:
2317                    raise NotImplementedError('Series.{0} does not implement '
(Pdb) delegate.dtype
dtype('float32')
(Pdb) l
2315                self._get_axis_number(axis)
2316                if numeric_only:
2317                    raise NotImplementedError('Series.{0} does not implement '
2318                                              'numeric_only.'.format(name))
2319                with np.errstate(all='ignore'):
2320 ->                 return op(delegate, skipna=skipna, **kwds)
2321
2322            return delegate._reduce(op=op, name=name, axis=axis, skipna=skipna,
2323                                    numeric_only=numeric_only,
2324                                    filter_type=filter_type, **kwds)

So we found the trouble spot, but now things get kind of weird:

(Pdb) op
<function nanmean at 0x000002CD8ACD4488>
(Pdb) op(delegate)
-9.0
(Pdb) delegate_64 = delegate.astype(np.float64)
(Pdb) op(delegate_64)
-9.000003749978807
(Pdb) delegate.mean()
-9.0000029
(Pdb) delegate_64.mean()
-9.0000037499788075
(Pdb) np.nanmean(delegate, dtype=np.float64)
-9.0000037499788075
(Pdb) np.nanmean(delegate, dtype=np.float32)
-9.0000029

Note that delegate.mean() and np.nanmean output -9.0000029 with type float32, not -9.0 as pandas' nanmean does. With a bit of poking around, you can find the source of pandas' nanmean in pandas.core.nanops. Interestingly, at first glance it looks like it should match numpy. Let's have a look at pandas' nanmean:

(Pdb) import inspect
(Pdb) src = inspect.getsource(op).split("\n")
(Pdb) for line in src: print(line)
@disallow('M8')
@bottleneck_switch()
def nanmean(values, axis=None, skipna=True):
    values, mask, dtype, dtype_max = _get_values(values, skipna, 0)

    dtype_sum = dtype_max
    dtype_count = np.float64
    if is_integer_dtype(dtype) or is_timedelta64_dtype(dtype):
        dtype_sum = np.float64
    elif is_float_dtype(dtype):
        dtype_sum = dtype
        dtype_count = dtype
    count = _get_counts(mask, axis, dtype=dtype_count)
    the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_sum))

    if axis is not None and getattr(the_sum, 'ndim', False):
        the_mean = the_sum / count
        ct_mask = count == 0
        if ct_mask.any():
            the_mean[ct_mask] = np.nan
    else:
        the_mean = the_sum / count if count > 0 else np.nan

    return _wrap_results(the_mean, dtype)

Here's a (short) version of the bottleneck_switch decorator:

import bottleneck as bn
...
class bottleneck_switch(object):

    def __init__(self, **kwargs):
        self.kwargs = kwargs

    def __call__(self, alt):
        bn_name = alt.__name__

        try:
            bn_func = getattr(bn, bn_name)
        except (AttributeError, NameError):  # pragma: no cover
            bn_func = None
    ...

                if (_USE_BOTTLENECK and skipna and
                        _bn_ok_dtype(values.dtype, bn_name)):
                    result = bn_func(values, axis=axis, **kwds)

This is called with alt as the pandas nanmean function, so bn_name is 'nanmean', and this is the attr that's grabbed from the bottleneck module:

(Pdb) l
 93                             result = np.empty(result_shape)
 94                             result.fill(0)
 95                             return result
 96
 97                     if (_USE_BOTTLENECK and skipna and
 98  ->                         _bn_ok_dtype(values.dtype, bn_name)):
 99                         result = bn_func(values, axis=axis, **kwds)
100
101                         # prefer to treat inf/-inf as NA, but must compute the fun
102                         # twice :(
103                         if _has_infs(result):
(Pdb) n
> d:\anaconda3\lib\site-packages\pandas\core\nanops.py(99)f()
-> result = bn_func(values, axis=axis, **kwds)
(Pdb) alt
<function nanmean at 0x000001D2C8C04378>
(Pdb) alt.__name__
'nanmean'
(Pdb) bn_func
<built-in function nanmean>
(Pdb) bn_name
'nanmean'
(Pdb) bn_func(values, axis=axis, **kwds)
-9.0

Pretend for a second that the bottleneck_switch() decorator doesn't exist. We can see that manually stepping through this function (without bottleneck) gets you the same result as numpy:

(Pdb) from pandas.core.nanops import _get_counts
(Pdb) from pandas.core.nanops import _get_values
(Pdb) from pandas.core.nanops import _ensure_numeric
(Pdb) values, mask, dtype, dtype_max = _get_values(delegate, skipna=skipna)
(Pdb) count = _get_counts(mask, axis=None, dtype=dtype)
(Pdb) count
900000.0
(Pdb) values.sum(axis=None, dtype=dtype) / count
-9.0000029

That never gets called, though, if you have bottleneck installed. Instead, the bottleneck_switch() decorator blasts over the nanmean function with bottleneck's version. That's where the discrepancy lies (interestingly, it matches in the float64 case, though):

(Pdb) import bottleneck as bn
(Pdb) bn.nanmean(delegate)
-9.0
(Pdb) bn.nanmean(delegate.astype(np.float64))
-9.000003749978807
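The float64 agreement is plausible even for a naive sequential sum: the running total (around 8 million) is nowhere near exhausting float64's 52-bit mantissa. A quick sketch using cumsum to mimic left-to-right accumulation (an approximation of what a simple summation loop does, not bottleneck's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(-9.0, 0.005, size=900_000)

# sequential float64 accumulation, as a plain C loop would do it
naive64 = np.cumsum(x)[-1] / len(x)

print(naive64 - x.mean())  # difference is down in the rounding noise
```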

bottleneck is used solely for speed, as far as I can tell. I'm assuming they're taking some type of shortcut with their nanmean function, but I didn't look into it much (see @ead's answer for details on this topic). You can see that it's typically a bit faster than numpy by their benchmarks: https://github.com/kwgoodman/bottleneck. Clearly the price to pay for this speed is precision.

Is bottleneck actually faster?

Sure looks like it (at least on my machine).

In [1]: import numpy as np; import pandas as pd

In [2]: x=np.random.normal(-9.8,.05,size=900000)

In [3]: y_32 = x.astype(np.float32)

In [13]: %timeit np.nanmean(y_32)
100 loops, best of 3: 5.72 ms per loop

In [14]: %timeit bn.nanmean(y_32)
1000 loops, best of 3: 854 µs per loop

It might be nice for pandas to introduce a flag here (one setting for speed, one for better precision; the default would be speed, since that's the current implementation). Some users care much more about the accuracy of the computation than the speed at which it happens.
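In the meantime, if you'd rather not store the column as float64, you can force a 64-bit accumulator just for the reduction. Both of these are ordinary pandas/numpy calls, sketched here on seeded synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.normal(-9.8, 0.05, size=900_000).astype(np.float32))

# widen a temporary copy just for the reduction
m1 = s.astype('float64').mean()

# keep float32 storage, but sum with a float64 accumulator
m2 = np.nanmean(s.to_numpy(), dtype=np.float64)

print(m1, m2)
```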

HTH.
