NumPy - Faster Operations on Masked Array?

Question

I have a numpy array:

import numpy as np
arr = np.random.rand(100)

If I want to find its maximum value, I run np.amax, which runs 155,357 times a second on my machine.

However, for some reason, I have to mask some of its values. Let's, for example, mask just one cell:

import numpy.ma as ma
arr = ma.masked_array(arr, mask=[0]*99 + [1])

Now, finding the max is much slower, running 26,574 times a second.

This is only 17% of the speed of this operation on a non-masked array.

Other operations, for example subtract, add, and multiply, are affected too. Although on a masked array they operate on ALL OF THE VALUES, they run at only about 3% of the speed of a non-masked array (15,343/497,663 calls per second).
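The figures above are specific to one machine. A rough way to reproduce the comparison is sketched below with timeit; the helper calls_per_second and the repeat count are assumptions made for illustration, and the absolute numbers will differ from the ones quoted in the question.

```python
import timeit

import numpy as np
import numpy.ma as ma

rng = np.random.default_rng(0)
plain = rng.random(100)
# Same data, with only the last cell masked.
masked = ma.masked_array(plain, mask=[0] * 99 + [1])


def calls_per_second(func, number=2000):
    """Rough throughput estimate (hypothetical helper): calls of func() per second."""
    seconds = timeit.timeit(func, number=number)
    return number / seconds


rates = {
    "amax, plain": calls_per_second(lambda: np.amax(plain)),
    "amax, masked": calls_per_second(lambda: np.amax(masked)),
    "add, plain": calls_per_second(lambda: plain + plain),
    "add, masked": calls_per_second(lambda: masked + masked),
}
for label, rate in rates.items():
    print(f"{label:>12}: {rate:12,.0f} calls/s")
```

Dividing the masked rate by the plain rate gives the kind of percentage quoted in the question.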

I'm looking for a faster way to operate on masked arrays like this, whether it's using numpy or not.

(I need to run this on real data, which consists of arrays with multiple dimensions and millions of cells.)

Recommended answer

MaskedArray is a subclass of the base numpy ndarray. It does not have compiled code of its own. Look at the numpy/ma/ directory for details, or the main file:

/usr/local/lib/python3.6/dist-packages/numpy/ma/core.py

A masked array has two key attributes, data and mask: one is the data array you used to create it, the other is a boolean array of the same size.
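As a small illustration of these two attributes (the variable names values and marr are made up for this sketch):

```python
import numpy as np

values = np.arange(5.0)
marr = np.ma.masked_array(values, mask=[0, 0, 0, 0, 1])

# MaskedArray is an ndarray subclass.
print(isinstance(marr, np.ndarray))

# .data is a plain ndarray holding every value, including the masked one.
print(type(marr.data), marr.data)

# .mask is a boolean array with the same shape as the data.
print(marr.mask.dtype, marr.mask)
```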

So all operations have to take those two arrays into account. Not only does an operation calculate new data, it also has to calculate a new mask.
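For example, adding two masked arrays produces a result whose mask combines both input masks. A minimal sketch, with arbitrary example values:

```python
import numpy as np

a = np.ma.masked_array([1.0, 2.0, 3.0], mask=[1, 0, 0])
b = np.ma.masked_array([10.0, 20.0, 30.0], mask=[0, 0, 1])

total = a + b

# A position is masked in the result if it was masked in either operand.
print(total)
print(total.mask)
```

So even a simple addition involves work on the masks in addition to the arithmetic on the data.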

It can take several approaches (depending on the operation):

  • use the data as is
  • use compressed data - a new array with the masked values removed
  • use filled data, where the masked values are replaced by the fill_value or some innocuous value (e.g. 0 when doing addition, 1 when doing multiplication)
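The three approaches can be tried directly on a small array. This is only a sketch of what each view of the data looks like, not a description of which approach any particular numpy.ma function uses internally:

```python
import numpy as np

marr = np.ma.masked_array([1.0, 2.0, 3.0, 4.0], mask=[0, 0, 0, 1])

raw = marr.data           # data as is: the masked 4.0 is still there
kept = marr.compressed()  # new array with the masked value removed
zeros = marr.filled(0)    # masked value replaced by 0 (harmless for a sum)
ones = marr.filled(1)     # masked value replaced by 1 (harmless for a product)

print(raw, kept, zeros, ones)

# Compressed and suitably filled data agree with the masked reductions.
print(kept.sum(), zeros.sum(), marr.sum())
print(ones.prod(), marr.prod())
```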

The number of masked values, 0 or all, makes little, if any, difference in speed.

So the speed differences that you see are not surprising. There's a lot of extra calculation going on. The ma/core.py file says this package was first developed in pre-numpy days, and incorporated into numpy around 2005. While there have been changes to keep it up to date, I don't think it has been significantly reworked.

Here is the code for the np.ma.max method:

def max(self, axis=None, out=None, fill_value=None, keepdims=np._NoValue):

    kwargs = {} if keepdims is np._NoValue else {'keepdims': keepdims}

    _mask = self._mask
    newmask = _check_mask_axis(_mask, axis, **kwargs)
    if fill_value is None:
        fill_value = maximum_fill_value(self)
    # No explicit output
    if out is None:
        result = self.filled(fill_value).max(
            axis=axis, out=out, **kwargs).view(type(self))
        if result.ndim:
            # Set the mask
            result.__setmask__(newmask)
            # Get rid of Infs
            if newmask.ndim:
                np.copyto(result, result.fill_value, where=newmask)
        elif newmask:
            result = masked
        return result
    # Explicit output
    ....

The key steps are:

fill_value = maximum_fill_value(self)  # depends on dtype
self.filled(fill_value).max(
            axis=axis, out=out, **kwargs).view(type(self))

You can experiment with filled to see what happens with your array.

In [40]: arr = np.arange(10.)                                                                                        
In [41]: arr                                                                                                         
Out[41]: array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
In [42]: Marr = np.ma.masked_array(arr, mask=[0]*9 + [1])                                                            
In [43]: Marr                                                                                                        
Out[43]: 
masked_array(data=[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, --],
             mask=[False, False, False, False, False, False, False, False,
                   False,  True],
       fill_value=1e+20)
In [44]: np.ma.maximum_fill_value(Marr)                                                                              
Out[44]: -inf
In [45]: Marr.filled()                                                                                               
Out[45]: 
array([0.e+00, 1.e+00, 2.e+00, 3.e+00, 4.e+00, 5.e+00, 6.e+00, 7.e+00,
       8.e+00, 1.e+20])
In [46]: Marr.filled(_44)                                                                                            
Out[46]: array([  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8., -inf])
In [47]: arr.max()                                                                                                   
Out[47]: 9.0
In [48]: Marr.max()                                                                                                  
Out[48]: 8.0
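Since the masked max boils down to filling and then calling the ordinary max, one possible workaround is to keep the data and the boolean mask as two plain ndarrays and do the fill step yourself. The sketch below assumes float data; masked_max is a hypothetical helper, not part of numpy, and whether it is actually faster on your data should be measured rather than assumed.

```python
import numpy as np


def masked_max(data, mask):
    """Maximum over entries where mask is False (hypothetical helper).

    Assumes a float dtype so that -inf can serve as the fill value.
    If every entry is masked, this returns -inf rather than a masked result.
    """
    filled = np.where(mask, -np.inf, data)
    return filled.max()


rng = np.random.default_rng(0)
data = rng.random(100)
mask = np.zeros(100, dtype=bool)
mask[-1] = True  # mask just the last cell

marr = np.ma.masked_array(data, mask=mask)
print(masked_max(data, mask), marr.max())
```

This skips the bookkeeping that MaskedArray does for the result mask, at the cost of handling the mask by hand in every operation.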
