Efficient Numpy 2D array construction from 1D array

Problem description

I have an array like this:

A = array([1,2,3,4,5,6,7,8,9,10])

And I am trying to get an array like this:

B = array([[1,2,3],
          [2,3,4],
          [3,4,5],
          [4,5,6]])

Each row (of a fixed, arbitrary width) is shifted by one relative to the previous row. A is 10k records long, and I'm trying to find an efficient way of doing this in Numpy. Currently I am using vstack and a for loop, which is slow. Is there a faster way?

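import numpy as np

width = 3  # fixed arbitrary width
length = 10000  # length of A which I wish to use
A = np.arange(1, length + 1)  # stand-in for the real 10k-record array
B = A[0:width]  # first row
for i in range(1, length - width + 1):
    # every vstack call copies all of B, which is why this loop is slow
    B = np.vstack((B, A[i:i + width]))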

Recommended answer

Actually, there's an even more efficient way to do this... The downside to using vstack etc, is that you're making a copy of the array.

Incidentally, this is effectively identical to @Paul's answer, but I'm posting this just to explain things in a bit more detail...

There's a way to do this with just views so that no memory is duplicated.

I'm directly borrowing this from Erik Rigtorp's post to numpy-discussion, who in turn, borrowed it from Keith Goodman's Bottleneck (Which is quite useful!).

The basic trick is to directly manipulate the strides of the array (For one-dimensional arrays):

import numpy as np

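def rolling(a, window):
    # One row per window position; each row holds `window` items.
    shape = (a.size - window + 1, window)
    # Both axes advance by a single item, so consecutive rows overlap.
    # (a.itemsize equals the stride only for a contiguous 1-D array.)
    strides = (a.itemsize, a.itemsize)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

a = np.arange(10)
print(rolling(a, 3))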

Where a is your input array and window is the length of the window that you want (3, in your case).

This yields:

[[0 1 2]
 [1 2 3]
 [2 3 4]
 [3 4 5]
 [4 5 6]
 [5 6 7]
 [6 7 8]
 [7 8 9]]

However, there is absolutely no duplication of memory between the original a and the returned array. This means that it's fast and scales much better than other options.
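
One way to check the "no duplication" claim is sketched below. It repeats the `rolling` function for a contiguous 1-D array (using `a.strides[0]` in place of `a.itemsize`, which is the same value here) and relies on `np.shares_memory`; the variable names are only illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def rolling(a, window):
    # Same idea as the answer's function, for a contiguous 1-D array.
    shape = (a.size - window + 1, window)
    strides = (a.strides[0], a.strides[0])
    return as_strided(a, shape=shape, strides=strides)

a = np.arange(10)
view = rolling(a, 3)

# The view reuses a's buffer instead of copying it.
shares = np.shares_memory(a, view)

# Because the memory is shared, a write to `a` appears in every row
# of the view that overlaps the modified element.
a[2] = 99
rows_with_99 = [int(i) for i in np.nonzero((view == 99).any(axis=1))[0]]

print(shares)        # True
print(rows_with_99)  # [0, 1, 2]
```

The flip side is that writing through the view would also modify `a`, and would touch several logical rows at once, so it is usually safer to treat the result as read-only.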

For example (using a = np.arange(100000) and window=3):

%timeit np.vstack([a[i:i-window] for i in range(window)]).T
1000 loops, best of 3: 256 us per loop

%timeit rolling(a, window)
100000 loops, best of 3: 12 us per loop
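
As an alternative that is not part of the original answer: NumPy 1.20 and later include `numpy.lib.stride_tricks.sliding_window_view`, which builds the same kind of view without hand-written strides. The sketch below assumes such a NumPy version is installed.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.arange(10)

# One row per window position, like rolling(a, 3).
windows = sliding_window_view(a, window_shape=3)

print(windows.shape)            # (8, 3)
print(windows[3])               # [3 4 5]
print(windows.flags.writeable)  # False: the view is read-only by default
```

The read-only default guards against accidentally writing to overlapping rows.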

If we generalize this to a "rolling window" along the last axis for an N-dimensional array, we get Erik Rigtorp's "rolling window" function:

import numpy as np

def rolling_window(a, window):
   """
   Make an ndarray with a rolling window of the last dimension

   Parameters
   ----------
   a : array_like
       Array to add rolling window to
   window : int
       Size of rolling window

   Returns
   -------
   Array that is a view of the original array with an added dimension
   of size `window`.

   Examples
   --------
   >>> x=np.arange(10).reshape((2,5))
   >>> rolling_window(x, 3)
   array([[[0, 1, 2], [1, 2, 3], [2, 3, 4]],
          [[5, 6, 7], [6, 7, 8], [7, 8, 9]]])

   Calculate rolling mean of last dimension:
   >>> np.mean(rolling_window(x, 3), -1)
   array([[ 1.,  2.,  3.],
          [ 6.,  7.,  8.]])

   """
   if window < 1:
       raise ValueError("`window` must be at least 1.")
   if window > a.shape[-1]:
       raise ValueError("`window` is too long.")
   shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
   strides = a.strides + (a.strides[-1],)
   return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

So, let's look into what's going on here... Manipulating an array's strides may seem a bit magical, but once you understand what's going on, it's not at all. The strides of a numpy array describe the size in bytes of the steps that must be taken to increment one value along a given axis. So, in the case of a 1-dimensional array of 64-bit values (floats or integers), each item is 8 bytes long, and x.strides is (8,).

x = np.arange(9)
print(x.strides)

Now, if we reshape this into a 2D, 3x3 array, the strides will be (3 * 8, 8), as we would have to jump 24 bytes to increment one step along the first axis, and 8 bytes to increment one step along the second axis.

y = x.reshape(3,3)
print(y.strides)
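
Since strides are measured in bytes, they depend on the item size as well as the shape. The sketch below sets the dtype explicitly so that the numbers do not depend on the platform's default integer type; `y32` is just an illustrative name.

```python
import numpy as np

x = np.arange(9, dtype=np.float64)   # 8 bytes per item
y = x.reshape(3, 3)

# The same shape with 4-byte items halves every stride.
y32 = y.astype(np.float32)

print(x.strides)    # (8,)
print(y.strides)    # (24, 8)
print(y32.strides)  # (12, 4)
```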

Similarly a transpose is the same as just reversing the strides of an array:

print(y)
y.strides = y.strides[::-1]
print(y)
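
The same relationship can be seen without assigning to `strides`, by comparing an array with its `.T` view. This is a small sketch with an explicit dtype so the byte counts are predictable:

```python
import numpy as np

y = np.arange(9, dtype=np.float64).reshape(3, 3)
t = y.T  # a view, not a copy

print(y.strides)  # (24, 8)
print(t.strides)  # (8, 24)

# Both arrays describe the same buffer; only the stepping differs.
same_buffer = np.shares_memory(y, t)
print(same_buffer)           # True
print(t[0, 1] == y[1, 0])    # True
```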

Clearly, the strides of an array and the shape of an array are intimately linked. If we change one, we have to change the other accordingly, otherwise we won't have a valid description of the memory buffer that actually holds the values of the array.

Therefore, if you want to change both the shape and size of an array simultaneously, you can't do it just by setting x.strides and x.shape, even if the new strides and shape are compatible.

That's where numpy.lib.stride_tricks.as_strided comes in. It's actually a very simple function that just sets the strides and shape of an array simultaneously.

It checks that the two are compatible, but not that the old strides and new shape are compatible, as would happen if you set the two independently. (It actually does this through numpy's __array_interface__, which allows arbitrary classes to describe a memory buffer as a numpy array.)
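
Whatever checks are performed, `as_strided` does not confirm that the requested shape and strides stay inside the original buffer, so a wrong shape can expose unrelated memory. A defensive wrapper might look like the sketch below. The name `rolling_readonly` is hypothetical, and the `writeable=False` argument assumes NumPy 1.12 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def rolling_readonly(a, window):
    """Hypothetical helper: validated, read-only rolling view of a 1-D array."""
    a = np.asarray(a)
    if a.ndim != 1:
        raise ValueError("expected a 1-D array")
    if not 1 <= window <= a.size:
        raise ValueError("`window` must be between 1 and a.size")
    step = a.strides[0]  # also correct for non-contiguous 1-D input
    return as_strided(a, shape=(a.size - window + 1, window),
                      strides=(step, step), writeable=False)

a = np.arange(10)
v = rolling_readonly(a, 3)

# Writing through the view is rejected, so `a` cannot be changed by accident.
try:
    v[0, 0] = 100
    write_blocked = False
except ValueError:
    write_blocked = True

print(v.shape)        # (8, 3)
print(write_blocked)  # True
```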

So, all we've done is make the array step one item forward (8 bytes in the case of a 64-bit array) along one axis, and also step only 8 bytes forward along the other axis.

In other words, with a "window" size of 3, the array has a shape of (whatever, 3), but instead of stepping a full 3 * x.itemsize to get from one row to the next, it only steps one item forward, effectively making the rows of the new array a "moving window" view into the original array.
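
The stepping can be spelled out as arithmetic: element (i, j) of the view sits i * strides[0] + j * strides[1] bytes from the start of the buffer, which here is item i + j of the original array. The sketch below checks this; `byte_offset` is an illustrative helper, not a NumPy function.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.arange(10, dtype=np.float64)
window = 3
step = a.strides[0]
v = as_strided(a, shape=(a.size - window + 1, window),
               strides=(step, step), writeable=False)

def byte_offset(i, j):
    # Distance in bytes from the start of the buffer to element (i, j).
    return i * v.strides[0] + j * v.strides[1]

# Every element of the view matches the original item at that offset.
ok = all(v[i, j] == a[byte_offset(i, j) // a.itemsize]
         for i in range(v.shape[0]) for j in range(v.shape[1]))

print(byte_offset(2, 1))  # 24, i.e. item 3 of `a`
print(ok)                 # True
```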

(This also means that x.shape[0] * x.shape[1] will not be the same as x.size for your new array.)
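
To put numbers on that, the sketch below compares element counts for a 10-item array and a window of 3. The view reports more elements than the buffer actually holds, and only an explicit copy allocates memory for all of them.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.arange(10)
window = 3
step = a.strides[0]
v = as_strided(a, shape=(a.size - window + 1, window),
               strides=(step, step), writeable=False)

print(a.size)  # 10 items actually stored
print(v.size)  # 24 items visible through the view (8 rows x 3 columns)

# Copying materialises all 24 items in a new, independent buffer.
c = v.copy()
print(c.size)                   # 24
print(np.shares_memory(a, c))   # False
```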

At any rate, hopefully that makes things slightly clearer.
