Split Python sequence (time series/array) into subsequences with overlap

Question

I need to extract all subsequences of a given window length from a time series/array. For example:

>>> ts = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> window = 3
>>> subsequences(ts, window)
array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4],
       [3, 4, 5],
       [4, 5, 6],
       [5, 6, 7],
       [6, 7, 8],
       [7, 8, 9]])

Naive methods that iterate over the sequence are of course expensive, for example:

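import pandas as pd

def subsequences(ts, window):
    res = []
    # One window starts at every position that still leaves room for `window` elements.
    for i in range(ts.size - window + 1):
        # reset_index returns a new Series, so the original is never modified.
        subts = ts.iloc[i:i + window].reset_index(drop=True)
        subts.name = None
        res.append(subts)
    return pd.DataFrame(res)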

I found a better way: copy the sequence, shift it by a different offset each time until the window is covered, and split each shifted copy with reshape. Performance is around 100x better, because the for loop iterates over the window size rather than the sequence size:

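import numpy as np
import pandas as pd

def subsequences(ts, window):
    values = ts.to_numpy()
    res = []
    for i in range(window):
        # Windows starting at offsets i, i + window, i + 2 * window, ...
        n = (values.size - i) // window
        res.append(values[i:i + n * window].reshape(n, window))
    # Rows come out grouped by offset, not ordered by start position.
    return pd.DataFrame(np.concatenate(res, axis=0))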

I've seen that pandas includes several rolling functions in the pandas.stats.moment module, and I guess what they do is somehow similar to the subsequencing problem. Is there anything in that module, or anywhere else in pandas, that would make this more efficient?

Thanks!

UPDATE (SOLUTION):

Based on @elyase's answer, there is a slightly simpler implementation for this specific case. Let me write it down here and explain what it does:

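import numpy as np

def subsequences(ts, window):
    # ts must be a 1-D numpy array (e.g. series.values), not a pandas Series.
    shape = (ts.size - window + 1, window)  # (number of windows, window length)
    strides = ts.strides * 2                # same byte step along rows and columns
    return np.lib.stride_tricks.as_strided(ts, shape=shape, strides=strides)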

Given the 1-D numpy array, we first compute the shape of the resulting array. There is one row starting at each position of the array, except for the last few positions, where not enough elements remain to complete a window.

In the first example of this question, the last position we can start at is 7, because starting at 8 we can't build a window of three elements. So the number of rows is the size minus the window plus one. The number of columns is simply the window.
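The row-count arithmetic above can be checked with a short sketch. The helper name window_shape is only illustrative and is not part of the original post:

```python
import numpy as np

def window_shape(n, window):
    # Hypothetical helper: one row per valid start position,
    # one column per element of the window.
    return (n - window + 1, window)

values = np.arange(10)                # the ten-element series from the question
shape = window_shape(values.size, 3)
starts = list(range(shape[0]))        # valid start positions

print(shape)    # (8, 3)
print(starts)   # [0, 1, 2, 3, 4, 5, 6, 7]
```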

Next, the tricky part is specifying how to fill the resulting array, given the shape we just defined.

To do this, we take the first element of the original array as the first element of the result. Then we need to specify two values (a tuple of two integers passed as the strides argument). These values specify the steps to take in the original (1-D) array to fill the new (2-D) one.

Consider a different example, where we want to implement the np.reshape function, going from a 1-D array of 9 elements to a 3x3 array. The first element fills the first position, and the one to its right is the next one in the 1-D array, so we move 1 step. Then, the tricky part: to fill the first element of the second row we must move 3 steps, from 0 to 3, see:

>>> original = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
>>> new = np.array([[0, 1, 2],
...                 [3, 4, 5],
...                 [6, 7, 8]])

So, to reshape, the steps would be 1 along a row and 3 between rows, which in numpy's (row, column) order is written (3, 1). In our case, where the windows overlap, it is actually simpler. When we move down a row, we start at the next position in the 1-D array, and when we move right we again take the next element, so it is 1 step in both directions. The steps are therefore (1, 1).
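As a sketch of the two stride patterns discussed above, the following compares the plain reshape case with the overlapping-window case on a 9-element array. Steps are counted in elements and multiplied by the element size, because as_strided expects byte strides in (row, column) order:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

original = np.arange(9)
step = original.itemsize  # bytes per element

# Reshape to 3x3: 3 elements between rows, 1 element between columns.
reshaped = as_strided(original, shape=(3, 3), strides=(3 * step, 1 * step))

# Overlapping windows of length 3: 1 element between rows, 1 between columns.
windows = as_strided(original, shape=(7, 3), strides=(1 * step, 1 * step))

print(reshaped)
print(windows)
```

Both results are views on original, so they should be treated as read-only.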

There is one last thing to note. The strides argument does not accept the "steps" we used, but the corresponding distances in bytes in memory. To get them, we can use the strides attribute of numpy arrays. It returns a tuple with the strides (steps in bytes), with one element per dimension. In our case we get a 1-element tuple and we want it twice, hence the * 2.
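A small illustration of byte strides, assuming arrays with an explicit dtype so the element size is known. Note that * 2 on a tuple repeats it and does not multiply the values:

```python
import numpy as np

a = np.arange(10, dtype=np.int64)
print(a.itemsize)      # 8 bytes per int64 element
print(a.strides)       # (8,)
print(a.strides * 2)   # (8, 8): tuple repetition, one stride per output dimension

b = a.astype(np.int32)
print(b.strides * 2)   # (4, 4): the byte step depends on the dtype
```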

The np.lib.stride_tricks.as_strided function fills the result using the described method without copying the data, which makes it quite efficient.
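Because no data is copied, the result shares memory with the input, which the sketch below makes visible. It also shows sliding_window_view, which is not mentioned in the original post; it is available in NumPy 1.20 and later and returns a read-only view by default, so it may be a safer choice where that version can be assumed:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided, sliding_window_view

a = np.arange(10)
view = as_strided(a, shape=(a.size - 3 + 1, 3), strides=a.strides * 2)
shares = np.shares_memory(a, view)   # True: the windows are a view, not a copy

a[1] = 100                           # visible in every window covering position 1
first_two = view[:2].tolist()        # [[0, 100, 2], [100, 2, 3]]

# Assumes NumPy >= 1.20.
safe = sliding_window_view(np.arange(10), 3)
print(safe.shape, safe.flags.writeable)
```

If the windows need to be modified independently, calling .copy() on the result gives an ordinary array at the cost of the extra memory.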

Finally, note that the function posted here assumes a 1-D input array (which is different from a 2-D array with a single row or column). Check the shape attribute of the input array: you should get something like (N,) and not (N, 1). This method would fail on the latter. Note that the method posted by @elyase handles 2-D input arrays, which is why this version is slightly simpler.
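One way to make the 1-D assumption explicit is a guard in front of the strided call. This is a sketch only; the name subsequences_1d and the error message are illustrative and not from the original post:

```python
import numpy as np

def subsequences_1d(ts, window):
    # Hypothetical guarded variant: reject anything that is not 1-D.
    ts = np.asarray(ts)
    if ts.ndim != 1:
        raise ValueError("expected shape (N,), got %r" % (ts.shape,))
    shape = (ts.size - window + 1, window)
    strides = ts.strides * 2
    return np.lib.stride_tricks.as_strided(ts, shape=shape, strides=strides)

column = np.arange(10).reshape(-1, 1)   # shape (10, 1): rejected by the guard
try:
    subsequences_1d(column, 3)
    rejected = False
except ValueError:
    rejected = True

flat = column.ravel()                   # shape (10,): accepted
result = subsequences_1d(flat, 3)
print(rejected, result.shape)
```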

Recommended answer

This is 34x faster than your fast version on my machine:

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

>>> rolling_window(ts.values, 3)
array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4],
       [3, 4, 5],
       [4, 5, 6],
       [5, 6, 7],
       [6, 7, 8],
       [7, 8, 9]])
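Since rolling_window builds its shape and strides from the last axis, it also accepts inputs with more dimensions. A brief sketch with a 2-D array (the input values are chosen only for illustration), where windows are taken along each row independently:

```python
import numpy as np

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

two_d = np.arange(10).reshape(2, 5)   # two rows of five elements
out = rolling_window(two_d, 3)

print(out.shape)   # (2, 3, 3): rows, windows per row, window length
print(out[1])      # windows over the second row [5, 6, 7, 8, 9]
```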

Credit to Erik Rigtorp.
