Python/NumPy 第一次出现子数组 [英] Python/NumPy first occurrence of subarray

查看:30
本文介绍了Python/NumPy 第一次出现子数组的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在 Python 或 NumPy 中,找出子数组第一次出现的最佳方法是什么?

例如,我有

a = [1, 2, 3, 4, 5, 6]b = [2, 3, 4]

找出 b 在 a 中的位置的最快方法(运行时)是什么?我知道对于字符串来说这非常容易,但是对于列表或 numpy ndarray 呢?

非常感谢!

[已编辑] 我更喜欢 numpy 解决方案,因为根据我的经验,numpy 向量化比 Python 列表理解要快得多.同时,大数组很大,所以我不想将其转换为字符串;那会(太)长.

解决方案

我假设您正在寻找特定于 numpy 的解决方案,而不是简单的列表理解或 for 循环.一种直接的方法是使用滚动窗口技术来搜索窗口大小合适.

这种方法很简单,工作正常,而且比任何纯 Python 解决方案都要快得多.对于许多用例来说,它应该足够了.但是,由于多种原因,这并不是最有效的方法.对于更复杂但在预期情况下渐近最优的方法,请参阅基于 numba滚动散列norok2的答案中的实现.

这是rolling_window函数:

<预><代码>>>>定义滚动窗口(a,大小):... shape = a.shape[:-1] + (a.shape[-1] - size + 1, size)... strides = a.strides + (a.strides[-1],)...返回 numpy.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)...

然后你可以做类似的事情

<预><代码>>>>a = numpy.arange(10)>>>numpy.random.shuffle(a)>>>一种数组([7, 3, 6, 8, 4, 0, 9, 2, 1, 5])>>>滚动窗口(a, 3) == [8, 4, 0]数组([[假,假,假],[假,假,假],[假,假,假],[真实,真实,真实],[假,假,假],[假,假,假],[假,假,假],[假,假,假]],dtype=bool)

要使其真正有用,您必须使用 all 沿轴 1 减少它:

<预><代码>>>>numpy.all(rolling_window(a, 3) == [8, 4, 0],axis=1)数组([假,假,假,真,假,假,假,假],dtype=bool)

然后你可以使用它,但是你会使用一个布尔数组.获取索引的简单方法:

<预><代码>>>>bool_indices = numpy.all(rolling_window(a, 3) == [8, 4, 0],axis=1)>>>numpy.mgrid[0:len(bool_indices)][bool_indices]数组([3])

对于列表,您可以调整这些滚动窗口迭代器之一使用类似的方法.

对于非常大型数组和子数组,您可以像这样节省内存:

<预><代码>>>>窗口 = 滚动窗口(a,3)>>>子 = [8, 4, 0]>>>点击数 = numpy.ones((len(a) - len(sub) + 1,), dtype=bool)>>>对于 i, x 在 enumerate(sub):... 点击 &= numpy.in1d(windows[:,i], [x])...>>>命中数组([假,假,假,真,假,假,假,假],dtype=bool)>>>hits.nonzero()(数组([3]),)

另一方面,这可能会慢一些.

In Python or NumPy, what is the best way to find out the first occurrence of a subarray?

For example, I have

a = [1, 2, 3, 4, 5, 6]
b = [2, 3, 4]

What is the fastest way (run-time-wise) to find out where b occurs in a? I understand for strings this is extremely easy, but what about for a list or numpy ndarray?

Thanks a lot!

[EDITED] I prefer the numpy solution, since from my experience numpy vectorization is much faster than Python list comprehension. Meanwhile, the big array is huge, so I don't want to convert it into a string; that will be (too) long.

解决方案

I'm assuming you're looking for a numpy-specific solution, rather than a simple list comprehension or for loop. One straightforward approach is to use the rolling window technique to search for windows of the appropriate size.

This approach is simple, works correctly, and is much faster than any pure Python solution. It should be sufficient for many use cases. However, it is not the most efficient approach possible, for a number of reasons. For an approach that is more complicated, but asymptotically optimal in the expected case, see the numba-based rolling hash implementation in norok2's answer.

Here's the rolling_window function:

>>> def rolling_window(a, size):
...     shape = a.shape[:-1] + (a.shape[-1] - size + 1, size)
...     strides = a.strides + (a. strides[-1],)
...     return numpy.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
... 

Then you could do something like

>>> a = numpy.arange(10)
>>> numpy.random.shuffle(a)
>>> a
array([7, 3, 6, 8, 4, 0, 9, 2, 1, 5])
>>> rolling_window(a, 3) == [8, 4, 0]
array([[False, False, False],
       [False, False, False],
       [False, False, False],
       [ True,  True,  True],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [False, False, False]], dtype=bool)

To make this really useful, you'd have to reduce it along axis 1 using all:

>>> numpy.all(rolling_window(a, 3) == [8, 4, 0], axis=1)
array([False, False, False,  True, False, False, False, False], dtype=bool)

Then you could use that however you'd use a boolean array. A simple way to get the index out:

>>> bool_indices = numpy.all(rolling_window(a, 3) == [8, 4, 0], axis=1)
>>> numpy.mgrid[0:len(bool_indices)][bool_indices]
array([3])

For lists you could adapt one of these rolling window iterators to use a similar approach.

For very large arrays and subarrays, you could save memory like this:

>>> windows = rolling_window(a, 3)
>>> sub = [8, 4, 0]
>>> hits = numpy.ones((len(a) - len(sub) + 1,), dtype=bool)
>>> for i, x in enumerate(sub):
...     hits &= numpy.in1d(windows[:,i], [x])
... 
>>> hits
array([False, False, False,  True, False, False, False, False], dtype=bool)
>>> hits.nonzero()
(array([3]),)

On the other hand, this will probably be somewhat slower.

这篇关于Python/NumPy 第一次出现子数组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆