Build numpy array with multiple custom index ranges without explicit loop
Problem description
In NumPy, is there a Pythonic way to create array3 from the custom ranges defined by array1 and array2 without a loop? The straightforward solution of iterating over the ranges works, but since my arrays run into millions of items, I am looking for a more efficient solution (maybe syntactic sugar too).
For example,
import numpy as np
array1 = np.array([10, 65, 200])
array2 = np.array([14, 70, 204])
# One np.arange per (start, stop) pair; this is the loop to be avoided.
array3 = np.concatenate([np.arange(array1[i], array2[i]) for i in
                         np.arange(0, len(array1))])
print(array3)
Result: [10,11,12,13,65,66,67,68,69,200,201,202,203]
Recommended answer
Prospective Approach
I will work backwards to explain how to approach this problem.
Take the sample listed in the question. We have -
array1 = np.array([10, 65, 200])
array2 = np.array([14, 70, 204])
Now, look at the desired result -
result: [10,11,12,13,65,66,67,68,69,200,201,202,203]
Let's calculate the group lengths, as we will need them to explain the solution approach next.
In [58]: lens = array2 - array1
In [59]: lens
Out[59]: array([4, 5, 4])
The idea is to use an array initialized with 1's, which, when cumulatively summed across its entire length, gives us the desired result. This cumulative summation will be the last step of our solution. Why initialize with 1's? Because the desired result increases in steps of 1, except at the specific places where there are shifts corresponding to new groups coming in.
Now, since cumsum is the last step, the step before it should give us something like -
array([ 10, 1, 1, 1, 52, 1, 1, 1, 1, 131, 1, 1, 1])
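As a quick sanity check of this idea, the following sketch cumulatively sums the intermediate array shown above and compares it with the desired result. The names `steps` and `expected` are introduced here purely for illustration.

```python
import numpy as np

# Intermediate array from the text: 1's everywhere except at group starts.
steps = np.array([10, 1, 1, 1, 52, 1, 1, 1, 1, 131, 1, 1, 1])

# Desired result from the question.
expected = np.array([10, 11, 12, 13, 65, 66, 67, 68, 69, 200, 201, 202, 203])

# The running total turns the per-position increments into the ranges.
result = steps.cumsum()
print(np.array_equal(result, expected))  # prints True
```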
As discussed before, that intermediate array is all 1's, with [10,52,131] filled in at specific places. That 10 seems to come from the first element of array1, but what about the rest? The second one, 52, came in as 65-13 (looking at the result), where 13 is the last element of the group that started at 10 and ran for the length of the first group, 4. So, if we do 65 - 10 - 4, we get 51, and adding 1 to account for the exclusive stop boundary gives 52, which is the desired shifting value. Similarly, we get 131 for the third group (200 - 65 - 5 + 1).
Thus, those shifting-values can be computed like so -
In [62]: np.diff(array1) - lens[:-1]+1
Out[62]: array([ 52, 131])
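Since lens is array2 - array1, the same shifting values can also be read as "start of the next group minus the last element of the previous group". The sketch below checks that equivalence on the sample. The names `shifts_a`, `shifts_b` and `last_elems` are illustrative and not part of the original answer.

```python
import numpy as np

array1 = np.array([10, 65, 200])
array2 = np.array([14, 70, 204])
lens = array2 - array1

# Formula from the text.
shifts_a = np.diff(array1) - lens[:-1] + 1

# Last element of every group except the final one (the stop is exclusive).
last_elems = array2[:-1] - 1

# Jump needed to go from the end of one group to the start of the next.
shifts_b = array1[1:] - last_elems

print(shifts_a, shifts_b)  # both are [ 52 131]
```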
Next, to get the shifting-places where such shifts occur, we can simply do a cumulative summation on the group lengths -
In [65]: lens[:-1].cumsum()
Out[65]: array([4, 9])
For completeness, we need to prepend 0 to the array of shifting-places and array1[0] to the array of shifting-values.
So, we are set to present our approach in a step-by-step format!
1] Get lengths of each group :
lens = array2 - array1
2] Get the indices at which shifts occur and the values to be put in the 1's initialized array :
shift_idx = np.hstack((0,lens[:-1].cumsum()))
shift_vals = np.hstack((array1[0],np.diff(array1) - lens[:-1]+1))
3] Set up the 1's initialized ID array and insert those values at the indices listed in the previous step :
id_arr = np.ones(lens.sum(),dtype=array1.dtype)
id_arr[shift_idx] = shift_vals
4] Finally, do cumulative summation on the ID array :
output = id_arr.cumsum()
Listed in a function format, we would have -
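def using_ones_cumsum(array1, array2):
    # Length of each range
    lens = array2 - array1
    # Positions where a new range starts, and the jump needed at each one
    shift_idx = np.hstack((0,lens[:-1].cumsum()))
    shift_vals = np.hstack((array1[0],np.diff(array1) - lens[:-1]+1))
    # 1's everywhere, jumps at the range starts, then a running total
    id_arr = np.ones(lens.sum(),dtype=array1.dtype)
    id_arr[shift_idx] = shift_vals
    return id_arr.cumsum()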
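To gain some confidence in the function, we can compare it against the loop-based concatenation from the question. The sketch below repeats the function so that it runs on its own; `loop_ranges` is a hypothetical helper name for the baseline.

```python
import numpy as np

def using_ones_cumsum(array1, array2):
    lens = array2 - array1
    shift_idx = np.hstack((0, lens[:-1].cumsum()))
    shift_vals = np.hstack((array1[0], np.diff(array1) - lens[:-1] + 1))
    id_arr = np.ones(lens.sum(), dtype=array1.dtype)
    id_arr[shift_idx] = shift_vals
    return id_arr.cumsum()

def loop_ranges(array1, array2):
    # Baseline from the question: one np.arange per (start, stop) pair.
    return np.concatenate([np.arange(a, b) for a, b in zip(array1, array2)])

array1 = np.array([10, 65, 200])
array2 = np.array([14, 70, 204])

vectorized = using_ones_cumsum(array1, array2)
baseline = loop_ranges(array1, array2)
print(np.array_equal(vectorized, baseline))  # prints True
```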
It also works for overlapping ranges!
In [67]: array1 = np.array([10, 11, 200])
...: array2 = np.array([14, 18, 204])
...:
In [68]: using_ones_cumsum(array1, array2)
Out[68]:
array([ 10, 11, 12, 13, 11, 12, 13, 14, 15, 16, 17, 200, 201,
202, 203])
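One case the examples here do not cover is a range of zero length (array1[i] == array2[i]). In that case shift_idx appears to contain repeated indices, so one shift overwrites another and the output comes out wrong. A minimal sketch of one possible guard, assuming empty ranges can simply be dropped, is to filter them out first. The wrapper name `ranges_skip_empty` is hypothetical and not part of the original answer.

```python
import numpy as np

def using_ones_cumsum(array1, array2):
    lens = array2 - array1
    shift_idx = np.hstack((0, lens[:-1].cumsum()))
    shift_vals = np.hstack((array1[0], np.diff(array1) - lens[:-1] + 1))
    id_arr = np.ones(lens.sum(), dtype=array1.dtype)
    id_arr[shift_idx] = shift_vals
    return id_arr.cumsum()

def ranges_skip_empty(array1, array2):
    # Keep only ranges with at least one element. This also drops ranges
    # whose stop is below their start, which np.arange would leave empty too.
    keep = array2 > array1
    if not keep.any():
        return np.empty(0, dtype=array1.dtype)
    return using_ones_cumsum(array1[keep], array2[keep])

starts = np.array([10, 20, 30])
ends = np.array([12, 20, 33])   # the middle range is empty
print(ranges_skip_empty(starts, ends))  # prints [10 11 30 31 32]
```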
Runtime test
Let's time the proposed approach against the other vectorized approach in @unutbu's flatnonzero based solution, which already proved to be much better than the loopy approach -
In [38]: array1, array2 = (np.random.choice(range(1, 11), size=10**4, replace=True)
...: .cumsum().reshape(2, -1, order='F'))
In [39]: %timeit using_flatnonzero(array1, array2)
1000 loops, best of 3: 889 µs per loop
In [40]: %timeit using_ones_cumsum(array1, array2)
1000 loops, best of 3: 235 µs per loop
Improvement!
Now, codewise, NumPy doesn't like appending. So, those np.hstack calls can be avoided for a slightly improved version, as listed below -
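def get_ranges_arr(starts,ends):
    # Length of each range and the running total of lengths
    counts = ends - starts
    counts_csum = counts.cumsum()
    # 1's everywhere; write the jumps in place instead of using np.hstack
    id_arr = np.ones(counts_csum[-1],dtype=int)
    id_arr[0] = starts[0]
    id_arr[counts_csum[:-1]] = starts[1:] - ends[:-1] + 1
    return id_arr.cumsum()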
Let's time it against our original approach -
In [151]: array1,array2 = (np.random.choice(range(1, 11),size=10**4, replace=True)\
...: .cumsum().reshape(2, -1, order='F'))
In [152]: %timeit using_ones_cumsum(array1, array2)
1000 loops, best of 3: 276 µs per loop
In [153]: %timeit get_ranges_arr(array1, array2)
10000 loops, best of 3: 193 µs per loop
So, we have a 30% performance boost there!
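The timings above come from IPython's %timeit. For anyone wanting to repeat the comparison in a plain script, here is a rough sketch using the standard timeit module. The random generator seed, the number/repeat settings and the names `bounds`, `t_old` and `t_new` are arbitrary choices for illustration, and the numbers will differ from machine to machine.

```python
import timeit
import numpy as np

def using_ones_cumsum(array1, array2):
    lens = array2 - array1
    shift_idx = np.hstack((0, lens[:-1].cumsum()))
    shift_vals = np.hstack((array1[0], np.diff(array1) - lens[:-1] + 1))
    id_arr = np.ones(lens.sum(), dtype=array1.dtype)
    id_arr[shift_idx] = shift_vals
    return id_arr.cumsum()

def get_ranges_arr(starts, ends):
    counts = ends - starts
    counts_csum = counts.cumsum()
    id_arr = np.ones(counts_csum[-1], dtype=int)
    id_arr[0] = starts[0]
    id_arr[counts_csum[:-1]] = starts[1:] - ends[:-1] + 1
    return id_arr.cumsum()

# Same construction as in the text: a strictly increasing sequence whose
# even positions become the starts and odd positions the stops.
rng = np.random.default_rng(0)
bounds = rng.integers(1, 11, size=10**4).cumsum()
array1, array2 = bounds.reshape(2, -1, order='F')

# Both versions should agree before their timings are compared.
same = np.array_equal(using_ones_cumsum(array1, array2),
                      get_ranges_arr(array1, array2))

# Best of 3 repeats, 100 calls each, reported per call in microseconds.
t_old = min(timeit.repeat(lambda: using_ones_cumsum(array1, array2),
                          number=100, repeat=3)) / 100
t_new = min(timeit.repeat(lambda: get_ranges_arr(array1, array2),
                          number=100, repeat=3)) / 100
print(same, f"{t_old * 1e6:.0f} us", f"{t_new * 1e6:.0f} us")
```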