Speed up sub-array shuffling and storing
Question
I have a list of integers (di), and another list (rang_indx) made up of numpy sub-arrays of integers (code below). For each of these sub-arrays, I need to store in a separate list (indx) a number of random elements, where that number is given by the corresponding entry of the di list.
As far as I can see, np.random.shuffle() will not shuffle the elements within the sub-arrays but rather the sub-arrays themselves within rang_indx, which is not what I need. Hence, I need a for loop to first shuffle each sub-array in place, and then a second loop (combined with zip()) to generate the indx list.
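The behavior described above can be checked on a small ragged list. This is only a sketch: the toy values are made up for illustration and are not the data from the question.

```python
import numpy as np

np.random.seed(0)  # fixed seed only to make the sketch repeatable

# Toy ragged list (hypothetical values, not the data from the question).
toy = [np.array([1, 2, 3]), np.array([10, 20]), np.array([100, 200, 300, 400])]
before = [a.copy() for a in toy]

# Shuffling the outer list only reorders which sub-array sits where;
# the contents of every sub-array keep their original order.
np.random.shuffle(toy)
outer_only = all(any(np.array_equal(a, b) for b in before) for a in toy)

# Shuffling one sub-array in place permutes its elements instead.
np.random.shuffle(toy[0])
same_elements = sorted(toy[0].tolist()) in [sorted(b.tolist()) for b in before]
```

Here outer_only confirms that the first call left every sub-array intact, which is why the per-sub-array loop is needed.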
This function is called millions of times as part of a larger code. Is there a way I can speed up the process?
import numpy as np

def func(di, rang_indx):
    # Shuffle each sub-array in place.
    for arr in rang_indx:
        np.random.shuffle(arr)
    # For each shuffled sub-array, only keep as many elements as those
    # indicated by the 'di' array.
    indx = [arr[:i] for (arr, i) in zip(rang_indx, di.astype(int))]
    return indx

# This data is not fixed, and will change with each call to func()
di = np.array([ 4., 2., 0., 600., 12., 22., 13., 21., 25., 25., 12., 11.,
7., 12., 10., 13., 5., 10.])
rang_indx = [np.array([]), np.array([189, 195, 209, 214, 236, 237, 255, 286, 290, 296, 301, 304, 321,
323, 327, 329]), np.array([164, 171, 207, 217, 225, 240, 250, 263, 272, 279, 284, 285, 289]), np.array([101, 162, 168, 177, 179, 185, 258, 261, 264, 269, 270, 278, 281,
287, 293, 298]), np.array([111, 127, 143, 156, 159, 161, 181, 182, 183, 194, 196, 198, 204,
205, 210, 212, 235, 239, 267, 268, 297]), np.array([107, 116, 120, 128, 130, 136, 137, 144, 152, 155, 157, 166, 169,
170, 184, 186, 192, 218, 220, 226, 228, 241, 245, 246, 247, 251,
252, 253]), np.array([ 99, 114, 118, 121, 131, 134, 158, 216, 219, 221, 224, 231, 233,
234, 243, 244]), np.array([ 34, 37, 38, 48, 56, 78, 84, 100, 108, 117, 122, 123, 132,
149, 151, 153, 163, 178, 180, 191, 199, 202, 208, 211]), np.array([ 31, 40, 41, 45, 51, 53, 57, 60, 61, 66, 67, 69, 71,
75, 85, 90, 95, 96, 167, 173, 174, 176, 188, 190, 197, 206]), np.array([ 0, 1, 2, 3, 6, 11, 12, 13, 17, 25, 33, 36, 47,
58, 64, 76, 87, 94, 160, 165, 172, 175, 187, 193, 201, 203]), np.array([ 4, 16, 18, 19, 109, 113, 115, 124, 138, 142, 145, 150]), np.array([103, 105, 106, 112, 125, 135, 139, 140, 141, 146, 147, 154]), np.array([102, 104, 110, 119, 126, 129, 133, 148]), np.array([29, 32, 42, 43, 55, 63, 72, 77, 79, 83, 91, 92]), np.array([35, 49, 59, 73, 74, 81, 86, 88, 89, 97, 98]), np.array([30, 39, 44, 46, 50, 52, 54, 62, 65, 68, 80, 82, 93]), np.array([ 8, 10, 15, 27, 70]), np.array([ 5, 7, 9, 14, 20, 21, 22, 23, 24, 26, 28])]
func(di, rang_indx)
Recommended answer
Approach #1: Here is one idea that keeps the work done inside the loop to a minimum and uses a single loop -
- Create a 2D random array in the interval [0, 1) with one row per sub-array and as many columns as the longest sub-array.
- For each row, set the invalid positions (those beyond the length of the corresponding sub-array) to 1.0. Get the argsort of each row. The 1s at the invalid positions stay at the back, because the original random array contains no 1s. This gives the array of indices.
- Slice each row of that index array down to the length listed in di.
- Loop over the rows and use the sliced indices to index into each sub-array of rang_indx.
Hence, the implementation -
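# Length of every sub-array, and the number of elements to keep from each
# (never more than the sub-array holds).
lens = np.array([len(i) for i in rang_indx])
di0 = np.minimum(lens, di.astype(int))
# True at the padded positions beyond the end of each sub-array.
invalid_mask = lens[:,None] <= np.arange(lens.max())
rand_nums = np.random.rand(len(lens), lens.max())
rand_nums[invalid_mask] = 1
# Row-wise argsort of the random keys gives a random permutation of the valid
# indices; the padded 1s sort to the back.
shuffled_indx = np.argsort(rand_nums, axis=1)
out = []
for i,all_idx in enumerate(shuffled_indx):
    if lens[i]==0:
        out.append(np.array([]))
    else:
        slice_idx = all_idx[:di0[i]]
        out.append(rang_indx[i][slice_idx])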
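
For reuse, the steps of Approach #1 could be wrapped in a function and sanity-checked on a small input. This is a sketch only: the name sample_padded_argsort and the toy inputs are assumptions for illustration, and it uses a row-wise argsort as described in the steps listed for this approach.

```python
import numpy as np

def sample_padded_argsort(di, rang_indx):
    """Hypothetical wrapper around the padded-argsort idea (Approach #1)."""
    lens = np.array([len(a) for a in rang_indx])
    # Never request more elements than a sub-array holds.
    counts = np.minimum(lens, np.asarray(di).astype(int))
    width = lens.max()
    # True where a column index falls beyond the end of the sub-array.
    invalid = lens[:, None] <= np.arange(width)
    keys = np.random.rand(len(lens), width)
    keys[invalid] = 1.0  # rand() is always < 1.0, so padding sorts last
    order = np.argsort(keys, axis=1)
    return [np.asarray(a)[order[i, :counts[i]]] for i, a in enumerate(rang_indx)]

# Toy inputs (made up for illustration).
toy_di = np.array([2., 0., 5.])
toy_rang = [np.array([1, 2, 3]), np.array([10, 20]), np.array([7])]
picked = sample_padded_argsort(toy_di, toy_rang)
```

Each entry of picked holds distinct elements of the matching sub-array, and a request larger than the sub-array (5 from a single element here) is capped at its length.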
Approach #2: Another way, which does most of the setup work inside the loop in an efficient manner -
lens = np.array([len(i) for i in rang_indx])
di0 = np.minimum(lens, di.astype(int))
out = []
for i in range(len(lens)):
    if lens[i]==0:
        out.append(np.array([]))
    else:
        k = di0[i]
        # Indices of the k smallest random numbers form a random k-subset.
        slice_idx = np.argpartition(np.random.rand(lens[i]), k-1)[:k]
        out.append(rang_indx[i][slice_idx])
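
A rough way to check that Approach #2 picks every element equally often is to count selections over many calls. The wrapper name sample_argpartition, the seed, and the toy input below are assumptions for illustration, not part of the original answer.

```python
import numpy as np

def sample_argpartition(di, rang_indx):
    """Hypothetical wrapper around the per-sub-array idea (Approach #2)."""
    out = []
    for arr, k in zip(rang_indx, np.asarray(di).astype(int)):
        n = len(arr)
        k = min(k, n)  # cannot pick more than the sub-array holds
        if k == 0:
            out.append(arr[:0])  # nothing to pick
            continue
        # Indices of the k smallest random keys: a random k-subset of range(n).
        idx = np.argpartition(np.random.rand(n), k - 1)[:k]
        out.append(arr[idx])
    return out

np.random.seed(1)  # fixed seed only to make the sketch repeatable
toy = [np.arange(5)]
counts = np.zeros(5, dtype=int)
for _ in range(2000):
    # Pick 2 of the 5 elements and tally which ones were chosen.
    counts[sample_argpartition([2], toy)[0]] += 1
```

With 2 of 5 elements drawn per call, each element should be chosen in about 40% of the calls, so every entry of counts should land near 800. Note that only the selected subset is random here; the order of elements within each result is whatever argpartition happens to produce.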