Speed up sub-array shuffling and storing


Question

I have a list of integers (di) and another list (rang_indx) made up of NumPy sub-arrays of integers (code below). For each sub-array, I need to store in a separate list (indx) a number of randomly chosen elements, where that number is given by the corresponding entry of di.

As far as I can see, np.random.shuffle() applied to rang_indx does not shuffle the elements within each sub-array; it shuffles the order of the sub-arrays themselves, which is not what I need. Hence I use one for loop to shuffle each sub-array in place, and then a second one (combined with zip()) to build the indx list.
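The difference between shuffling the outer list and shuffling each sub-array can be seen on a small example. This is only an illustrative sketch: the `toy` list and the fixed seed are assumptions made for the demo, not part of the original code.

```python
import numpy as np

np.random.seed(0)  # fixed seed only so the demo is repeatable

# Toy stand-in for rang_indx: a list of sub-arrays of different lengths.
toy = [np.array([1, 2, 3]), np.array([10, 20]), np.array([100, 200, 300, 400])]
before = [a.copy() for a in toy]

# Shuffling the outer list reorders the sub-arrays but leaves the order of
# the elements inside each one untouched.
np.random.shuffle(toy)
contents_untouched = all(
    any(np.array_equal(a, b) for b in before) for a in toy
)

# Shuffling each sub-array in place is what the question actually needs.
for sub in toy:
    np.random.shuffle(sub)

# Each sub-array still holds the same elements, possibly in a new order.
same_elements = sorted(sorted(a.tolist()) for a in toy) == sorted(
    sorted(b.tolist()) for b in before
)
```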

This function is called millions of times as part of a larger program. Is there a way to speed it up?

import numpy as np


def func(di, rang_indx):
    # Shuffle each sub-array in place.
    for _ in rang_indx:
        np.random.shuffle(_)

    # For each shuffled sub-array, only keep as many elements as those
    # indicated by the 'di' array.
    indx = [_[:i] for (_, i) in zip(*[rang_indx, di.astype(int)])]

    return indx


# This data is not fixed, and will change with each call to func()
di = np.array([ 4.,  2.,   0.,   600.,  12.,  22.,  13.,  21.,  25.,  25.,  12.,  11.,
         7.,  12.,  10.,  13.,   5.,  10.])
rang_indx = [np.array([]), np.array([189, 195, 209, 214, 236, 237, 255, 286, 290, 296, 301, 304, 321,
       323, 327, 329]), np.array([164, 171, 207, 217, 225, 240, 250, 263, 272, 279, 284, 285, 289]), np.array([101, 162, 168, 177, 179, 185, 258, 261, 264, 269, 270, 278, 281,
       287, 293, 298]), np.array([111, 127, 143, 156, 159, 161, 181, 182, 183, 194, 196, 198, 204,
       205, 210, 212, 235, 239, 267, 268, 297]), np.array([107, 116, 120, 128, 130, 136, 137, 144, 152, 155, 157, 166, 169,
       170, 184, 186, 192, 218, 220, 226, 228, 241, 245, 246, 247, 251,
       252, 253]), np.array([ 99, 114, 118, 121, 131, 134, 158, 216, 219, 221, 224, 231, 233,
       234, 243, 244]), np.array([ 34,  37,  38,  48,  56,  78,  84, 100, 108, 117, 122, 123, 132,
       149, 151, 153, 163, 178, 180, 191, 199, 202, 208, 211]), np.array([ 31,  40,  41,  45,  51,  53,  57,  60,  61,  66,  67,  69,  71,
        75,  85,  90,  95,  96, 167, 173, 174, 176, 188, 190, 197, 206]), np.array([  0,   1,   2,   3,   6,  11,  12,  13,  17,  25,  33,  36,  47,
        58,  64,  76,  87,  94, 160, 165, 172, 175, 187, 193, 201, 203]), np.array([  4,  16,  18,  19, 109, 113, 115, 124, 138, 142, 145, 150]), np.array([103, 105, 106, 112, 125, 135, 139, 140, 141, 146, 147, 154]), np.array([102, 104, 110, 119, 126, 129, 133, 148]), np.array([29, 32, 42, 43, 55, 63, 72, 77, 79, 83, 91, 92]), np.array([35, 49, 59, 73, 74, 81, 86, 88, 89, 97, 98]), np.array([30, 39, 44, 46, 50, 52, 54, 62, 65, 68, 80, 82, 93]), np.array([ 8, 10, 15, 27, 70]), np.array([ 5,  7,  9, 14, 20, 21, 22, 23, 24, 26, 28])]

func(di, rang_indx)

Recommended answer

Approach #1: The idea here is to do as little work as possible inside the loop and to use a single loop:

  1. Create a 2D random array with values in the interval [0, 1), with one row per sub-array and as many columns as the longest sub-array.
  2. For each sub-array, set the invalid positions (columns past its length) to 1.0, then get the argsort of each row. The 1s at the invalid positions sort to the back, because the original random array contains no 1s. The result is an array of shuffled indices.
  3. Slice each row of that index array down to the length listed in di.
  4. Loop over the rows and index each sub-array of rang_indx with its sliced indices.
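The four steps above can be sketched on a tiny input. This is a minimal illustration under stated assumptions: `subs`, `counts`, and the seeded `rng` are hypothetical toy stand-ins for rang_indx, di, and the global random state, and a full `np.argsort` is used for step 2 exactly as the list describes.

```python
import numpy as np

rng = np.random.RandomState(0)  # seeded only to make the sketch repeatable

# Hypothetical toy inputs: three ragged sub-arrays and how many picks each gets.
subs = [np.array([10, 11, 12, 13]), np.array([20, 21]), np.array([30, 31, 32])]
counts = np.array([2, 1, 3])

lens = np.array([len(s) for s in subs])
width = lens.max()

# Step 1: one random number per (row, column) slot, all in [0, 1).
rand_nums = rng.rand(len(subs), width)

# Step 2: padding slots get 1.0, so argsort places them after every valid slot.
invalid = np.arange(width) >= lens[:, None]
rand_nums[invalid] = 1.0
order = np.argsort(rand_nums, axis=1)

# Steps 3 and 4: keep the first counts[i] positions and index the sub-array.
picked = [s[order[i, :k]] for i, (s, k) in enumerate(zip(subs, counts))]
```

Because every valid slot holds a value strictly below 1.0, the first `lens[i]` entries of row `i` in `order` are always valid positions for that sub-array, so the indexing in the last line cannot run out of bounds as long as `counts[i] <= lens[i]`.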

Hence, the implementation:

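import numpy as np

# di and rang_indx are the inputs defined in the question.
lens = np.array([len(i) for i in rang_indx])
# Never ask for more elements than a sub-array holds.
di0 = np.minimum(lens, di.astype(int))
# True where a column index falls past the end of that row's sub-array.
invalid_mask = lens[:,None] <= np.arange(lens.max())
rand_nums = np.random.rand(len(lens), lens.max())
rand_nums[invalid_mask] = 1
# Full argsort per row, as in step 2: each row's valid positions come first,
# in uniformly random order.
shuffled_indx = np.argsort(rand_nums, axis=1)

out = []
for i,all_idx in enumerate(shuffled_indx):
    if lens[i]==0:
        out.append(np.array([]))
    else:
        slice_idx = all_idx[:di0[i]]
        out.append(rang_indx[i][slice_idx])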

Approach #2: Another way is to do most of the setup work inside the loop itself, generating only as many random numbers as each sub-array needs:

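lens = np.array([len(i) for i in rang_indx])
# Never ask for more elements than a sub-array holds.
di0 = np.minimum(lens, di.astype(int))
out = []
for i in range(len(lens)):
    if lens[i]==0:
        out.append(np.array([]))
    else:
        k = di0[i]
        # The first k entries of argpartition(..., k-1) index the k smallest
        # random numbers, i.e. a random choice of k positions.
        slice_idx = np.argpartition(np.random.rand(lens[i]), k-1)[:k]
        out.append(rang_indx[i][slice_idx])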
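Since the results are random, a property check is a reasonable way to confirm that a rewrite still behaves like the original func(). The sketch below wraps the per-row argpartition loop in a function and checks its output; `sample_per_subarray`, `check`, and the toy inputs are hypothetical names introduced for illustration, and the duplicate test assumes each sub-array holds distinct values, as in the question's data.

```python
import numpy as np


def sample_per_subarray(di, rang_indx):
    """Hypothetical wrapper around the per-row argpartition loop."""
    out = []
    for sub, want in zip(rang_indx, np.asarray(di).astype(int)):
        k = min(len(sub), want)
        if k == 0:
            out.append(sub[:0])  # nothing to pick from, or nothing requested
            continue
        # Indices of the k smallest random numbers: a random pick of k slots.
        pick = np.argpartition(np.random.rand(len(sub)), k - 1)[:k]
        out.append(sub[pick])
    return out


def check(di, rang_indx, out):
    """Return True if every result is a duplicate-free subset of the right size."""
    for sub, want, got in zip(rang_indx, np.asarray(di).astype(int), out):
        if len(got) != min(len(sub), want):
            return False
        if len(np.unique(got)) != len(got):  # assumes sub has distinct values
            return False
        if not np.isin(got, sub).all():
            return False
    return True


# Toy inputs covering an empty sub-array, a zero count, and an oversized count.
di_toy = np.array([2., 0., 600., 3.])
rang_toy = [np.array([]), np.array([5, 6, 7]),
            np.array([1, 2, 3, 4]), np.array([8, 9, 10, 11, 12])]
result = sample_per_subarray(di_toy, rang_toy)
ok = check(di_toy, rang_toy, result)
```

A check like this says nothing about speed; timing the candidates on realistic inputs is still needed before choosing one.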
