Better way to shuffle two numpy arrays in unison

Question

I have two numpy arrays of different shapes, but with the same length (leading dimension). I want to shuffle each of them, such that corresponding elements continue to correspond -- i.e. shuffle them in unison with respect to their leading indices.

This code works, and illustrates my goals:

import numpy

def shuffle_in_unison(a, b):
    assert len(a) == len(b)
    # Allocate uninitialized outputs with the same shape and dtype as the inputs.
    shuffled_a = numpy.empty(a.shape, dtype=a.dtype)
    shuffled_b = numpy.empty(b.shape, dtype=b.dtype)
    permutation = numpy.random.permutation(len(a))
    # Move row old_index of both arrays to the same new position.
    for old_index, new_index in enumerate(permutation):
        shuffled_a[new_index] = a[old_index]
        shuffled_b[new_index] = b[old_index]
    return shuffled_a, shuffled_b

For example:

>>> a = numpy.asarray([[1, 1], [2, 2], [3, 3]])
>>> b = numpy.asarray([1, 2, 3])
>>> shuffle_in_unison(a, b)
(array([[2, 2],
       [1, 1],
       [3, 3]]), array([2, 1, 3]))
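
As a point of comparison, the explicit loop can be replaced by indexing both arrays with one permutation. This is only a sketch: the name shuffle_in_unison_indexed is made up for illustration, it gathers rows (a[permutation]) rather than scattering them as the loop does, and it still returns copies, so it is shorter but does not address the memory concern.

```python
import numpy

def shuffle_in_unison_indexed(a, b):
    """Return copies of a and b, reordered identically along axis 0."""
    assert len(a) == len(b)
    permutation = numpy.random.permutation(len(a))
    # Integer-array indexing applies the same row order to both arrays.
    return a[permutation], b[permutation]

a = numpy.asarray([[1, 1], [2, 2], [3, 3]])
b = numpy.asarray([1, 2, 3])
shuffled_a, shuffled_b = shuffle_in_unison_indexed(a, b)
```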

However, this feels clunky, inefficient, and slow, and it requires making a copy of the arrays -- I'd rather shuffle them in-place, since they'll be quite large.

Is there a better way to go about this? Faster execution and lower memory usage are my primary goals, but elegant code would be nice, too.

One other thought I had was this:

def shuffle_in_unison_scary(a, b):
    rng_state = numpy.random.get_state()
    numpy.random.shuffle(a)
    numpy.random.set_state(rng_state)
    numpy.random.shuffle(b)

This works... but it's a little scary, as I see little guarantee it'll continue to work -- it doesn't look like the sort of thing that's guaranteed to survive across numpy versions, for example.

Solution

Your "scary" solution does not appear scary to me. Calling shuffle() for two sequences of the same length results in the same number of calls to the random number generator, and these are the only "random" elements in the shuffle algorithm. By resetting the state, you ensure that the calls to the random number generator will give the same results in the second call to shuffle(), so the whole algorithm will generate the same permutation.
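
A small check of this reasoning, as a sketch: the test arrays below are made up so that row i of a and element i of b both carry the value i, which makes it easy to see whether the pairing survives the shuffle.

```python
import numpy

def shuffle_in_unison_scary(a, b):
    # Save the global RNG state, shuffle a, then restore the state so that
    # shuffling b consumes exactly the same random numbers.
    rng_state = numpy.random.get_state()
    numpy.random.shuffle(a)
    numpy.random.set_state(rng_state)
    numpy.random.shuffle(b)

# Illustrative data: a is [[0, 0], [1, 1], ...], b is [0, 1, ...].
a = numpy.repeat(numpy.arange(5), 2).reshape(5, 2)
b = numpy.arange(5)

shuffle_in_unison_scary(a, b)

# Both arrays were permuted in place by the same permutation.
still_paired = bool((a[:, 0] == b).all())
```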

If you don't like this, a different solution would be to store your data in one array instead of two right from the beginning, and create two views into this single array simulating the two arrays you have now. You can use the single array for shuffling and the views for all other purposes.

Example: Let's assume the arrays a and b look like this:

a = numpy.array([[[  0.,   1.,   2.],
                  [  3.,   4.,   5.]],

                 [[  6.,   7.,   8.],
                  [  9.,  10.,  11.]],

                 [[ 12.,  13.,  14.],
                  [ 15.,  16.,  17.]]])

b = numpy.array([[ 0.,  1.],
                 [ 2.,  3.],
                 [ 4.,  5.]])

We can now construct a single array containing all the data:

c = numpy.c_[a.reshape(len(a), -1), b.reshape(len(b), -1)]
# array([[  0.,   1.,   2.,   3.,   4.,   5.,   0.,   1.],
#        [  6.,   7.,   8.,   9.,  10.,  11.,   2.,   3.],
#        [ 12.,  13.,  14.,  15.,  16.,  17.,   4.,   5.]])

Now we create views simulating the original a and b:

a2 = c[:, :a.size//len(a)].reshape(a.shape)
b2 = c[:, a.size//len(a):].reshape(b.shape)

The data of a2 and b2 is shared with c. To shuffle both arrays simultaneously, use numpy.random.shuffle(c).
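
Putting the steps above together, a minimal end-to-end sketch follows. It uses numpy.arange to build the same values as the example arrays, and numpy.shares_memory to confirm that a2 and b2 really are views into c; the variable names split, shares_a, shares_b and rows_match are introduced here only for illustration.

```python
import numpy

# Same values as the example arrays a and b above.
a = numpy.arange(18.0).reshape(3, 2, 3)
b = numpy.arange(6.0).reshape(3, 2)

# One combined array: each row holds the flattened a[i] followed by b[i].
c = numpy.c_[a.reshape(len(a), -1), b.reshape(len(b), -1)]

split = a.size // len(a)  # number of columns that belong to a
a2 = c[:, :split].reshape(a.shape)
b2 = c[:, split:].reshape(b.shape)

# Both views should share their data with c.
shares_a = numpy.shares_memory(a2, c)
shares_b = numpy.shares_memory(b2, c)

# Shuffling the rows of c in place reorders a2 and b2 together.
numpy.random.shuffle(c)

# Row i of a started with 6*i and row i of b with 2*i, so after the
# shuffle the two leading values must still identify the same original row.
rows_match = numpy.array_equal(a2[:, 0, 0] / 6, b2[:, 0] / 2)
```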

In production code, you would of course avoid creating the original a and b at all, and create c, a2 and b2 right away.

This solution could be adapted to the case that a and b have different dtypes.
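
One possible adaptation, offered as an assumption rather than as the approach the answer has in mind, is a one-dimensional structured array whose fields hold one row of a and one row of b each. The field names "a" and "b", the dtypes and the shapes below are all chosen just for this sketch.

```python
import numpy

n = 3
# Each record stores a (2, 3) float64 block and a (2,) int64 block.
record = numpy.dtype([("a", numpy.float64, (2, 3)),
                      ("b", numpy.int64, (2,))])
c = numpy.zeros(n, dtype=record)

# Field access returns views into c with the per-field dtype.
a2 = c["a"]  # shape (3, 2, 3), float64
b2 = c["b"]  # shape (3, 2), int64

# Fill the views; row i of a2 starts with 6*i, row i of b2 with 2*i.
a2[...] = numpy.arange(18.0).reshape(3, 2, 3)
b2[...] = numpy.arange(6).reshape(3, 2)

# Shuffling the records in place moves both fields of a record together.
numpy.random.shuffle(c)

rows_match = numpy.array_equal(a2[:, 0, 0] / 6, b2[:, 0] / 2)
```

The trade-off is that the data is interleaved record by record in memory, so a2 and b2 are not contiguous arrays.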
