Why is random.sample faster than numpy's random.choice?


Problem description

I need a way to sample without replacement from a certain array a. I tried two approaches (see MCVE below), using random.sample() and np.random.choice.

I assumed the numpy function would be faster, but it turns out it is not. In my tests random.sample is ~15% faster than np.random.choice.

Is this correct, or am I doing something wrong in my example below? If this is correct, why?

import numpy as np
import random
import time
from contextlib import contextmanager


@contextmanager
def timeblock(label):
    start = time.perf_counter()    # time.clock() was removed in Python 3.8
    try:
        yield
    finally:
        end = time.perf_counter()
        print('{} elapsed: {}'.format(label, end - start))


def f1(a, n_sample):
    return random.sample(range(len(a)), n_sample)


def f2(a, n_sample):
    return np.random.choice(len(a), n_sample, replace=False)


# Generate random array
a = np.random.uniform(1., 100., 10000)
# Number of samples' indexes to randomly take from a
n_sample = 100
# Number of times to repeat functions f1 and f2
N = 100000

with timeblock("random.sample"):
    for _ in range(N):
        f1(a, n_sample)

with timeblock("np.random.choice"):
    for _ in range(N):
        f2(a, n_sample)

Answer

TL;DR: since numpy v1.17.0 it is recommended to use a numpy.random.default_rng() object instead of the module-level numpy.random functions. For choice:

import numpy as np

rng = np.random.default_rng()    # you can pass a seed here
rng.choice(...)    # the interface is the same
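
Applied to the setup from the question, a rough equivalent of f2 with the new API could look like this (the array and sizes are simply the ones used in the question's example):

import numpy as np

a = np.random.uniform(1., 100., 10000)             # same test array as in the question
rng = np.random.default_rng()                      # optionally default_rng(some_seed)
idx = rng.choice(len(a), size=100, replace=False)  # 100 distinct indices into a
sample = a[idx]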

Besides the other changes to the random API introduced in v1.17, this new version of choice is much smarter and should be the fastest in most cases. The old version remains unchanged for backward compatibility!

As mentioned in the comments, there was a long-standing issue in numpy about the np.random.choice implementation being inefficient for k << n compared to random.sample from the Python standard library.

The problem was that np.random.choice(arr, size=k, replace=False) was implemented as permutation(arr)[:k]. With a large array and a small k, computing the full permutation of the array wastes time and memory. The standard library's random.sample works in a more direct way: it samples iteratively without replacement, keeping track either of what has already been sampled or of what is left to sample from.
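
To make the difference concrete, here is a rough pure-Python sketch of the two strategies (illustrative only, not the actual numpy or CPython implementations): the permutation approach pays O(n) time and memory regardless of k, while the tracking approach only does work proportional to k when k << n.

import random
import numpy as np

def sample_via_permutation(n, k):
    # old np.random.choice(..., replace=False) strategy: permute all n indices, keep k
    return np.random.permutation(n)[:k]

def sample_via_tracking(n, k):
    # random.sample-style idea for k << n: keep drawing until k distinct hits
    seen = set()
    while len(seen) < k:
        seen.add(random.randrange(n))
    return list(seen)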

In v1.17.0 numpy introduced a rework and improvement of the numpy.random package (docs, what's new, performance). I highly recommend taking a look at at least the first link. Note that, as stated there, the old numpy.random API is kept unchanged for backward compatibility - it continues to use the old implementations.

So the new recommended way to use the random API is the numpy.random.default_rng() object instead of numpy.random. Note that it is an object, and it accepts an optional seed argument, so you can pass it around conveniently. It also uses a different generator by default, which is faster on average (see the performance link above for details).
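
As a small illustration of that (the sample_indices helper here is just an example, not part of numpy), you create one Generator, optionally seeded, and pass it around instead of calling module-level numpy.random functions:

import numpy as np

def sample_indices(rng, n, k):
    # reuse the Generator that is passed in rather than creating a new one per call
    return rng.choice(n, size=k, replace=False)

rng = np.random.default_rng(12345)    # seeded, so the results are reproducible
idx = sample_indices(rng, 10000, 100)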

For your case you probably want to use np.random.default_rng().choice(...) now. Besides being faster thanks to the improved random generator, choice itself became smarter: it now computes a full array permutation only when the array is sufficiently large (>10000 elements) and k is relatively large (>1/50 of the size). Otherwise it uses Floyd's sampling algorithm (short description, numpy implementation).
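
For reference, a compact pure-Python sketch of the idea behind Floyd's sampling algorithm (not numpy's actual implementation; floyd_sample is just an illustrative name) looks like this. It draws exactly k distinct values from range(n) in k iterations, without ever building a permutation:

import random

def floyd_sample(n, k):
    # Floyd's algorithm: k iterations, no O(n) shuffle or copy of the population
    selected = set()
    for j in range(n - k, n):
        t = random.randint(0, j)              # uniform over 0..j inclusive
        selected.add(j if t in selected else t)
    return selected                           # unordered set of k distinct values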

Here's the performance comparison on my laptop:

100 samples from an array of 10000 elements, 10000 times:

random.sample elapsed: 0.8711776689742692
np.random.choice elapsed: 1.9704092079773545
np.random.default_rng().choice elapsed: 0.818919860990718

1000 samples from an array of 10000 elements, 10000 times:

random.sample elapsed: 8.785315042012371
np.random.choice elapsed: 1.9777243090211414
np.random.default_rng().choice elapsed: 1.05490942299366

10000 samples from an array of 10000 elements, 10000 times:

random.sample elapsed: 80.15063399000792
np.random.choice elapsed: 2.0218082449864596
np.random.default_rng().choice elapsed: 2.8596064270241186

And the code I used:

import numpy as np
import random
from timeit import default_timer as timer
from contextlib import contextmanager


@contextmanager
def timeblock(label):
    start = timer()
    try:
        yield
    finally:
        end = timer()
        print('{} elapsed: {}'.format(label, end - start))


def f1(a, n_sample):
    return random.sample(range(len(a)), n_sample)


def f2(a, n_sample):
    return np.random.choice(len(a), n_sample, replace=False)


def f3(a, n_sample):
    # note: a fresh Generator is created on every call, so its construction
    # cost is included in the timing
    return np.random.default_rng().choice(len(a), n_sample, replace=False)


# Generate random array
a = np.random.uniform(1., 100., 10000)
# Number of samples' indexes to randomly take from a
n_sample = 100
# Number of times to repeat tested functions
N = 100000

print(f'{N} times {n_sample} samples')
with timeblock("random.sample"):
    for _ in range(N):
        f1(a, n_sample)

with timeblock("np.random.choice"):
    for _ in range(N):
        f2(a, n_sample)

with timeblock("np.random.default_rng().choice"):
    for _ in range(N):
        f3(a, n_sample)

