Python: counting elements in an iterable with a filter

Question

To count the elements in a list, you can use collections.Counter, but what if only some of the elements have to be counted?

I've set up this example (please note: NumPy is used only for convenience. In general the list will contain arbitrary Python objects):

import numpy as np
from collections import Counter, defaultdict  # used in the cells below

num_samples = 10000000
num_unique = 1000
numbers = np.random.randint(0, num_unique, num_samples)

I would like to count how often a number occurs in this list, but I'm only interested in numbers <= 10.

This is the baseline to beat. The Counter just counts everything, which should produce some overhead.

%%time
counter = Counter(numbers)

CPU times: user 1.38 s, sys: 7.49 ms, total: 1.39 s
Wall time: 1.39 s

Counter does not seem to offer a way to filter the iterable while counting it. The following code works, but it is poor style: it passes over the data twice (once to build the filtered list, once to count it) instead of using a single loop:

%%time
# Use a separate name so the full `numbers` array stays available for the later cells.
filtered = [number for number in numbers if number <= 10]
counter = Counter(filtered)

CPU times: user 1.3 s, sys: 22.1 ms, total: 1.32 s
Wall time: 1.33 s

That speedup is basically negligible. Let's try a single loop:

%%time

counter = defaultdict(int)
for number in numbers:
    if number > 10:
        continue
    counter[number] += 1

CPU times: user 1.99 s, sys: 11.5 ms, total: 2 s
Wall time: 2.01 s

Well, my single loop is much worse. I assume that Counter benefits from a C-based implementation?

The next thing I tried was replacing the list comprehension with a generator expression. In principle this means the data is looped through only once, with the generator being consumed by the Counter as it goes. The numbers are disappointing, though: it is basically as fast as the vanilla Counter:

%%time
iterator = (number for number in numbers if number <= 10)
counter = Counter(iterator)

CPU times: user 1.38 s, sys: 8.51 ms, total: 1.39 s
Wall time: 1.39 s

At this point I took a step back and re-ran the numbers a few times. The three Counter versions (unfiltered, list comprehension, generator expression) are almost equal in speed. The defaultdict version is consistently much slower.
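
For readers without a Jupyter session, the comparison described above can be reproduced roughly with the standard `timeit` module. This is only a sketch, not the original measurement: the smaller sample size, the fixed seed, and the helper names (`count_all`, `count_list`, `count_gen`, `count_loop`) are assumptions made for illustration, and absolute timings will differ from the figures quoted above.

```python
import timeit
from collections import Counter, defaultdict

import numpy as np

# Assumption: a much smaller sample than the original 10,000,000,
# so the sketch finishes quickly.
rng = np.random.default_rng(0)
numbers = rng.integers(0, 1000, 200_000)
LIMIT = 10


def count_all():
    # Baseline: count everything, no filtering.
    return Counter(numbers)


def count_list():
    # Filter into a list first, then count (two passes).
    return Counter([n for n in numbers if n <= LIMIT])


def count_gen():
    # Filter lazily with a generator expression (one pass).
    return Counter(n for n in numbers if n <= LIMIT)


def count_loop():
    # Explicit Python loop with a defaultdict.
    counter = defaultdict(int)
    for n in numbers:
        if n <= LIMIT:
            counter[n] += 1
    return counter


# The three filtered variants should agree before their timings are compared.
assert count_list() == count_gen() == Counter(count_loop())

timings = {}
for func in (count_all, count_list, count_gen, count_loop):
    # Best of three single runs, in seconds.
    timings[func.__name__] = min(timeit.repeat(func, number=1, repeat=3))
    print(f"{func.__name__}: {timings[func.__name__]:.3f} s")
```

Taking the minimum of several repeats is a common way to reduce noise from other processes; it does not change the relative ordering question being asked here.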

How can I efficiently count elements in a Python list while filtering the elements at the same time?

Recommended answer

If this is about large NumPy arrays, you are better off taking advantage of vectorized NumPy operations.

%%time
np.unique(numbers[numbers <= 10], return_counts=True)

Output:

Wall time: 31.2 ms

(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
 array([10055, 10090,  9941, 10002,  9994,  9989, 10070,  9859, 10038,
        10028,  9965], dtype=int64))
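
As a small, self-contained illustration of the same idea (a sketch with a made-up array rather than the 10-million-element one above): the boolean mask `numbers <= 10` selects the elements of interest, and `np.unique(..., return_counts=True)` returns the distinct values together with their counts. The conversion to a plain dict and the cross-check against `Counter` are additions for convenience and are not part of the timed cell.

```python
from collections import Counter

import numpy as np

# Hypothetical small input, chosen so the result is easy to check by hand.
numbers = np.array([3, 42, 7, 3, 10, 500, 7, 3, 11, 0])

# The boolean mask keeps only the values <= 10; np.unique then counts them.
values, counts = np.unique(numbers[numbers <= 10], return_counts=True)

# Optional: turn the two arrays into a plain {value: count} mapping.
filtered_counts = dict(zip(values.tolist(), counts.tolist()))
print(filtered_counts)  # {0: 1, 3: 3, 7: 2, 10: 1}

# Cross-check against the pure-Python Counter approach from the question.
assert filtered_counts == dict(Counter(n for n in numbers.tolist() if n <= 10))
```

Note that this relies on the data being a NumPy array of sortable values; for a list of arbitrary Python objects, the Counter-based variants from the question remain the applicable ones.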

For comparison, my own timing of your code gave slightly higher times than yours.
