Why is collections.Counter so slow?


Question


I'm trying to solve a Rosalind basic problem of counting nucleotides in a given sequence, and returning the results in a list. For those ones not familiar with bioinformatics it's just counting the number of occurrences of 4 different characters ('A','C','G','T') inside a string.


I expected collections.Counter to be the fastest method (first because they claim to be high-performance, and second because I saw a lot of people using it for this specific problem).

But to my surprise, this method turned out to be the slowest!


I compared three different methods, using timeit and running two types of experiments:

  • Running a long sequence a few times
  • Running a short sequence many times.

Here is my code:

import timeit
from collections import Counter

# Method1: using count
def method1(seq):
    return [seq.count('A'), seq.count('C'), seq.count('G'), seq.count('T')]

# method 2: using a loop
def method2(seq):
    r = [0, 0, 0, 0]
    for i in seq:
        if i == 'A':
            r[0] += 1
        elif i == 'C':
            r[1] += 1
        elif i == 'G':
            r[2] += 1
        else:
            r[3] += 1
    return r

# method 3: using Collections.counter
def method3(seq):
    counter = Counter(seq)
    return [counter['A'], counter['C'], counter['G'], counter['T']]


if __name__ == '__main__':

    # Long dummy sequence
    long_seq = 'ACAGCATGCA' * 10000000
    # Short dummy sequence
    short_seq = 'ACAGCATGCA' * 1000

    # Test 1: Running a long sequence once
    print(timeit.timeit("method1(long_seq)", setup='from __main__ import method1, long_seq', number=1))
    print(timeit.timeit("method2(long_seq)", setup='from __main__ import method2, long_seq', number=1))
    print(timeit.timeit("method3(long_seq)", setup='from __main__ import method3, long_seq', number=1))

    # Test2: Running a short sequence lots of times
    print(timeit.timeit("method1(short_seq)", setup='from __main__ import method1, short_seq', number=10000))
    print(timeit.timeit("method2(short_seq)", setup='from __main__ import method2, short_seq', number=10000))
    print(timeit.timeit("method3(short_seq)", setup='from __main__ import method3, short_seq', number=10000))

The results:

Test1: 
Method1: 0.224009990692
Method2: 13.7929501534
Method3: 18.9483819008

Test2:
Method1: 0.224207878113
Method2: 13.8520510197
Method3: 18.9861831665


Method 1 is way faster than methods 2 and 3 in both experiments!

So I have a series of questions:


  • Am I doing something wrong, or is it indeed slower than the other two approaches? Could someone run the same code and share the results?


  • In case my results are correct (and maybe this should be another question), is there a faster method to solve this problem than method 1?


  • If count is faster, then what's the deal with collections.Counter?

Answer


It's not that collections.Counter is slow; it's actually quite fast. But it's a general-purpose tool, and counting characters is just one of many applications.
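As a quick illustration of that generality (a minimal sketch added here, not from the original answer), Counter accepts any iterable of hashable items, not just strings:

```python
from collections import Counter

# Counter works on any iterable of hashable items
char_counts = Counter("ACAGCATGCA")                      # characters
word_counts = Counter("the cat sat on the mat".split())  # words
num_counts = Counter([1, 1, 2, 3, 3, 3])                 # numbers

print(char_counts["A"])    # 4
print(word_counts["the"])  # 2
print(num_counts[3])       # 3
```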


On the other hand, str.count just counts characters in strings and is heavily optimized for its one and only task.


That means that str.count can work on the underlying C char array while avoiding the creation of new (or lookup of existing) length-1 Python strings during the iteration (which is what the for loop and Counter do).


Just to add some more context to this statement.


A string is stored as a C array wrapped in a Python object. str.count knows that the string is a contiguous array, so it converts the character you want to count to a C "character", then iterates over the array in native C code checking for equality, and finally wraps and returns the number of occurrences found.


On the other hand, the for loop and Counter use the Python iteration protocol. Each character of your string is wrapped as a Python object and then (hashed and) compared within Python.
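This can be seen directly (a small check added for illustration): iterating over a string yields full length-1 str objects, each carrying object overhead far beyond the single byte the C array stores for it:

```python
import sys

seq = "ACGT"

# Each iteration step produces a new length-1 Python string object
chars = list(seq)
print(chars)                                   # ['A', 'C', 'G', 'T']
print(all(isinstance(c, str) for c in chars))  # True

# A one-character string is a full Python object, not a single byte
print(sys.getsizeof('A'))  # dozens of bytes on CPython (exact size varies)
```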

So the slowdown occurs because:

  • Each character has to be converted to a Python object (this is the major reason for the performance loss)
  • The loop is done in Python (not applicable to Counter in python 3.x because it was rewritten in C)
  • Each comparison has to be done in Python (instead of just comparing numbers in C - characters are represented by numbers)
  • The counter needs to hash the values and your loop needs to index your list.
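Those points can be checked with a scaled-down version of the benchmark (a sketch in Python 3 syntax, with sizes reduced so it runs quickly; absolute numbers are machine-dependent):

```python
import timeit
from collections import Counter

seq = 'ACAGCATGCA' * 10_000  # 100,000 characters

# One pass per character with str.count (4 passes, all in C)
t_count = timeit.timeit(
    lambda: [seq.count(c) for c in 'ACGT'], number=100)

# A single pass with Counter (per-character Python objects involved)
t_counter = timeit.timeit(
    lambda: Counter(seq), number=100)

print(f"str.count: {t_count:.4f}s  Counter: {t_counter:.4f}s")
```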


Note that the reason for the slowdown is similar to the one discussed in the question Why are Python's arrays slow?.


I did some additional benchmarks to find out at which point collections.Counter is to be preferred over str.count. To this end I created random strings containing differing numbers of unique characters and plotted the performance:

from collections import Counter
import random
import string

characters = string.printable  # 100 different printable characters

results_counter = []
results_count = []
nchars = []

# Note: %timeit -o is an IPython/Jupyter magic that returns the timing result
for i in range(1, 110, 10):
    chars = characters[:i]
    s = ''.join(random.choice(chars) for _ in range(10000))  # named s to avoid shadowing the string module
    res1 = %timeit -o Counter(s)
    res2 = %timeit -o {char: s.count(char) for char in chars}
    nchars.append(len(chars))
    results_counter.append(res1)
    results_count.append(res2)

And plotted the results with:

import matplotlib.pyplot as plt

plt.figure()

plt.plot(nchars, [i.best * 1000 for i in results_counter], label="Counter",   c='black')
plt.plot(nchars, [i.best * 1000 for i in results_count],   label="str.count", c='red')
plt.xlabel('number of different characters')
plt.ylabel('time to count the chars in a string of length 10000 [ms]')
plt.legend()
plt.show()


Results for Python 3.5

The results for Python 3.6 are very similar so I didn't list them explicitly.


So if you want to count 80 different characters, Counter becomes faster or at least comparable, because it traverses the string only once rather than multiple times like str.count. This is only weakly dependent on the length of the string (testing showed a very small difference of +/- 2%).
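The trade-off above comes down to passes over the data: Counter makes one pass regardless of alphabet size, while the str.count approach makes one full pass per distinct character. A small consistency check (illustrative, not from the original answer) confirming that both approaches produce identical counts:

```python
from collections import Counter
import random
import string

random.seed(0)  # fixed seed for reproducibility
chars = string.printable  # 100 distinct characters
s = ''.join(random.choice(chars) for _ in range(10000))

counter_counts = Counter(s)                    # one pass over s
count_counts = {c: s.count(c) for c in chars}  # 100 passes over s

# Both approaches agree for every character in the alphabet
print(all(counter_counts[c] == count_counts[c] for c in chars))  # True
```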


In Python 2.7, collections.Counter is implemented in Python (instead of C) and is much slower. There, the break-even point for str.count and Counter can only be estimated by extrapolation, because even with 100 different characters str.count is still about 6 times faster.
