Why is Collections.counter so slow?


Question


I'm trying to solve a basic Rosalind problem: counting the nucleotides in a given sequence and returning the results in a list. For those not familiar with bioinformatics, it's just counting the number of occurrences of 4 different characters ('A', 'C', 'G', 'T') inside a string.

I expected collections.Counter to be the fastest method (first because they claim to be high-performance, and second because I saw a lot of people using it for this specific problem).

But to my surprise this method is the slowest!

I compared three different methods, using timeit and running two types of experiments:

  • Running a long sequence a few times
  • Running a short sequence a lot of times.

Here is my code:

import timeit
from collections import Counter

# Method1: using count
def method1(seq):
    return [seq.count('A'), seq.count('C'), seq.count('G'), seq.count('T')]

# method 2: using a loop
def method2(seq):
    r = [0, 0, 0, 0]
    for i in seq:
        if i == 'A':
            r[0] += 1
        elif i == 'C':
            r[1] += 1
        elif i == 'G':
            r[2] += 1
        else:
            r[3] += 1
    return r

# method 3: using Collections.counter
def method3(seq):
    counter = Counter(seq)
    return [counter['A'], counter['C'], counter['G'], counter['T']]


if __name__ == '__main__':

    # Long dummy sequence
    long_seq = 'ACAGCATGCA' * 10000000
    # Short dummy sequence
    short_seq = 'ACAGCATGCA' * 1000

    # Test 1: Running a long sequence once
    print(timeit.timeit("method1(long_seq)", setup='from __main__ import method1, long_seq', number=1))
    print(timeit.timeit("method2(long_seq)", setup='from __main__ import method2, long_seq', number=1))
    print(timeit.timeit("method3(long_seq)", setup='from __main__ import method3, long_seq', number=1))

    # Test 2: Running a short sequence lots of times
    print(timeit.timeit("method1(short_seq)", setup='from __main__ import method1, short_seq', number=10000))
    print(timeit.timeit("method2(short_seq)", setup='from __main__ import method2, short_seq', number=10000))
    print(timeit.timeit("method3(short_seq)", setup='from __main__ import method3, short_seq', number=10000))

Results:

Test1: 
Method1: 0.224009990692
Method2: 13.7929501534
Method3: 18.9483819008

Test2:
Method1: 0.224207878113
Method2: 13.8520510197
Method3: 18.9861831665

Method 1 is way faster than methods 2 and 3 in both experiments!

So I have a set of questions:

  • Am I doing something wrong, or is Counter indeed slower than the other two approaches? Could someone run the same code and share their results?

  • In case my results are correct (and maybe this should be another question): is there a faster method to solve this problem than method 1?

  • If count is faster, then what's the deal with collections.Counter?

Solution

It's not that collections.Counter is slow; it's actually quite fast. But it's a general-purpose tool, and counting characters is just one of its many applications.

On the other hand, str.count only counts characters in strings, and it's heavily optimized for that one and only task.

That means str.count can work on the underlying C char array, avoiding the creation of new (or lookup of existing) length-1 Python strings during iteration (which is what a for loop and Counter have to do).


Just to add some more context to this statement.

A string is stored as a C array wrapped in a Python object. str.count knows the string is a contiguous array, so it converts the character you want to count to a C "character", iterates over the array in native C code checking for equality, and finally wraps and returns the number of occurrences it found.

A for loop and Counter, on the other hand, use the Python iteration protocol: each character of your string is wrapped as a Python object, which is then (hashed and) compared within Python.
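The per-character wrapping described above can be observed directly. This is a small sketch of my own (not from the original answer); the variable names are mine:

```python
from collections import Counter

seq = 'ACGT' * 3

# Iterating a str goes through the Python iteration protocol: every step
# yields a full length-1 str object, never a raw C char.
chars = list(seq)
assert all(isinstance(c, str) and len(c) == 1 for c in chars)

# Counter must then hash each of those objects to update its dict;
# str.count never materializes them at all.
assert Counter(seq) == Counter({'A': 3, 'C': 3, 'G': 3, 'T': 3})
```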

So the slowdown is because:

  • Each character has to be converted to a Python object (this is the major reason for the performance loss)
  • The loop is done in Python (not applicable to Counter in python 3.x because it was rewritten in C)
  • Each comparison has to be done in Python (instead of just comparing numbers in C - characters are represented by numbers)
  • The counter needs to hash the values and your loop needs to index your list.
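To put rough numbers on those bullet points, here is a minimal timeit comparison (a sketch; the absolute timings are machine-dependent, but the ordering should hold):

```python
import timeit
from collections import Counter

seq = 'ACAGCATGCA' * 10000   # 100,000 characters, only 4 distinct

# Four full passes, each entirely in C over the raw character buffer.
t_count = timeit.timeit(lambda: [seq.count(c) for c in 'ACGT'], number=20)

# One pass, but every character becomes a Python object that gets hashed.
t_counter = timeit.timeit(lambda: Counter(seq), number=20)

print('str.count x4:', t_count)
print('Counter:     ', t_counter)
```

With only 4 distinct characters, the four C-speed passes comfortably beat the single Python-object pass.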

Note that the reason for the slowdown is similar to the one discussed in "Why are Python's arrays slow?".


I did some additional benchmarks to find out at which point collections.Counter is to be preferred over str.count. To this end I created random strings containing differing numbers of unique characters and plotted the performance:

from collections import Counter
import random
import string

characters = string.printable  # 100 different printable characters

results_counter = []
results_count = []
nchars = []

for i in range(1, 110, 10):
    chars = characters[:i]
    test_string = ''.join(random.choice(chars) for _ in range(10000))
    # note: %timeit -o is an IPython magic, so this snippet runs in IPython/Jupyter
    res1 = %timeit -o Counter(test_string)
    res2 = %timeit -o {char: test_string.count(char) for char in chars}
    nchars.append(len(chars))
    results_counter.append(res1)
    results_count.append(res2)

and the results were plotted using matplotlib:

import matplotlib.pyplot as plt

plt.figure()

plt.plot(nchars, [i.best * 1000 for i in results_counter], label="Counter",   c='black')
plt.plot(nchars, [i.best * 1000 for i in results_count],   label="str.count", c='red')
plt.xlabel('number of different characters')
plt.ylabel('time to count the chars in a string of length 10000 [ms]')
plt.legend()

Results for Python 3.5

The results for Python 3.6 are very similar so I didn't list them explicitly.

So if you need to count around 80 or more different characters, Counter becomes comparable or faster, because it traverses the string only once rather than multiple times like str.count. The break-even point depends only weakly on the length of the string (tests showed a very weak difference of +/-2%).
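The trade-off behind that crossover can be written out directly; both snippets below produce the same counts, they just traverse the data a different number of times (a sketch, with names of my own choosing):

```python
import random
import string
from collections import Counter

random.seed(0)
alphabet = string.ascii_letters                      # 52 distinct characters
data = ''.join(random.choice(alphabet) for _ in range(10000))

# str.count style: one full C-speed pass per distinct character (52 passes here)
by_count = {ch: data.count(ch) for ch in alphabet}

# Counter: a single pass, paying the per-object overhead once per character
by_counter = Counter(data)

# Counter[missing] is 0, so this holds even if some letter never appears
assert all(by_counter[ch] == by_count[ch] for ch in alphabet)
```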

Results for Python 2.7

In Python 2.7, collections.Counter was implemented in Python (rather than C) and is much slower. The break-even point between str.count and Counter can only be estimated by extrapolation, because even with 100 different characters str.count is still 6 times faster.

