为什么collections.Counter比''.count要慢得多?” [英] Why is collections.Counter much slower than ''.count?
问题描述
我有一个简单的任务:计算每个字母在一个字符串中出现多少次。我使用了 Counter()
,但是在一个论坛上我看到了使用 dict()
/的信息 Counter()
比每个字母使用 string.count()
慢得多。我以为它只能对字符串进行一次遍历,而 string.count()
解决方案将不得不遍历它四次(在这种情况下)。为什么 Counter()
这么慢?
I have a simple task: To count how many times every letter occurs in a string. I've used a Counter()
for it, but on one forum I saw information that using dict()
/ Counter()
is much slower than using string.count()
for every letter. I thought that it would interate through the string only once, and the string.count()
solution would have to iterate through it four times (in this case). Why is Counter()
so slow?
>>> timeit.timeit('x.count("A");x.count("G");x.count("C");x.count("T")', setup="x='GAAAAAGTCGTAGGGTTCCTTCACTCGAGGAATGCTGCGACAGTAAAGGAGGCCACGTGGTTGAGAGTTCCTAAGCATTCGTATGTACACCCGGACTCGATGCACTCAAACGTGCTTAAGGGTAAAGAAGGTCGAGAGGTATACTGGGGCACTCCCCTTAGAATTATATCTTGGTCAACTACAATATGGATGGAAATTCTAAGCCGAAAACGACCCGCTAGCGGATTGTGTATGTATCACAACGGTTTCGGTTCATACGCAAAATCATCCCATTTCAAGGCCACTCAAGGACATGACGCCGTGCAACTCCGAGGACATCCCTCAGCGATTGATGCAACCTGGTCATCTAATAATCCTTAGAACGGATGTGCCCTCTACTGGGAGAGCCGGCTAGACTGGCATCTCGCGTTGTTCGTACGAGCTCCGGGCGCCCGGGCGGTGTACGTTGATGTACAGCCTAAGAGCTTTCCACCTATGCTACGAACTAATTTCCCGTCCATCGTTCCTCGGACTGAGGTCAAAGTAACCCGGAAGTACATGGATCAGATACACTCACAGTCCCCTTTAATGACTGAGCTGGACGCTATTGATTGCTTTATAAGTGTTATGGTGAACTCGAAGACTTAGCTAGGAATTTCGCTATACCCGGGTAATGAGCTTAATACCTCACAGCATGTACGCTCTGAATATATGTAGCGATGCTAGCGGAACGTAAGCGTGAGCGTTATGCAGGGCTCCGCACCTCGTGGCCACTCGCCCAATGCCCGAGTTTTTGAGCAATGCCATGCCCTCCAGGTGAAGCGTGCTGAATATGTTCCGCCTCCGCACACCTACCCTACGGGCCTTACGCCATAGCTGAGGATACGCGAGTTGGTTAGCGATTACGTCATTCCAGGTGGTCGTTC'", number=10000)
0.07911698750407936
>>> timeit.timeit('Counter(x)', setup="from collections import Counter;x='GAAAAAGTCGTAGGGTTCCTTCACTCGAGGAATGCTGCGACAGTAAAGGAGGCCACGTGGTTGAGAGTTCCTAAGCATTCGTATGTACACCCGGACTCGATGCACTCAAACGTGCTTAAGGGTAAAGAAGGTCGAGAGGTATACTGGGGCACTCCCCTTAGAATTATATCTTGGTCAACTACAATATGGATGGAAATTCTAAGCCGAAAACGACCCGCTAGCGGATTGTGTATGTATCACAACGGTTTCGGTTCATACGCAAAATCATCCCATTTCAAGGCCACTCAAGGACATGACGCCGTGCAACTCCGAGGACATCCCTCAGCGATTGATGCAACCTGGTCATCTAATAATCCTTAGAACGGATGTGCCCTCTACTGGGAGAGCCGGCTAGACTGGCATCTCGCGTTGTTCGTACGAGCTCCGGGCGCCCGGGCGGTGTACGTTGATGTACAGCCTAAGAGCTTTCCACCTATGCTACGAACTAATTTCCCGTCCATCGTTCCTCGGACTGAGGTCAAAGTAACCCGGAAGTACATGGATCAGATACACTCACAGTCCCCTTTAATGACTGAGCTGGACGCTATTGATTGCTTTATAAGTGTTATGGTGAACTCGAAGACTTAGCTAGGAATTTCGCTATACCCGGGTAATGAGCTTAATACCTCACAGCATGTACGCTCTGAATATATGTAGCGATGCTAGCGGAACGTAAGCGTGAGCGTTATGCAGGGCTCCGCACCTCGTGGCCACTCGCCCAATGCCCGAGTTTTTGAGCAATGCCATGCCCTCCAGGTGAAGCGTGCTGAATATGTTCCGCCTCCGCACACCTACCCTACGGGCCTTACGCCATAGCTGAGGATACGCGAGTTGGTTAGCGATTACGTCATTCCAGGTGGTCGTTC'", number=10000)
2.1727447831030844
>>> 2.1727447831030844 / 0.07911698750407936
27.462430656767047
>>>
推荐答案
Counter()
允许您计算任何可哈希对象,而不仅仅是子字符串。两种解决方案的时间均 O(n)
。您的测量结果表明,通过 Counter()
迭代和哈希单个字符的开销大于运行 s.count()
4次。
Counter()
allows you to count any hashable objects, not just substrings. Both solutions are O(n)
-time. Your measurements show that the overhead of iterating and hashing individual characters by Counter()
is greater than running s.count()
4 times.
Counter()
可以使用C帮助器对元素进行计数,但是似乎它不是特殊情况的字符串,并且使用适用于任何其他可迭代项的通用算法,即处理单个字符涉及多个Python C API调用,以推进迭代器,获取先前的值(在哈希表中进行查找),递增计数器,设置新值(在哈希表中进行查找):
Counter()
can use C helper to count elements but it seems it doesn't special case strings and uses general algorithm applicable for any other iterable i.e., processing a single character involves multiple Python C API calls to advance the iterator, get previous value (a lookup in the hash table), increment counter, set new value (a lookup in the hash table):
while (1) {
key = PyIter_Next(it);
if (key == NULL)
break;
oldval = PyObject_GetItem(mapping, key);
if (oldval == NULL) {
if (!PyErr_Occurred() || !PyErr_ExceptionMatches(PyExc_KeyError))
break;
PyErr_Clear();
Py_INCREF(one);
newval = one;
} else {
newval = PyNumber_Add(oldval, one);
Py_DECREF(oldval);
if (newval == NULL)
break;
}
if (PyObject_SetItem(mapping, key, newval) == -1)
break;
Py_CLEAR(newval);
Py_DECREF(key);
}
将其与 FASTSEARCH()
字节字符串的开销:
Compare it to FASTSEARCH()
overhead for bytestrings:
for (i = 0; i < n; i++)
if (s[i] == p[0]) {
count++;
if (count == maxcount)
return maxcount;
}
return count;
这篇关于为什么collections.Counter比''.count要慢得多?”的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!