Is there a way to make collections.Counter (Python 2.7) aware that its input list is sorted?


Question

I've been playing around with different ways (in Python 2.7) to extract a list of (word, frequency) tuples from a corpus, or list of strings, and comparing their efficiency. As far as I can tell, in the normal case with an unsorted list, the Counter method from the collections module is superior to anything I've come up with or found elsewhere, but it doesn't seem to take much advantage of the benefits of a pre-sorted list, and I've come up with methods that beat it easily in this special case. So, in short, is there any built-in way to inform Counter of the fact that a list is already sorted to further speed it up?

(The next section is on unsorted lists where Counter works magic; you may want to skip towards the end, where it loses its charm when dealing with sorted lists.)

The naive approach would be to use sorted([(word, corpus.count(word)) for word in set(corpus)]), but that reliably gets you into runtime problems as soon as your corpus is a few thousand items long - not surprisingly, since you're running through the entire list of n words m times, where m is the number of unique words.

So what I tried to do instead, before I found Counter, was to make sure that all searches are strictly local by first sorting the input list (I also have to remove digits and punctuation marks and convert all entries into lower case to avoid duplicates like 'foo', 'Foo', and 'foo:').

#Natural Language Toolkit, for access to corpus; any other source for a long text will do, though.
import nltk 

# nltk corpora come as a class of their own, as I understand it presenting to the
# outside as a unique list but underlyingly represented as several lists, with no more
# than one ever loaded into memory at any one time, which is good for memory issues 
# but rather not so for speed so let's disable this special feature by converting it
# back into a conventional list:
corpus = list(nltk.corpus.gutenberg.words()) 

import string
drop = string.punctuation+string.digits  

def wordcount5(corpus, Case=False, lower=False, StrippedSorted=False):
    '''function for extracting word frequencies out of a corpus. Returns an alphabetic list
    of tuples consisting of words contained in the corpus with their frequencies.  
    Default is case-insensitive, but if you need separate entries for upper and lower case 
    spellings of the same words, set option Case=True. If your input list is already sorted
    and stripped of punctuation marks/digits and/or all lower case, you can accelerate the 
    operation by a factor of 5 or so by declaring so through the options "StrippedSorted" and "lower".'''
    # you can ignore the following 6 lines for now, they're only relevant with a pre-processed input list
    if lower or Case:
        if StrippedSorted:
            sortedc = corpus 
        else:    
            sortedc = sorted([word.replace('--',' ').strip(drop)
                   for word in sorted(corpus)])
    # here we sort and purge the input list in the default case:
    else:
        sortedc = sorted([word.lower().replace('--',' ').strip(drop)
                          for word in sorted(corpus)])
    # start iterating over the (sorted) input list:
    scindex = 0
    # create a list:
    freqs = []
    # identify the first token:
    currentword = sortedc[0]
    length = len(sortedc)
    while scindex < length:
        wordcount = 0
        # increment a local counter while the tokens == currentword
        while scindex < length and sortedc[scindex] == currentword:
            scindex += 1
            wordcount += 1
        # store the current word and final score when a) a new word appears or
        # b) the end of the list is reached
        freqs.append((currentword, wordcount))
        # if a): update currentword with the current token
        if scindex < length:
            currentword = sortedc[scindex]
    return freqs



Finding collections.Counter

This is much better, but still not quite as fast as using the Counter class from the collections module, which creates a dictionary of {word: frequency of word} entries (we still have to do the same stripping and lowering, but no sorting):

from collections import Counter
cnt = Counter()
for word in [token.lower().strip(drop) for token in corpus]:
    cnt[word] += 1
# optionally, if we want to have the same output format as before,
# we can do the following which negligibly adds in runtime:
wordfreqs = sorted([(word, cnt[word]) for word in cnt])
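
(As an aside, and not part of the original comparison: the same dictionary can also be built by passing the iterable straight to Counter's constructor, which counts it in one go; a minimal equivalent of the loop above would be:)

# Counter's constructor counts the elements of any iterable directly:
cnt = Counter(token.lower().strip(drop) for token in corpus)
wordfreqs = sorted(cnt.items())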

On the Gutenberg corpus with approx. 2 million entries, the Counter method is roughly 30% faster on my machine (5 seconds as opposed to 7.2), which is mostly explained by the sorting routine, which eats around 2.1 seconds. (If you don't have and don't want to install the nltk package (Natural Language Toolkit), which offers access to this corpus, any other adequately long text appropriately split into a list of strings at word level will show you the same.)

With my idiosyncratic method of timing, using a tautology as a conditional to delay execution, this gives us for the Counter method:

>>> import time
>>> if 1:
...     start = time.time()
...     cnt = Counter()
...     for word in [token.lower().strip(drop) for token in corpus if token not in [" ", ""]]:
...         cnt[word] += 1
...     time.time()-start
...     cntgbfreqs = sorted([(word, cnt[word]) for word in cnt])
...     time.time()-start
... 
4.999882936477661
5.191655874252319

(We see that the last step, that of formatting the results as a list of tuples, takes up less than 5% of the total time.)

Compared to my function:

>>> if 1:
...     start = time.time()
...     gbfreqs = wordcount5(corpus)
...     time.time()-start
... 
7.261770963668823



Sorted input lists - where Counter 'fails'



But as you may have noticed, my function allows specifying that the input is already sorted, stripped of punctuation garbage, and converted to lower case. If we've already created such a converted version of the list for some other operation, using it (and declaring so) speeds up the operation of my wordcount5 considerably:

>>> sorted_corpus = sorted([token.lower().strip(drop) for token in corpus if token not in [" ", ""]])
>>> if 1:
...     start = time.time()
...     strippedgbfreqs2 = wordcount5(sorted_corpus, lower=True, StrippedSorted=True)
...     time.time()-start
... 
0.9050078392028809

Here, we've reduced runtime by a factor of approx. 8 by not having to sort the corpus and convert the items. Of course the latter is also true when feeding Counter with this new list, so as expected it's also a bit faster, but it doesn't seem to take advantage of the fact that the list is sorted, and it now takes twice as long as my function where it was 30% faster before:

>>> if 1:
...     start = time.time()
...     cnt = Counter()
...     for word in sorted_corpus:
...         cnt[word] += 1
...     time.time()-start
...     strippedgbfreqs = [(word, cnt[word])for word in cnt]
...     time.time()-start
... 
1.9455058574676514
2.0096349716186523

Of course, we can use the same logic I used in wordcount5 - incrementing a local counter until we run into a new word and only then storing the last word with the current state of the counter, and resetting the counter to 0 for the next word - using Counter merely as storage, but the inherent efficiency of the Counter method seems lost, and performance is within the range of my function's for creating a dictionary, with the extra burden of converting to a list of tuples now looking more troublesome than it did when we were processing the raw corpus:

>>> def countertest():
...     start = time.time()
...     sortedcnt = Counter()
...     c = 0
...     length = len(sorted_corpus)
...     while c < length:
...         wcount = 0
...         word = sorted_corpus[c]
...         while c < length and sorted_corpus[c] == word:
...             wcount+=1
...             c+=1
...         sortedcnt[word] = wcount
...         if c < length:
...             word = sorted_corpus[c]
...     print time.time()-start
...     return sorted([(word, sortedcnt[word]) for word in sortedcnt])
...     print time.time()-start
... 
>>> strippedbgcnt = countertest()
0.920727014542
1.08029007912

(The similarity of the results is not really surprising since we're in effect disabling Counter's own methods and abusing it as a store for values obtained with the very same methodology as before.)

Therefore, my question: Is there a more idiomatic way to inform Counter that its input list is already sorted and to make it thus keep the current key in memory rather than looking it up anew every time it - predictably - encounters the next token of the same word? In other words, is it possible to improve performance on a pre-sorted list further by combining the inherent efficiency of the Counter/dictionary class with the obvious benefits of a sorted list, or am I already scratching on a hard limit with .9 seconds for tallying a list of 2M entries?

There probably isn't a whole lot of room for improvement - I get times of around .55 seconds when doing the simplest thing I can think of that still requires iterating through the same list and checking each individual value, and .25 for set(corpus) without a count, but maybe there's some itertools magic out there that would help to get close to those figures?
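
(One candidate for that kind of itertools magic, sketched here rather than taken from the original post, is itertools.groupby: on a pre-sorted list it yields each distinct word together with the run of its consecutive occurrences, so counting reduces to measuring the length of each run. The helper name count_presorted is made up for illustration.)

from itertools import groupby

def count_presorted(sorted_words):
    # each group is a run of identical, consecutive tokens in the sorted list
    return [(word, sum(1 for _ in run)) for word, run in groupby(sorted_words)]

# e.g. count_presorted(sorted_corpus) should return the same (word, count)
# tuples as wordcount5(sorted_corpus, lower=True, StrippedSorted=True)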

(Note: I'm a relative novice to Python and to programming in general, so excuse me if I've missed something obvious.)

Another thing, besides the sorting itself, which makes all of my methods above slow, is converting every single one of 2M strings into lower case and stripping them of whatever punctuation or digits they may include. I have tried before to shortcut that by counting the unprocessed strings and only then converting the results and removing duplicates while adding up their counts, but I must have done something wrong, because it made things ever so slightly slower. I therefore reverted to the previous versions, converting everything in the raw corpus, and now can't quite reconstruct what I did there.

If I try it now, I do get an improvement from converting the strings last. I'm still doing it by looping over a list (of results). What I did was write a couple of functions that between them convert the keys in the output of J.F. Sebastian's winning default_dict method (of format [("word", int), ("Word", int), ("word2", int), ...]) into lower case, strip them of punctuation, and collapse the counts for all keys that are left identical after that operation (code below). The advantage is that we're now handling a list of roughly 50k entries as opposed to the > 2M in the corpus. This way I'm now at 1.25 seconds for going from the corpus (as a list) to a case-insensitive word count ignoring punctuation marks on my machine, down from about 4.5 with the Counter method and string conversion as a first step. But maybe there's a dictionary-based method for what I'm doing in sum_sorted() as well? (A possible sketch is shown after the code below.)

The code:

from collections import defaultdict

def striplast(resultlist, lower_or_Case=False):
    """function for string conversion of the output of any of the `count_words*` methods"""
    if lower_or_Case:
        strippedresult = sorted([(entry[0].strip(drop), entry[1]) for entry in resultlist])
    else:
        strippedresult = sorted([(entry[0].lower().strip(drop), entry[1]) for entry in resultlist])
    strippedresult = sum_sorted(strippedresult)
    return strippedresult

def sum_sorted(inputlist):
    """function for collapsing the counts of entries left identical by striplast()"""
    ilindex = 0
    freqs = []
    currentword = inputlist[0][0]
    length = len(inputlist)
    while ilindex < length:
        wordcount = 0
        while ilindex < length and inputlist[ilindex][0] == currentword:
            wordcount += inputlist[ilindex][1]
            ilindex += 1
        if currentword not in ["", " "]:
            freqs.append((currentword, wordcount))
        if ilindex < length and inputlist[ilindex][0] > currentword:
            currentword = inputlist[ilindex][0]
    return freqs

def count_words_defaultdict2(words, loc=False): 
    """modified version of J.F. Sebastian's winning method, added a final step collapsing
    the counts for words identical except for punctuation and digits and case (for the 
    latter, unless you specify that you're interested in a case-sensitive count by setting
    l(ower_)o(r_)c(ase) to True) by means of striplast()."""
    d = defaultdict(int)
    for w in words:
        d[w] += 1
    if loc:
        return striplast(sorted(d.items()), lower_or_Case=True)
    else:
        return striplast(sorted(d.items()))
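
(As a possible answer to the question above about a dictionary-based sum_sorted(): the collapsing step can also be done with a defaultdict instead of walking the sorted pairs manually. This is only a sketch; collapse_counts is a made-up name and not part of the original post.)

from collections import defaultdict

def collapse_counts(pairs):
    """Collapse the counts of (word, count) pairs whose words became identical after stripping."""
    d = defaultdict(int)
    for word, count in pairs:
        # skip the empty/whitespace keys, as sum_sorted() does
        if word not in ("", " "):
            d[word] += count
    return sorted(d.items())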

I made some first attempts at using groupby to do the job currently done by sum_sorted() and/or striplast(), but I couldn't quite work out how to trick it into summing entry[1] for a list of entries in count_words' results sorted by entry[0]. The closest I got was:

# "i(n)p(ut)list", toylist for testing purposes:

list(groupby(sorted([(entry[0].lower().strip(drop), entry[1]) for entry in  iplist])))

# returns:

[(('a', 1), <itertools._grouper object at 0x1031bb290>), (('a', 2), <itertools._grouper object at 0x1031bb250>), (('a', 3), <itertools._grouper object at 0x1031bb210>), (('a', 5), <itertools._grouper object at 0x1031bb2d0>), (('a', 8), <itertools._grouper object at 0x1031bb310>), (('b', 3), <itertools._grouper object at 0x1031bb350>), (('b', 7), <itertools._grouper object at 0x1031bb390>)]

# So what I used instead for striplast() is based on a list comprehension:

list(sorted([(entry[0].lower().strip(drop), entry[1]) for entry in iplist]))

# returns:

[('a', 1), ('a', 2), ('a', 3), ('a', 5), ('a', 8), ('b', 3), ('b', 7)]
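
(For what it's worth, one way to coax groupby into summing entry[1] per word is to group on entry[0] explicitly via a key function and then sum over each group - a sketch for illustration, using the same toy iplist as above, not from the original post:)

from itertools import groupby
from operator import itemgetter

cleaned = sorted((entry[0].lower().strip(drop), entry[1]) for entry in iplist)
summed = [(word, sum(count for _, count in group))
          for word, group in groupby(cleaned, key=itemgetter(0))]

# given the cleaned list shown above, this returns:
# [('a', 19), ('b', 10)]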


Answer

To answer the question from the title: Counter, dict, defaultdict, and OrderedDict are hash-based types: to look up an item they compute a hash for the key and use it to get the item. They even support keys that have no defined order, as long as they are hashable - i.e., Counter can't take advantage of pre-sorted input.

The measurements show that sorting the input words takes longer than counting the words with a dictionary-based approach and sorting the result, combined (times in seconds):

sorted                  3.19
count_words_Counter     2.88
count_words_defaultdict 2.45
count_words_dict        2.58
count_words_groupby     3.44
count_words_groupby_sum 3.52

Also, counting the words in already sorted input with groupby() takes only a fraction of the time it takes to sort the input in the first place, and it is faster than the dict-based approaches.

from collections import Counter, defaultdict
from itertools import groupby

def count_words_Counter(words):
    return sorted(Counter(words).items())

def count_words_groupby(words):
    return [(w, len(list(gr))) for w, gr in groupby(sorted(words))]

def count_words_groupby_sum(words):
    return [(w, sum(1 for _ in gr)) for w, gr in groupby(sorted(words))]

def count_words_defaultdict(words):
    d = defaultdict(int)
    for w in words:
        d[w] += 1
    return sorted(d.items())

def count_words_dict(words):
    d = {}
    for w in words:
        try:
            d[w] += 1
        except KeyError:
            d[w] = 1
    return sorted(d.items())

def _count_words_freqdist(words):
    # note: .items() returns words sorted by word frequency (decreasing order)
    #       (same as `Counter.most_common()`)
    #       so the code sorts twice (the second time in lexicographical order)
    return sorted(nltk.FreqDist(words).items())

To reproduce the results, run this code.

Note: It is 3 times faster if nltk's lazy sequence of words is converted to a list (WORDS = list(nltk.corpus.gutenberg.words())), but relative performance is the same:

sorted                  1.22
count_words_Counter     0.86
count_words_defaultdict 0.48
count_words_dict        0.54
count_words_groupby     1.49
count_words_groupby_sum 1.55

The results are similar to those in Python - is a dictionary slow to find the frequency of each character?

If you want to normalize the words (remove punctuation, make them lowercase, etc.), see answers to What is the most efficient way in Python to convert a string to all lowercase stripping out all non-ascii alpha characters?. Some examples:

from string import ascii_letters, ascii_lowercase, maketrans

def toascii_letter_lower_genexpr(s, _letter_set=ascii_lowercase):
    """
    >>> toascii_letter_lower_genexpr("ABC,-.!def")
    'abcdef'
    """
    return ''.join(c for c in s.lower() if c in _letter_set)

def toascii_letter_lower_genexpr_set(s, _letter_set=set(ascii_lowercase)):
    return ''.join(c for c in s.lower() if c in _letter_set)

def toascii_letter_lower_translate(s,
    table=maketrans(ascii_letters, ascii_lowercase * 2),
    deletechars=''.join(set(maketrans('', '')) - set(ascii_letters))):
    return s.translate(table, deletechars)

def toascii_letter_lower_filter(s, _letter_set=set(ascii_letters)):
    return filter(_letter_set.__contains__, s).lower()

To count and normalize the words simultaneously:

from collections import defaultdict
from itertools import imap

def combine_counts(items):
    d = defaultdict(int)
    for word, count in items:
        d[word] += count
    return d.iteritems()

def clean_words_in_items(clean_word, items):
    return ((clean_word(word), count) for word, count in items)

def normalize_count_words(words):
    """Normalize then count words."""
    return count_words_defaultdict(imap(toascii_letter_lower_translate, words))

def count_normalize_words(words):
    """Count then normalize words."""
    freqs = count_words_defaultdict(words)
    freqs = clean_words_in_items(toascii_letter_lower_translate, freqs)
    return sorted(combine_counts(freqs))
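
(A quick usage sketch, assuming WORDS is the Gutenberg word list converted to a list as above and that nltk is imported as in the question; the resulting numbers will of course depend on the corpus:)

WORDS = list(nltk.corpus.gutenberg.words())
fast = count_normalize_words(WORDS)   # count first, then normalize and combine
slow = normalize_count_words(WORDS)   # normalize every token first, then count
print fast[:3]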

Results

I've updated the benchmark to measure various combinations of count_words*() and toascii*() functions (5x4 pairs not shown):

toascii_letter_lower_filter      0.954 usec small
toascii_letter_lower_genexpr     2.44 usec small
toascii_letter_lower_genexpr_set 2.19 usec small
toascii_letter_lower_translate   0.633 usec small

toascii_letter_lower_filter      124 usec random 2000
toascii_letter_lower_genexpr     197 usec random 2000
toascii_letter_lower_genexpr_set 121 usec random 2000
toascii_letter_lower_translate   7.73 usec random 2000

sorted                  1.28 sec 
count_words_Counter     941 msec 
count_words_defaultdict 501 msec 
count_words_dict        571 msec 
count_words_groupby     1.56 sec 
count_words_groupby_sum 1.64 sec 

count_normalize_words 622 msec 
normalize_count_words 2.18 sec 

The fastest approaches:

• normalize words - toascii_letter_lower_translate()

• count words (presorted input) - groupby()-based approach

• count words - count_words_defaultdict()

• it is faster first to count the words and then to normalize them - count_normalize_words()

The latest version of the code: count-words-performance.py
