Difference between Python's collections.Counter and nltk.probability.FreqDist
Question
I want to calculate the term-frequencies of words in a text corpus. I've been using NLTK's word_tokenize followed by probability.FreqDist for some time to get this done. word_tokenize returns a list, which is converted to a frequency distribution by FreqDist. However, I recently came across the Counter function in collections (collections.Counter), which seems to be doing the exact same thing. Both FreqDist and Counter have a most_common(n) function which returns the n most common words. Does anyone know if there's a difference between these two? Is one faster than the other? Are there cases where one would work and the other wouldn't?
nltk.probability.FreqDist is a subclass of collections.Counter.
From the docs:
A frequency distribution for the outcomes of an experiment. A frequency distribution records the number of times each outcome of an experiment has occurred. For example, a frequency distribution could be used to record the frequency of each word type in a document. Formally, a frequency distribution can be defined as a function mapping from each sample to the number of times that sample occurred as an outcome.
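Because FreqDist is Counter-compatible, that word-count behaviour can be sketched with the standard library alone. This is a minimal illustration; str.split stands in for nltk's word_tokenize here, and the sample sentence is made up:

```python
from collections import Counter

text = "the cat sat on the mat the cat"
# str.split stands in for nltk.word_tokenize in this stdlib-only sketch
tokens = text.split()
freq = Counter(tokens)

print(freq["the"])          # 3
print(freq.most_common(2))  # [('the', 3), ('cat', 2)]
```

A FreqDist built from the same token list would give identical counts and the same most_common(2) result.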
The inheritance is explicit in the code, and essentially there's no difference in how a Counter and a FreqDist are initialized; see https://github.com/nltk/nltk/blob/develop/nltk/probability.py#L106
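Since FreqDist reuses Counter's initialization path, the construction forms Counter accepts are a reasonable sketch of both (stdlib-only, so it runs without nltk installed):

```python
from collections import Counter

# Counter (and, by inheritance, FreqDist) can be built from
# an iterable of samples, a mapping of counts, or keyword arguments
a = Counter(["a", "b", "a"])
b = Counter({"a": 2, "b": 1})
c = Counter(a=2, b=1)

print(a == b == c)  # True
```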
So speed-wise, creating a Counter and a FreqDist should be the same. Any difference in speed should be insignificant, but it's worth noting that the overheads could be:
- the compilation of the class when defining it in the interpreter
- the cost of duck-typing in .__init__()
The major difference is the various functions that FreqDist provides for statistical / probabilistic Natural Language Processing (NLP), e.g. finding hapaxes. The full list of functions by which FreqDist extends Counter is as follows:
>>> from collections import Counter
>>> from nltk import FreqDist
>>> x = FreqDist()
>>> y = Counter()
>>> set(dir(x)).difference(set(dir(y)))
set(['plot', 'hapaxes', '_cumulative_frequencies', 'r_Nr', 'pprint', 'N', 'unicode_repr', 'B', 'tabulate', 'pformat', 'max', 'Nr', 'freq', '__unicode__'])
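Two of the extras above, hapaxes and freq, are simple enough to sketch with a plain Counter; this stdlib-only equivalent (not nltk's actual implementation, and using a made-up sentence) shows what they compute:

```python
from collections import Counter

counts = Counter("the cat sat on the mat".split())
n = sum(counts.values())

# hapaxes: samples that occur exactly once (what FreqDist.hapaxes() lists)
hapaxes = [w for w, c in counts.items() if c == 1]

# freq: relative frequency of a sample (what FreqDist.freq() computes)
freq_the = counts["the"] / n

print(sorted(hapaxes))  # ['cat', 'mat', 'on', 'sat']
print(freq_the)         # 0.3333...
```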
When it comes to using FreqDist.most_common(), it's actually using the parent function from Counter, so the speed of retrieving the sorted most_common list is the same for both types.
Personally, when I just want to retrieve counts, I use collections.Counter. But when I need to do some statistical manipulation, I either use nltk.FreqDist or dump the Counter into a pandas.DataFrame (see Transform a Counter object into a Pandas DataFrame).
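That Counter-to-DataFrame dump can be sketched as follows (a minimal version assuming pandas is available; the column names are arbitrary choices for illustration):

```python
import pandas as pd
from collections import Counter

counts = Counter("the cat sat on the mat".split())

# one row per word; sorting the items gives a deterministic row order
df = pd.DataFrame(sorted(counts.items()), columns=["word", "count"])
print(df)
```

From there, the usual pandas machinery (sorting, normalizing, plotting) applies to the counts.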