Difference between Python's collections.Counter and nltk.probability.FreqDist

Question

I want to calculate the term frequencies of words in a text corpus. I've been using NLTK's word_tokenize followed by probability.FreqDist for some time to get this done. word_tokenize returns a list, which FreqDist converts to a frequency distribution. However, I recently came across the Counter class in collections (collections.Counter), which seems to do exactly the same thing. Both FreqDist and Counter have a most_common(n) method that returns the n most common words. Does anyone know if there's a difference between these two? Is one faster than the other? Are there cases where one would work and the other wouldn't?

Solution

nltk.probability.FreqDist is a subclass of collections.Counter.

From the docs:

A frequency distribution for the outcomes of an experiment. A frequency distribution records the number of times each outcome of an experiment has occurred. For example, a frequency distribution could be used to record the frequency of each word type in a document. Formally, a frequency distribution can be defined as a function mapping from each sample to the number of times that sample occurred as an outcome.

The inheritance is explicit in the code and, essentially, there is no difference in how a Counter and a FreqDist are initialized; see https://github.com/nltk/nltk/blob/develop/nltk/probability.py#L106

So speed-wise, creating a Counter and FreqDist should be the same. The difference in speed should be insignificant but it's good to note that the overheads could be:

  • the compilation of the class when it is defined in the interpreter
  • the cost of duck-typing .__init__()
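One rough way to check the claim about initialization overhead is to time the construction yourself. The sketch below is only an illustration: SubCounter is a hypothetical subclass standing in for FreqDist (so the snippet runs without NLTK), and time_build and the synthetic token list are likewise made up for this example. With NLTK installed, nltk.FreqDist could be added to the loop, and the numbers will vary by machine and version.

```python
import timeit
from collections import Counter


class SubCounter(Counter):
    """Hypothetical subclass standing in for FreqDist (not NLTK code)."""

    def __init__(self, samples=None):
        # One extra Python-level call before delegating to Counter.
        super().__init__(samples)


def time_build(cls, tokens, repeats=20):
    # Best-of-N wall-clock time for building cls(tokens) once.
    return min(timeit.repeat(lambda: cls(tokens), number=1, repeat=repeats))


# Synthetic "corpus": 20,000 tokens drawn from 500 distinct words.
tokens = ["w%d" % (i % 500) for i in range(20_000)]

for cls in (Counter, SubCounter):
    print(cls.__name__, time_build(cls, tokens))
```

Taking the minimum over several repeats is a common way to reduce noise from other processes; a single run is rarely reliable for differences this small.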

The major difference is the set of functions that FreqDist provides for statistical / probabilistic Natural Language Processing (NLP), e.g. finding hapaxes. The full list of attributes that FreqDist adds on top of Counter is as follows:

>>> from collections import Counter
>>> from nltk import FreqDist
>>> x = FreqDist()
>>> y = Counter()
>>> set(dir(x)).difference(set(dir(y)))
set(['plot', 'hapaxes', '_cumulative_frequencies', 'r_Nr', 'pprint', 'N', 'unicode_repr', 'B', 'tabulate', 'pformat', 'max', 'Nr', 'freq', '__unicode__'])
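To make a few of those names concrete, here is a minimal sketch of how a Counter subclass could provide N, B, freq and hapaxes. MiniFreqDist is a hypothetical class written for this illustration; it is not NLTK's implementation, and the real FreqDist may differ in details such as caching or edge-case handling.

```python
from collections import Counter


class MiniFreqDist(Counter):
    """Hypothetical stand-in for nltk.FreqDist; not NLTK's actual code."""

    def N(self):
        # Total number of sample outcomes (sum of all counts).
        return sum(self.values())

    def B(self):
        # Number of distinct samples ("bins").
        return len(self)

    def freq(self, sample):
        # Relative frequency of one sample; 0 for an empty distribution.
        total = self.N()
        return self[sample] / total if total else 0

    def hapaxes(self):
        # Samples that occur exactly once.
        return [sample for sample, count in self.items() if count == 1]


# str.split() is used here only as a stand-in for a real tokenizer.
tokens = "the cat sat on the mat the end".split()
fd = MiniFreqDist(tokens)

print(fd.N(), fd.B(), fd.freq("the"))
print(fd.hapaxes())
print(fd.most_common(1))  # inherited unchanged from Counter
```

Everything a plain Counter can do still works on the subclass, which is why the two are interchangeable when only counts are needed.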

When it comes to FreqDist.most_common(), it actually uses the method inherited from Counter, so the speed of retrieving the sorted most_common list is the same for both types.
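If you want to verify where a method comes from, you can walk the class's method resolution order. The sketch below uses a hypothetical PlainSubclass and a made-up helper, defining_class, so that it runs without NLTK; assuming FreqDist does not override most_common (as described above), passing FreqDist to the same helper should also report Counter.

```python
from collections import Counter


class PlainSubclass(Counter):
    """Hypothetical subclass that does not override most_common."""


def defining_class(cls, name):
    # Walk the method resolution order and return the first class
    # whose own namespace defines `name`, or None if nobody does.
    for klass in cls.__mro__:
        if name in vars(klass):
            return klass
    return None


print(defining_class(PlainSubclass, "most_common"))
```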

Personally, when I just want to retrieve counts, I use collections.Counter. But when I need to do some statistical manipulation, I either use nltk.FreqDist or I would dump the Counter into a pandas.DataFrame (see Transform a Counter object into a Pandas DataFrame).
