Difference between Python's collections.Counter and nltk.probability.FreqDist


Question

I want to calculate the term frequencies of words in a text corpus. I've been using NLTK's word_tokenize followed by probability.FreqDist for some time to get this done. word_tokenize returns a list, which FreqDist converts to a frequency distribution. However, I recently came across the Counter class in collections (collections.Counter), which seems to do exactly the same thing. Both FreqDist and Counter have a most_common(n) method that returns the n most common words. Does anyone know if there's a difference between the two? Is one faster than the other? Are there cases where one would work and the other wouldn't?

Recommended answer

nltk.probability.FreqDist is a subclass of collections.Counter.

From the docs:

A frequency distribution for the outcomes of an experiment. A frequency distribution records the number of times each outcome of an experiment has occurred. For example, a frequency distribution could be used to record the frequency of each word type in a document. Formally, a frequency distribution can be defined as a function mapping from each sample to the number of times that sample occurred as an outcome.

The inheritance is visible in the source code, and there is essentially no difference in how a Counter and a FreqDist are initialized; see https://github.com/nltk/nltk/blob/develop/nltk/probability.py#L106
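As a rough illustration of what that inheritance implies, here is a minimal sketch using only the standard library. TinyFreqDist is a hypothetical stand-in written for this example, not NLTK's actual class; it only assumes that the subclass hands its samples straight to Counter for counting.

```python
from collections import Counter


class TinyFreqDist(Counter):
    """Hypothetical stand-in for a Counter subclass (not NLTK's code)."""

    def __init__(self, samples=None):
        # All of the counting is delegated to the parent class.
        Counter.__init__(self, samples)


tokens = "the cat sat on the mat".split()
plain = Counter(tokens)
derived = TinyFreqDist(tokens)

print(issubclass(TinyFreqDist, Counter))  # the subclass relationship
print(plain == derived)                   # both hold the same counts
print(derived["the"])                     # lookups behave like Counter's
```

Because the subclass adds nothing to the counting step, both objects end up holding identical counts for the same token list.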

So speed-wise, creating a Counter and creating a FreqDist should be about the same. Any difference should be insignificant, but it's worth noting where the overhead could come from:

  • the compilation of the class when it is defined in the interpreter
  • the cost of duck-typing .__init__()

The major difference is the set of functions that FreqDist provides for statistical / probabilistic Natural Language Processing (NLP), e.g. finding hapaxes. The attributes that FreqDist adds on top of Counter can be listed as follows (the output shown comes from a Python 2 session; the exact names vary with the Python and NLTK versions):

>>> from collections import Counter
>>> from nltk import FreqDist
>>> x = FreqDist()
>>> y = Counter()
>>> set(dir(x)).difference(set(dir(y)))
set(['plot', 'hapaxes', '_cumulative_frequencies', 'r_Nr', 'pprint', 'N', 'unicode_repr', 'B', 'tabulate', 'pformat', 'max', 'Nr', 'freq', '__unicode__'])
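To give a feel for what a few of those extras compute, here is a sketch written as plain functions over a Counter. The function names total_outcomes, relative_freq and hapaxes are chosen for this example; they mirror the meaning of N(), freq() and hapaxes() as I understand them, and are not NLTK's implementation.

```python
from collections import Counter


def total_outcomes(counts):
    """Total number of counted samples (the idea behind N())."""
    return sum(counts.values())


def relative_freq(counts, sample):
    """Share of all outcomes that are `sample` (the idea behind freq())."""
    total = total_outcomes(counts)
    return counts[sample] / total if total else 0.0


def hapaxes(counts):
    """Samples that occur exactly once."""
    return [sample for sample, count in counts.items() if count == 1]


counts = Counter("the cat sat on the mat".split())

print(total_outcomes(counts))        # 6 tokens in total
print(relative_freq(counts, "the"))  # 2 of the 6 tokens
print(sorted(hapaxes(counts)))       # words seen only once
```

None of this needs NLTK, which is why a plain Counter is often enough when only counts are required.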

As for FreqDist.most_common(), it is the parent method inherited from Counter, so retrieving the sorted most_common list takes the same time for both types.
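One way to check where a method really comes from is to walk the class's method resolution order. The sketch below does this with a hypothetical subclass, TinyFreqDist, and a helper named defining_class; both are invented for this example. The same helper could be pointed at nltk.FreqDist if NLTK is installed.

```python
from collections import Counter


class TinyFreqDist(Counter):
    """Hypothetical Counter subclass that adds one extra method."""

    def hapaxes(self):
        return [sample for sample, count in self.items() if count == 1]


def defining_class(cls, name):
    """Return the first class in cls's MRO that defines `name`, or None."""
    for klass in cls.__mro__:
        if name in vars(klass):
            return klass
    return None


print(defining_class(TinyFreqDist, "most_common"))  # defined on Counter
print(defining_class(TinyFreqDist, "hapaxes"))      # defined on the subclass
```

If the helper reports Counter for most_common, the subclass is running exactly the same code as a plain Counter for that call.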

Personally, when I just want to retrieve counts, I use collections.Counter. But when I need to do some statistical manipulation, I either use nltk.FreqDist or dump the Counter into a pandas.DataFrame (see Transform a Counter object into a Pandas DataFrame).
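A minimal sketch of the pandas route, assuming pandas is installed. The column names word, count and rel_freq are chosen for this example and are not prescribed by either library.

```python
from collections import Counter

import pandas as pd

counts = Counter("the cat sat on the mat".split())

# most_common() yields (word, count) pairs, which map directly onto rows.
df = pd.DataFrame(counts.most_common(), columns=["word", "count"])

# Once the counts are in a DataFrame, statistics are ordinary column operations.
df["rel_freq"] = df["count"] / df["count"].sum()

print(df)
```

From here the usual DataFrame tools (sorting, filtering, grouping, plotting) apply to the counts.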
