How to count words in a corpus document


Question

I want to know the best way to count words in a document. If I have my own "corp.txt" corpus set up, how can I find out how frequently the words "students", "trust", and "ayre" occur in the file "corp.txt"? What could I use?

Would it be one of the following:

....
>>> full = nltk.Text(mycorpus.words('FullReport.txt'))
>>> fdist = FreqDist(full)
>>> fdist
<FreqDist with 34133 outcomes>
# How would I calculate how frequently the words
# "students", "trust", "ayre" occur in full?

Thanks, Ray

Answer

I would suggest looking into collections.Counter. Especially for large amounts of text, this does the trick and is limited only by the available memory. It counted 30 billion tokens in a day and a half on a computer with 12 GB of RAM. Pseudocode (the variable Words will in practice be some reference to a file or similar):

from collections import Counter

my_counter = Counter()
for word in Words:           # Words: some iterable of tokens, e.g. read from a file
    my_counter[word] += 1    # note: my_counter.update(word) would count characters, not whole words
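
To answer the original question directly, individual counts can then be read back by indexing the counter, which returns 0 for a word that never occurred. A minimal sketch, using the word list from the question:

# look up the frequencies the question asks about
for w in ('students', 'trust', 'ayre'):
    print(w, my_counter[w])  # a Counter returns 0 for unseen words

The same dictionary-style lookup also works on the FreqDist object from the question, e.g. fdist['students'].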

When finished, the word counts are in the dictionary my_counter, which can then be written to disk or stored elsewhere (in sqlite, for example).
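
As one illustration of the sqlite option, here is a minimal sketch using Python's standard-library sqlite3 module; the database file name and table name are illustrative, not part of the original answer:

import sqlite3

# persist the counts so they survive the process
conn = sqlite3.connect('word_counts.db')  # illustrative file name
conn.execute('CREATE TABLE IF NOT EXISTS counts (word TEXT PRIMARY KEY, count INTEGER)')
conn.executemany('INSERT OR REPLACE INTO counts (word, count) VALUES (?, ?)',
                 my_counter.items())
conn.commit()
conn.close()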

