How to count words in a corpus document


Question

I want to know the best way to count words in a document. If I have my own "corp.txt" corpus set up, how can I find out how frequently the words "students", "trust", and "ayre" occur in the file "corp.txt"? What could I use?

Would it be one of the following:

....
>>> full = nltk.Text(mycorpus.words('FullReport.txt'))
>>> fdist = FreqDist(full)
>>> fdist
<FreqDist with 34133 outcomes>
# How would I calculate how frequently the words
# "students", "trust", "ayre" occur in full?

Thanks, Ray

Answer

I would suggest looking into collections.Counter. Especially for large amounts of text, this does the trick and is limited only by the available memory. It counted 30 billion tokens in a day and a half on a computer with 12 GB of RAM. Pseudocode (the variable Words will in practice be some reference to a file or similar):

from collections import Counter

my_counter = Counter()
for word in Words:           # Words: some iterable of tokens, e.g. read from a file
    my_counter[word] += 1    # note: my_counter.update(word) would count characters, not whole words
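
To answer the original question directly, individual counts can then be read back by indexing the counter, which returns 0 for a word that never occurred. A minimal sketch, using the word list from the question:

# look up the frequencies the question asks about
for w in ('students', 'trust', 'ayre'):
    print(w, my_counter[w])  # a Counter returns 0 for unseen words

The same dictionary-style lookup also works on the FreqDist object from the question, e.g. fdist['students'].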

When finished, the word counts are in the dictionary my_counter, which can then be written to disk or stored elsewhere (in sqlite, for example).
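
As one illustration of the sqlite option, here is a minimal sketch using Python's standard-library sqlite3 module; the database file name and table name are illustrative, not part of the original answer:

import sqlite3

# persist the counts so they survive the process
conn = sqlite3.connect('word_counts.db')  # illustrative file name
conn.execute('CREATE TABLE IF NOT EXISTS counts (word TEXT PRIMARY KEY, count INTEGER)')
conn.executemany('INSERT OR REPLACE INTO counts (word, count) VALUES (?, ?)',
                 my_counter.items())
conn.commit()
conn.close()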

