How to quickly get the collection of words in a corpus (with nltk)?
Problem description
I would like to quickly build a word look-up table for a corpus with nltk. Below is what I am doing:
- Read in the raw text: file = open("corpus", "r").read().decode('utf-8')
- Get all tokens with a = nltk.word_tokenize(file);
- Get the unique tokens with set(a), and implicitly convert them back to a list.
Is this the right way of doing this task?
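The three steps above can be sketched in self-contained Python 3 (no .decode() needed, since read() already returns str). Here io.StringIO stands in for the corpus file, and str.split() stands in for nltk.word_tokenize so the sketch runs without nltk:

```python
import io

# Stand-in for open("corpus", encoding="utf-8"); any text stream works.
fin = io.StringIO("the cat sat on the mat the cat")

text = fin.read()         # step 1: read the raw text (already str in Python 3)
tokens = text.split()     # step 2: tokenize; swap in nltk.word_tokenize(text)
uniq = list(set(tokens))  # step 3: unique tokens, converted back to a list
```

With nltk installed, replacing text.split() with word_tokenize(text) gives the exact pipeline from the question.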
Recommended answer
Try:
import time
from collections import Counter
from nltk import FreqDist
from nltk.corpus import brown
from nltk import word_tokenize

def time_uniq(maxchar):
    # Take the first `maxchar` characters.
    words = brown.raw()[:maxchar]
    # Time to tokenize.
    start = time.time()
    words = word_tokenize(words)
    print(time.time() - start)
    # Using collections.Counter.
    start = time.time()
    x = Counter(words)
    uniq_words = x.keys()
    print(time.time() - start)
    # Using nltk.FreqDist.
    start = time.time()
    fd = FreqDist(words)
    uniq_words = fd.keys()
    print(time.time() - start)
    # If you don't need frequency info, use set().
    start = time.time()
    uniq_words = set(words)
    print(time.time() - start)

# One run per corpus size; the output below shows three runs of four timings.
for maxchar in (10000, 100000, 1000000):
    time_uniq(maxchar)
[Output]:
~$ python test.py
0.0413908958435
0.000495910644531
0.000432968139648
9.3936920166e-05
0.10734796524
0.00458407402039
0.00439405441284
0.00084400177002
1.12890005112
0.0492491722107
0.0490930080414
0.0100378990173
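As the timings show, set() is fastest because it only deduplicates; Counter and FreqDist cost more because they also count. All three yield the same unique-token collection, which a minimal stdlib check (no nltk needed) confirms:

```python
from collections import Counter

words = ["the", "cat", "sat", "on", "the", "mat"]

counts = Counter(words)
# The keys of the Counter are exactly the unique tokens...
assert set(counts.keys()) == set(words)
# ...but the Counter additionally carries frequency information.
assert counts["the"] == 2
```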
To load your own corpus file (assuming that your file is small enough to fit into RAM):
from collections import Counter
from nltk import FreqDist, word_tokenize

with open('myfile.txt', 'r') as fin:
    # Read the file once; a file object is exhausted after .read().
    text = fin.read()

tokens = word_tokenize(text)

# Using Counter.
x = Counter(tokens)
uniq = x.keys()

# Using FreqDist.
fd = FreqDist(tokens)
uniq = fd.keys()

# Using set.
uniq = set(tokens)
If the file is too big, you may want to process it one line at a time:
from collections import Counter
from nltk import word_tokenize

# Using Counter.
x = Counter()
with open('myfile.txt', 'r') as fin:
    for line in fin:  # a file object iterates line by line
        x.update(word_tokenize(line))
uniq = x.keys()

# Using set.
x = set()
with open('myfile.txt', 'r') as fin:
    for line in fin:
        x.update(word_tokenize(line))
uniq = x  # a set has no .keys(); it already holds the unique tokens
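The line-by-line pattern can be exercised end to end with a small in-memory stream (io.StringIO standing in for myfile.txt, and str.split for word_tokenize, so the sketch runs without nltk):

```python
import io
from collections import Counter

# Stand-in for open('myfile.txt'); three lines of text.
fin = io.StringIO("the cat sat\non the mat\nthe cat\n")

counts = Counter()
for line in fin:                 # file objects iterate line by line
    counts.update(line.split())  # swap in word_tokenize(line) with nltk

uniq = set(counts)               # unique tokens; counts[token] gives frequency
```

Only one line is held in memory at a time, so this scales to files far larger than RAM.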