POS-Tagger is incredibly slow
Problem description
I am using nltk to generate n-grams from sentences by first removing given stop words. However, nltk.pos_tag() is extremely slow, taking up to 0.6 sec on my CPU (Intel i7).
The output:
['The first time I went, and was completely taken by the live jazz band and atmosphere, I ordered the Lobster Cobb Salad.']
0.620481014252
["It's simply the best meal in NYC."]
0.640982151031
['You cannot go wrong at the Red Eye Grill.']
0.644664049149
The code:
import time

import nltk
from nltk import word_tokenize
from nltk.util import ngrams

for sentence in source:
    nltk_ngrams = None
    if stop_words is not None:
        start = time.time()
        sentence_pos = nltk.pos_tag(word_tokenize(sentence))
        print time.time() - start
        filtered_words = [word for (word, pos) in sentence_pos if pos not in stop_words]
    else:
        filtered_words = ngrams(sentence.split(), n)
Is this really that slow or am I doing something wrong here?
Solution
Use pos_tag_sents for tagging multiple sentences:
>>> import time
>>> from nltk.corpus import brown
>>> from nltk import pos_tag
>>> from nltk import pos_tag_sents
>>> sents = brown.sents()[:10]
>>> start = time.time(); pos_tag(sents[0]); print time.time() - start
0.934092998505
>>> start = time.time(); [pos_tag(s) for s in sents]; print time.time() - start
9.5061340332
>>> start = time.time(); pos_tag_sents(sents); print time.time() - start
0.939551115036
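The gain comes from setting up the tagger once for the whole batch instead of once per pos_tag call. As a rough sketch of how the question's loop could be restructured (source, stop_words, and n are hypothetical stand-ins for the asker's variables, with stop_words assumed to hold POS tags to drop, matching the question's list comprehension):

import time

from nltk import pos_tag_sents, word_tokenize
from nltk.util import ngrams

# Hypothetical stand-ins for the question's variables:
source = [
    "The first time I went, I ordered the Lobster Cobb Salad.",
    "It's simply the best meal in NYC.",
    "You cannot go wrong at the Red Eye Grill.",
]
stop_words = {'DT', 'IN'}  # assumed: POS tags to filter out
n = 2

# Tokenize every sentence up front, then tag them all in one call,
# so the tagger is set up once rather than once per sentence.
start = time.time()
tagged_sents = pos_tag_sents([word_tokenize(s) for s in source])
print(time.time() - start)

for sentence_pos in tagged_sents:
    filtered_words = [word for (word, pos) in sentence_pos
                      if pos not in stop_words]
    nltk_ngrams = list(ngrams(filtered_words, n))

The filtering and n-gram steps are unchanged; only the tagging is hoisted out of the loop into a single pos_tag_sents call.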