Why is pos_tag() so painfully slow and can this be avoided?

Question

I want to be able to get the POS tags of sentences one by one, like this:

def __remove_stop_words(self, tokenized_text, stop_words):

    sentences_pos = nltk.pos_tag(tokenized_text)  
    filtered_words = [word for (word, pos) in sentences_pos 
                      if pos not in stop_words and word not in stop_words]

    return filtered_words

But the problem is that pos_tag() takes about a second for each sentence. There is another option to use pos_tag_sents() to do this batch-wise and speed things up. But my life would be easier if I could do this sentence by sentence.

Is there a way to do this faster?

Recommended answer

For nltk version 3.1, inside nltk/tag/__init__.py, pos_tag is defined like this:

from nltk.tag.perceptron import PerceptronTagger
def pos_tag(tokens, tagset=None):
    tagger = PerceptronTagger()
    return _pos_tag(tokens, tagset, tagger)    

So each call to pos_tag first instantiates PerceptronTagger which takes some time because it involves loading a pickle file. _pos_tag simply calls tagger.tag when tagset is None. So you can save some time by loading the file once, and calling tagger.tag yourself instead of calling pos_tag:

from nltk.tag.perceptron import PerceptronTagger

# Instantiate the tagger once, so the pickle file is loaded only once.
tagger = PerceptronTagger()

def __remove_stop_words(self, tokenized_text, stop_words, tagger=tagger):
    # Reuse the shared tagger instead of calling nltk.pos_tag.
    sentences_pos = tagger.tag(tokenized_text)
    filtered_words = [word for (word, pos) in sentences_pos
                      if pos not in stop_words and word not in stop_words]

    return filtered_words


pos_tag_sents uses the same trick as above -- it instantiates PerceptronTagger once before calling _pos_tag many times. So you'll get a comparable gain in performance using the above code as you would by refactoring and calling pos_tag_sents.
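
To make the effect of reusing one tagger concrete without needing NLTK or its model files, here is a minimal sketch. `FakeTagger` is a hypothetical stand-in invented for this illustration (it is not an NLTK class and does no real tagging); it only counts how often it is constructed, which is where the real PerceptronTagger would load its pickle file.

```python
class FakeTagger:
    """Hypothetical stand-in for an expensive-to-build tagger (not an NLTK class)."""

    loads = 0  # counts how many times the "model" was loaded

    def __init__(self):
        # In the real PerceptronTagger, this is where the pickle file is loaded.
        FakeTagger.loads += 1

    def tag(self, tokens):
        # Dummy tagging: label every token "X".
        return [(token, "X") for token in tokens]


def tag_per_call(tokens):
    # Mirrors the pos_tag definition quoted earlier: a new tagger on every call.
    return FakeTagger().tag(tokens)


shared_tagger = FakeTagger()  # built once, at import time


def tag_shared(tokens, tagger=shared_tagger):
    # Mirrors the suggested fix: reuse a single tagger instance.
    return tagger.tag(tokens)


sentences = [["a", "b"], ["c"], ["d", "e", "f"]]

FakeTagger.loads = 0
per_call = [tag_per_call(s) for s in sentences]
loads_per_call = FakeTagger.loads  # one load per sentence

FakeTagger.loads = 0
shared = [tag_shared(s) for s in sentences]
loads_shared = FakeTagger.loads  # no further loads

print(loads_per_call, loads_shared)
```

Both approaches return the same tags; only the number of constructions differs, and with a real tagger each construction is the slow part.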

Also, if stop_words is a long list, you may save a bit of time by making stop_words a set:

stop_words = set(stop_words)

since checking membership in a set (e.g. pos not in stop_words) is an O(1) (constant-time) operation on average, while checking membership in a list is an O(n) operation (i.e. it takes time that grows in proportion to the length of the list).
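
As a rough illustration of the set-versus-list point, the sketch below runs the same filter with both container types and times a single membership test. The stop-word list and the tagged tokens are made-up sample data, and `filter_words` is a hypothetical helper that copies the list comprehension from the question. Actual timings depend on the machine, so none are shown.

```python
import timeit

# Made-up data: a long stop-word list and a short tagged sentence.
stop_words_list = [f"word{i}" for i in range(10_000)]
stop_words_set = set(stop_words_list)
tagged = [("the", "DT"), ("word42", "NN"), ("cat", "NN"), ("sat", "VBD")]


def filter_words(tagged_tokens, stop_words):
    # Same comprehension as in the question; works with a list or a set.
    return [word for (word, pos) in tagged_tokens
            if pos not in stop_words and word not in stop_words]


filtered_list = filter_words(tagged, stop_words_list)
filtered_set = filter_words(tagged, stop_words_set)

# Worst case for the list: the item is absent, so every element is compared.
list_time = timeit.timeit(lambda: "missing" in stop_words_list, number=1000)
set_time = timeit.timeit(lambda: "missing" in stop_words_set, number=1000)

print(filtered_set)
print(f"list: {list_time:.6f}s, set: {set_time:.6f}s")
```

The filtered result is identical either way; converting to a set changes only how long each lookup takes.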
