Faster way to remove stop words in Python

Problem description

I am trying to remove stopwords from a string of text:

    from nltk.corpus import stopwords
    text = 'hello bye the the hi'
    text = ' '.join([word for word in text.split() if word not in (stopwords.words('english'))])

I am processing 6 million such strings, so speed is important. Profiling my code shows that the slowest part is the lines above. Is there a better way to do this? I'm thinking of using something like re.sub, but I don't know how to write the pattern for a set of words. Can someone give me a hand? I'm also happy to hear about other, possibly faster methods.

Note: I tried someone's suggestion of wrapping stopwords.words('english') with set(), but that made no difference.

Thanks.

Recommended answer

Try caching the stopwords object, as shown below. Constructing this each time you call the function seems to be the bottleneck.

    from nltk.corpus import stopwords

    # Load the stop-word list once, at import time.
    cachedStopWords = stopwords.words("english")

    def testFuncOld():
        # Rebuilds the stop-word list for every word in the text.
        text = 'hello bye the the hi'
        text = ' '.join([word for word in text.split() if word not in stopwords.words("english")])

    def testFuncNew():
        # Reuses the cached list.
        text = 'hello bye the the hi'
        text = ' '.join([word for word in text.split() if word not in cachedStopWords])

    if __name__ == "__main__":
        for i in range(10000):  # range replaces Python 2's xrange
            testFuncOld()
            testFuncNew()

I ran this through the profiler: python -m cProfile -s cumulative test.py. The relevant lines are posted below.

    nCalls  Cumulative Time
    10000   7.723   words.py:7(testFuncOld)
    10000   0.140   words.py:11(testFuncNew)

So, caching the stop-word list gives a roughly 55x speedup (7.723 s versus 0.140 s cumulative time).
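
Two further variations on the same caching idea may be worth profiling; neither comes from the answer above, so treat this as a sketch. The first stores the cached words in a frozenset, so each membership test is a hash lookup instead of a list scan. Wrapping the call in set() inside the comprehension would still rebuild it for every word, which may be why it made no difference in the question. The second builds the re.sub pattern the question asks about. STOP_WORDS here is a small hypothetical stand-in so the example runs without NLTK data; in practice it would be built once from stopwords.words("english"). The function names are illustrative.

```python
import re

# Hypothetical stand-in list; in practice build this once from
# stopwords.words("english") (nltk.corpus) instead of hard-coding it.
STOP_WORDS = frozenset(["the", "a", "an", "and", "is", "of", "to", "in"])

# One alternation of escaped words, bounded by \b so that "the" does not
# match inside "theory"; the trailing \s* also removes the whitespace
# that follows a deleted word.
STOP_WORD_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, sorted(STOP_WORDS))) + r")\b\s*"
)


def remove_stop_words_set(text):
    """Filter whitespace-separated tokens using set membership."""
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


def remove_stop_words_regex(text):
    """Delete stop words with a single precompiled pattern."""
    return STOP_WORD_RE.sub("", text).strip()


if __name__ == "__main__":
    sample = "hello bye the the hi"
    print(remove_stop_words_set(sample))
    print(remove_stop_words_regex(sample))
```

The two functions are not strictly equivalent: the set version compares whole whitespace-separated tokens, while the regex version uses word boundaries, so it also removes a stop word that sits next to punctuation. Both are case-sensitive as written.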
