How to optimize word counting in Python?

Views: 105 | Posted: 2020/5/18 1:00:00 | Tags: python-2.7 optimization nlp nltk


Problem description

I'm taking my first steps writing code to do linguistic analysis of texts. I use Python and the NLTK library. The problem is that the actual word counting takes up close to 100% of my CPU (Intel Core i5, 8 GB RAM, MacBook Air 2014) and ran for 14 hours before I shut the process down. How can I speed up the looping and counting?

I have created a corpus in NLTK from three Swedish, UTF-8 encoded, tab-separated files: Swe_Newspapers.txt, Swe_Blogs.txt and Swe_Twitter.txt. It works fine:

import nltk
my_corpus = nltk.corpus.CategorizedPlaintextCorpusReader(".", r"Swe_.*", cat_pattern=r"Swe_(\w+)\.txt")

Then I loaded a text file with one word per line into NLTK. That also works fine:

my_wordlist = nltk.corpus.WordListCorpusReader("/Users/mos/Documents/", "wordlist.txt")

The text file I want to analyse (Swe_Blogs.txt) has this structure, and parses fine:

Wordpress.com   2010/12/08  3   1,4,11  osv osv osv …
bloggagratis.se 2010/02/02  3   0   Jag är utled på plogade vägar, matte är lika utled hon.
wordpress.com   2010/03/10  3   0   1 kruka Sallad, riven

Edit: The suggestion to build the counter as below does not work as written, but can be fixed:

counter = collections.Counter(word for word in my_corpus.words(categories=["Blogs"]) if word in my_wordlist)

This produces an error:

IOError                                   Traceback (most recent call last)
<ipython-input-41-1868952ba9b1> in <module>()
----> 1 counter = collections.Counter(word for word in my_corpus.words("Blogs") if word in my_wordlist)
       /usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/reader/plaintext.pyc in words(self, fileids, categories)
182     def words(self, fileids=None, categories=None):
183         return PlaintextCorpusReader.words(
--> 184             self, self._resolve(fileids, categories))
185     def sents(self, fileids=None, categories=None):
186         return PlaintextCorpusReader.sents(

/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/reader/plaintext.pyc in words(self, fileids, sourced)
 89                                            encoding=enc)
 90                            for (path, enc, fileid)
 ---> 91                            in self.abspaths(fileids, True, True)])
 92 
 93 
 /usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/reader/api.pyc in abspaths(self, fileids, include_encoding, include_fileid)
165             fileids = [fileids]
166 
--> 167         paths = [self._root.join(f) for f in fileids]
168 
169         if include_encoding and include_fileid:  

/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/data.pyc in join(self, fileid)
174     def join(self, fileid):
175         path = os.path.join(self._path, *fileid.split('/'))
--> 176         return FileSystemPathPointer(path)
177 
178     def __repr__(self):

/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/data.pyc in __init__(self, path)
152         path = os.path.abspath(path)
153         if not os.path.exists(path):
--> 154             raise IOError('No such file or directory: %r' % path)
155         self._path = path

IOError: No such file or directory: '/Users/mos/Documents/Blogs'

The traceback shows "Blogs" being passed positionally, so it is treated as a file ID rather than a category. A fix is to assign my_corpus.words(categories=["Blogs"]) to a variable:

blogs_text = my_corpus.words(categories=["Blogs"])

It's when I try to count all occurrences of each word in the wordlist (about 20K words) within the blogs in the corpus (115.7 MB) that my computer gets a little tired. How can I speed up the following code? It seems to work, with no error messages, but it takes more than 14 hours to execute.

import collections
counter = collections.Counter()

for word in my_corpus.words(categories="Blogs"):
    for token in my_wordlist.words():
        if token == word:
            counter[token]+=1
        else:
            continue

Any help to improve my coding skills is much appreciated!
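One likely reason the loop above is slow is that every corpus token is compared against all of the roughly 20K wordlist entries, and my_wordlist.words() is called again for each token. A minimal sketch of a common alternative is to build a set from the wordlist once and test membership against it, so each token costs a single hash lookup on average. The function name count_listed_words and the tiny stand-in data are hypothetical and only for illustration; with the real corpus one would presumably pass my_corpus.words(categories=["Blogs"]) and my_wordlist.words() instead. The speed-up has not been measured on the original data.

```python
import collections


def count_listed_words(tokens, wordlist):
    """Count how often each word from wordlist occurs in tokens.

    Hypothetical helper for illustration, not part of NLTK.
    """
    # Build the set once; membership tests are O(1) on average,
    # instead of scanning the whole wordlist for every token.
    wanted = set(wordlist)
    return collections.Counter(tok for tok in tokens if tok in wanted)


# Stand-in data loosely based on the sample lines from Swe_Blogs.txt.
tokens = ["osv", "osv", "osv", "1", "kruka", "Sallad", ",", "riven"]
wordlist = ["osv", "kruka", "hon"]

counter = count_listed_words(tokens, wordlist)
print(counter.most_common())
```

Words from the wordlist that never occur in the tokens simply do not appear in the counter, and looking them up returns 0, which matches the behaviour of the original nested loop.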
