Python code runs non-stop when processing text documents

Problem description

I am running the following code to process a list of documents. It is essentially just two nested for loops.

from nltk.tokenize import TreebankWordTokenizer
from gensim.models import KeyedVectors
from nlpia.loaders import get_data

# Load the 200,000 most frequent word2vec vectors
word_vectors = get_data('w2v', limit=200000)

def tokenize_and_vectorize(dataset):
    tokenizer = TreebankWordTokenizer()
    vectorized_data = []
    expected = []  # not used in this function
    for sample in dataset:
        # sample[1] holds the text of the document
        tokens = tokenizer.tokenize(sample[1])
        sample_vecs = []
        for token in tokens:
            try:
                sample_vecs.append(word_vectors[token])
            except KeyError:
                pass  # skip tokens that have no vector in the vocabulary
        vectorized_data.append(sample_vecs)
        # print(1)
    return vectorized_data

I then call the function to process the first 25,000 elements:

vectorized_data=tokenize_and_vectorize(dataset[0:25000])

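However, this code seems to run forever, because the * symbol never goes away. (Note: I did try running it on only 50 samples, and the result came back quickly.)

To see where it got stuck, I naively added print(1) just before return vectorized_data, so that it prints a 1 for each loop iteration. After 1 min 36 sec, I got all the results back.

A side observation on memory usage: without print(1), I saw that memory usage was high at the beginning and fell back to a normal level after a few minutes. I am not sure whether this means the process had finished, since the * symbol was still showing.

What causes this problem, and how can I fix it?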

Solution

I assume your dataset contains strings, i.e. lines of text, a book, etc. Each of your lines is therefore broken up into words, which are then turned into word vectors.

Processing may simply take a long time if your lines are very long or if you are trying to process many lines at once.
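One way to tell a slow run from a stuck one is to print progress at intervals instead of on every iteration. The following is only a minimal sketch under stated assumptions: the small word_vectors dict and the use of str.split() are hypothetical stand-ins for the real word2vec model and TreebankWordTokenizer, and the report_every parameter is not part of the original code.

```python
import time

# Hypothetical stand-in for the real word-vector model: a plain dict that
# maps a token to a tiny vector. The original code loads get_data('w2v').
word_vectors = {"good": [0.1, 0.2], "movie": [0.3, 0.4], "bad": [0.5, 0.6]}


def tokenize_and_vectorize(dataset, report_every=1000):
    """Vectorize samples and print progress every `report_every` samples."""
    vectorized_data = []
    start = time.perf_counter()
    for i, sample in enumerate(dataset, start=1):
        # str.split() stands in for TreebankWordTokenizer().tokenize()
        tokens = sample[1].split()
        # Keep only tokens that have a vector; unknown tokens are skipped
        sample_vecs = [word_vectors[t] for t in tokens if t in word_vectors]
        vectorized_data.append(sample_vecs)
        if i % report_every == 0:
            elapsed = time.perf_counter() - start
            print(f"{i}/{len(dataset)} samples, {elapsed:.1f}s elapsed", flush=True)
    print(f"done: {len(vectorized_data)} samples", flush=True)
    return vectorized_data


# Toy dataset in the same (label, text) shape that the original code indexes
dataset = [(1, "good movie"), (0, "bad movie indeed")] * 5
result = tokenize_and_vectorize(dataset, report_every=5)
```

Timing a small slice this way (for example the 50 samples that returned quickly) gives a rough per-sample cost, which can be scaled up to estimate how long 25,000 samples should take.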

As for your question about what the '*' means (source: answer by Gopi Kumar):

An asterisk on a Jupyter cell means that the cell is still waiting to run. Please check the preceding cells to see which one is currently running. It is possible you have an error in one of the previous cells. Also, if you see a dark circle at the top right of the browser, it means a cell is still executing. A clear circle means it is idle.
