Gensim Word2Vec uses too much memory


Question


I want to train a word2vec model on a 400MB tokenized file. I have been trying to run this Python code:

import gensim, logging

class Sentences(object):
    """Stream the corpus file, yielding one list of tokens per line."""
    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        for line in open(self.filename):
            yield line.split()

def runTraining(input_file, output_file):
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    sentences = Sentences(input_file)
    model = gensim.models.Word2Vec(sentences, size=200)
    model.save(output_file)


When I call this function on my file, I get this:

2017-10-23 17:57:00,211 : INFO : collecting all words and their counts
2017-10-23 17:57:04,071 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-10-23 17:57:16,116 : INFO : collected 4735816 word types from a corpus of 47054017 raw words and 1 sentences
2017-10-23 17:57:16,781 : INFO : Loading a fresh vocabulary
2017-10-23 17:57:18,873 : INFO : min_count=5 retains 290537 unique words (6% of original 4735816, drops 4445279)
2017-10-23 17:57:18,873 : INFO : min_count=5 leaves 42158450 word corpus (89% of original 47054017, drops 4895567)
2017-10-23 17:57:19,563 : INFO : deleting the raw counts dictionary of 4735816 items
2017-10-23 17:57:20,217 : INFO : sample=0.001 downsamples 34 most-common words
2017-10-23 17:57:20,217 : INFO : downsampling leaves estimated 35587188 word corpus (84.4% of prior 42158450)
2017-10-23 17:57:20,218 : INFO : estimated required memory for 290537 words and 200 dimensions: 610127700 bytes
2017-10-23 17:57:21,182 : INFO : resetting layer weights
2017-10-23 17:57:24,493 : INFO : training model with 3 workers on 290537 vocabulary and 200 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2017-10-23 17:57:28,216 : INFO : PROGRESS: at 0.00% examples, 0 words/s, in_qsize 0, out_qsize 0
2017-10-23 17:57:32,107 : INFO : PROGRESS: at 20.00% examples, 1314 words/s, in_qsize 0, out_qsize 0
2017-10-23 17:57:36,071 : INFO : PROGRESS: at 40.00% examples, 1728 words/s, in_qsize 0, out_qsize 0
2017-10-23 17:57:41,059 : INFO : PROGRESS: at 60.00% examples, 1811 words/s, in_qsize 0, out_qsize 0
Killed


I know that word2vec needs a lot of space, but I still think there is a problem here. As you can see, the estimated memory for this model is about 600MB, while my computer has 16GB of RAM. Yet monitoring the process while the code runs shows that it occupies all of my memory and then gets killed.
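(That 610127700-byte figure appears to decompose as two float32 weight matrices plus roughly 500 bytes of per-word vocabulary bookkeeping; a quick sanity check of the arithmetic:)

vocab, dims = 290537, 200
vectors     = vocab * dims * 4   # input vectors (syn0), float32
output      = vocab * dims * 4   # negative-sampling output weights (syn1neg)
bookkeeping = vocab * 500        # rough per-word vocabulary overhead
print(vectors + output + bookkeeping)   # 610127700, matching the log line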


As other posts advise, I have tried increasing min_count and decreasing size. But even with ridiculous values (min_count=50, size=10) the process stops at 60%.


I also tried to exempt Python from the OOM killer so that the process doesn't get killed. When I do that, I get a MemoryError instead of the kill.

What is going on?


(I use a recent laptop with Ubuntu 17.04, 16GB RAM and an Nvidia GTX 960M. I run Python 3.6 from Anaconda with gensim 3.0, but it doesn't do any better with gensim 2.3.)

Answer


Your file is a single line, as indicated by the log output:

2017-10-23 17:57:16,116 : INFO : collected 4735816 word types from a corpus of 47054017 raw words and 1 sentences


It is doubtful that this is what you want; in particular, the optimized Cython code in gensim's Word2Vec only handles sentences of up to 10,000 words, truncating them and discarding the rest. So most of your data isn't being considered during training (even if training were to finish).


But the bigger problem is that single 47-million-word line will come into memory as one gigantic string, then be split() into a 47-million-entry list-of-strings. So your attempt to use a memory-efficient iterator isn't helping any – the full file is being brought into memory, twice over, for a single 'iteration'.
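(For a sense of scale, here is a rough back-of-the-envelope sketch, assuming CPython 3.x on a 64-bit machine; exact object sizes vary by version, and only the order of magnitude matters:)

import sys

n_tokens   = 47054017                     # raw word count from the log above
per_token  = sys.getsizeof("example")     # ~56 bytes for a short ASCII str object
list_slots = n_tokens * 8                 # one pointer per list entry
token_objs = n_tokens * per_token         # one str object per token from split()
print((list_slots + token_objs) / 1024**3, "GiB, on top of the ~400MB source string")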


I still wouldn't expect that to use a full 16GB of RAM, but perhaps correcting it will resolve the issue, or make whatever remaining issues more evident.


If your tokenized data doesn't have natural line breaks at or below the 10,000-token sentence length, you can look at how LineSentence, the example corpus class included with gensim so it can work on the text8/text9 corpora (which also lack line breaks), limits each yielded sentence to 10,000 tokens:

https://github.com/RaRe-Technologies/gensim/blob/58b30d71358964f1fc887477c5dc1881b634094a/gensim/models/word2vec.py#L1620
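If you'd rather keep a small custom iterator, here is a minimal sketch of the same idea (ChunkedCorpus, max_tokens and block_size are made-up names, not anything from gensim): it reads the file in fixed-size blocks instead of whole lines, so neither the giant line nor the full token list ever sits in memory at once, and it caps each yielded sentence at 10,000 tokens the way LineSentence does.

from gensim.models import Word2Vec

class ChunkedCorpus(object):
    """Stream a whitespace-tokenized file in fixed-size blocks, yielding
    sentences of at most `max_tokens` tokens even if the file has no newlines."""
    def __init__(self, filename, max_tokens=10000, block_size=1024 * 1024):
        self.filename = filename
        self.max_tokens = max_tokens
        self.block_size = block_size

    def __iter__(self):
        sentence, leftover = [], ""
        with open(self.filename) as f:          # 'with' closes the file promptly
            while True:
                block = f.read(self.block_size)
                if not block:
                    break
                tokens = (leftover + block).split()
                # A block may end mid-token; keep the fragment for the next read.
                leftover = tokens.pop() if tokens and not block[-1].isspace() else ""
                for token in tokens:
                    sentence.append(token)
                    if len(sentence) >= self.max_tokens:
                        yield sentence
                        sentence = []
        if leftover:
            sentence.append(leftover)
        if sentence:
            yield sentence

model = Word2Vec(ChunkedCorpus("corpus.txt"), size=200)

Alternatively, gensim's own LineSentence can be pointed straight at the file (it takes a max_sentence_length argument, default 10,000), though it still builds each full line's token list in memory before slicing, so it addresses the truncation more than the memory spike.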


(It may not be a contributing factor, but you may also want to use the with context-manager to ensure your open()ed file is promptly closed after the iterator is exhausted.)
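For reference, the minimal with-based version of the question's Sentences class would look something like this (it fixes the file handling but not the single-line issue above):

class Sentences(object):
    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        # The context manager closes the file as soon as iteration finishes.
        with open(self.filename) as f:
            for line in f:
                yield line.split()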
