gensim - Doc2Vec: MemoryError when training on English Wikipedia

Question

I extracted 145,185,965 sentences (14GB) out of the English Wikipedia dump and I want to train a Doc2Vec model based on these sentences. Unfortunately I have 'only' 32GB of RAM and get a MemoryError when trying to train. Even if I set the min_count to 50, gensim tells me that it would need over 150GB of RAM. I don't think that further increasing the min_count would be a good idea, because the resulting model would not be very good (just a guess). But anyway, I will try it with 500 to see if memory is sufficient then.

Is there any way to train such a large model with limited RAM?

Here is my current code:

from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument

# one pre-tokenized sentence per line; the line number becomes the doc-tag
corpus = TaggedLineDocument(preprocessed_text_file)

model = Doc2Vec(vector_size=300,
                window=15,
                min_count=50,
                workers=16,
                dm=0,              # PV-DBOW
                alpha=0.75,
                min_alpha=0.001,
                sample=0.00001,
                negative=5)
model.build_vocab(corpus)
model.train(corpus,
            epochs=400,
            total_examples=model.corpus_count,
            start_alpha=0.025,
            end_alpha=0.0001)

Am I making some obvious mistake? Am I using it completely wrong?

I could also try reducing the vector size, but I think this would give much worse results, as most papers use 300-dimensional vectors.

Answer

The required model size in addressable memory is largely a function of the number of weights required, which is determined by the number of unique words and unique doc-tags.

With 145,000,000 unique doc-tags, no matter how many words you limit yourself to, just the raw doc-vectors in-training alone will require:

145,000,000 * 300 dimensions * 4 bytes/dimension = 174GB
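
As a quick sanity check, here is a minimal sketch of that back-of-the-envelope arithmetic (assuming 4-byte float32 weights, which matches gensim's default; word vectors, hidden-layer weights, and vocabulary structures come on top of this):

# rough RAM estimate for the raw in-training doc-vectors alone
num_doc_tags = 145000000     # one tag per sentence from the Wikipedia dump
vector_size = 300            # dimensions per doc-vector
bytes_per_dim = 4            # float32

doc_vector_bytes = num_doc_tags * vector_size * bytes_per_dim
print(doc_vector_bytes / 1e9, "GB")   # ~174 GB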

You could try a smaller data set. You could reduce the vector size. You could get more memory.

I would try one or more of those first, just to verify that you can get things working and see some initial results.

There is one trick, best considered experimental, that may work to allow training larger sets of doc-vectors, at some cost of extra complexity and lower performance: the docvecs_mapfile parameter of Doc2Vec.

Normally, you don't want a Word2Vec/Doc2Vec-style training session to use any virtual memory, because any recourse to slower disk IO makes training extremely slow. However, for a large doc-set, which is only ever iterated over in one order, the performance hit may be survivable after making the doc-vectors array backed by a memory-mapped file. Essentially, each training pass sweeps through the file from front to back, reading each section in once and paging it out once.

If you supply a docvecs_mapfile argument, Doc2Vec will allocate the doc-vectors array to be backed by that on-disk file. So you'll have a hundreds-of-GB file on disk (ideally SSD) whose ranges are paged in/out of RAM as necessary.
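
As a rough sketch of how that might look with the setup from the question (the mapfile path is just an illustrative name, and the docvecs_mapfile keyword applies to the older gensim versions this answer refers to):

from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument

corpus = TaggedLineDocument(preprocessed_text_file)

# back the huge doc-vectors array with an on-disk memory-mapped file (ideally on SSD)
model = Doc2Vec(vector_size=300,
                window=15,
                min_count=50,
                workers=16,
                dm=0,
                sample=0.00001,
                negative=5,
                docvecs_mapfile='/mnt/ssd/wiki_docvecs.mmap')

model.build_vocab(corpus)
model.train(corpus,
            epochs=400,
            total_examples=model.corpus_count,
            start_alpha=0.025,
            end_alpha=0.0001)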

If you try this, be sure to experiment with this option on small runs first, to familiarize yourself with its operation, especially around saving/loading models.
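
For instance, a small trial run might exercise the full save/load cycle before committing to the whole corpus (a sketch; the filenames are illustrative, and gensim's save() writes large arrays to companion files next to the main model file):

# after training on a small slice of the corpus
model.save('doc2vec_wiki_trial.model')

# reload and spot-check that the doc-vectors are still usable
reloaded = Doc2Vec.load('doc2vec_wiki_trial.model')
print(reloaded.docvecs[0][:5])   # doc-vector for tag 0, as assigned by TaggedLineDocument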

Note also that if you then ever do a default most_similar() on the doc-vectors, another 174GB array of unit-normalized vectors must be created from the raw array. (You can force that to be done in-place, clobbering the existing raw values, by explicitly calling init_sims(replace=True) before any other method requiring the unit-normed vectors is called.)
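
A minimal sketch of that in-place normalization, assuming the doc-vectors live under the model's docvecs attribute as in the gensim versions this answer targets (it clobbers the raw vectors, so only do it once training is completely finished):

# replace the raw doc-vectors with their unit-normalized versions in place,
# instead of allocating a second ~174GB array
model.docvecs.init_sims(replace=True)

# similarity queries now work against the already-normalized array
print(model.docvecs.most_similar(positive=[0], topn=5))   # docs most similar to doc-tag 0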
