Gensim Word2Vec Model trained but not saved


Question

I am using gensim and executed the following code (simplified):

model = gensim.models.Word2Vec(...)
model.build_vocab(sentences)
model.train(...)
model.save('file_name')

After several days, my code finished model.train(...). However, during saving, I experienced:

Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

I noticed that there were some npy files generated:

<...>.trainables.syn1neg.npy
<...>.trainables.vectors_lockf.npy
<...>.wv.vectors.npy

Are those intermediate results I can re-use, or do I have to rerun the entire process?

Answer

Those are parts of the saved model, but unless the master file_name file (a Python-pickled object) exists and is complete, they may be hard to re-use.

However, if your primary interest is the final word-vectors, those are in the .wv.vectors.npy file. If it appears to be full-length (the same size as the syn1neg file), it may be complete. What you're missing is the dict that tells you which word is in which index.
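
A quick way to check whether the arrays look complete (a minimal sketch, assuming the stray files sit next to the 'file_name' save path from the question) is to memory-map them with numpy and compare shapes:

import numpy as np

# mmap_mode='r' reads the header and maps the data without loading it all into RAM
vectors = np.load('file_name.wv.vectors.npy', mmap_mode='r')
syn1neg = np.load('file_name.trainables.syn1neg.npy', mmap_mode='r')

print(vectors.shape)  # expect (vocab_size, vector_size)
print(syn1neg.shape)  # should match vectors.shape if both files are full-length

# A ValueError on load, or mismatched row counts, suggests a truncated file.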

So, the following might work (a sketch of the full recipe appears below):

  1. Repeat the original process, with the exact same corpus & model parameters, but only through the build_vocab() step. At that point, the new model.wv.vocab dict should be identical to the one from the failed-save run.

  2. Save that model, without ever train()ing it, to a new filename.

  3. After confirming that newmodel.wv.vectors.npy (with randomly-initialized untrained vectors) is the same size as oldmodel.wv.vectors.npy, copy the oldmodel file to the newmodel's name.

  4. Re-load the new model, and run some sanity checks that the words make sense.

  5. Perhaps, save off just the word-vectors, using something like newmodel.wv.save() or newmodel.wv.save_word2vec_format().

Potentially, the resurrected newmodel could also be patched to use the old syn1neg file as well, if it appears complete. It might work to further train the patched model (either with or without having reused the older syn1neg).
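
Put together, the recipe might look something like the following rough sketch. It assumes gensim 3.x (the release that writes the .trainables.*/.wv.* files), that sentences and the constructor parameters exactly match the original run, that the new save also splits its large arrays into .npy files as the original did, and that 'some_frequent_word' is a hypothetical stand-in for a word known to be in the vocabulary:

import os
import shutil
from gensim.models import Word2Vec

# Step 1: rebuild an untrained model with the identical vocabulary;
# the parameter values here are placeholders, use exactly those of the failed run
newmodel = Word2Vec(size=100, window=5, min_count=5, workers=4)
newmodel.build_vocab(sentences)

# Step 2: save it, untrained, under a new filename
newmodel.save('newmodel')

# Step 3: confirm the untrained vectors file matches the old one's size,
# then overwrite it with the trained vectors from the failed run
assert os.path.getsize('newmodel.wv.vectors.npy') == os.path.getsize('file_name.wv.vectors.npy')
shutil.copyfile('file_name.wv.vectors.npy', 'newmodel.wv.vectors.npy')

# Step 4: re-load and sanity-check the resurrected vectors
newmodel = Word2Vec.load('newmodel')
print(newmodel.wv.most_similar('some_frequent_word'))

# Step 5: save off just the word-vectors
newmodel.wv.save_word2vec_format('vectors.txt')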

Separately: only the very largest corpuses, or an installation missing the gensim cython optimizations, or a machine without enough RAM (and thus swapping during training), would usually require a training session taking days. You might be able to run much faster. Check:

  • Is any virtual-memory swapping happening during the entire training? If it is, it will be disastrous for training throughput, and you should use a machine with more RAM or be more aggressive about trimming the vocabulary/model size with a higher min_count. (Smaller min_count values mean a larger model, slower training, poor-quality vectors for words with just a few examples, and also counterintuitively worse-quality vectors for more-frequent words too, because of interference from the noisy rare words. It's usually better to ignore lowest-frequency words.)

  • Is there any warning displayed about a "slow version" (pure Python with no effective multi-threading) being used? If so, your training will be ~100X slower than if that problem is resolved. If the optimized code is available, maximum training throughput will likely be achieved with some workers value between 3 and 12 (but never larger than the number of machine CPU cores).

  • For a very large corpus, the sample parameter can be made more aggressive – such as 1e-04 or 1e-05 instead of the default 1e-03 – and it may both speed training and improve vector quality, by avoiding lots of redundant overtraining of the most-frequent words.

Good luck!

