Gensim doc2vec file stream training worse performance

Problem description

Recently I switched to gensim 3.6, and the main reason was the optimized training process, which streams the training data directly from a file, thus avoiding the GIL performance penalties.

This is how I used to train my doc2vec:

from multiprocessing import cpu_count
from gensim.models.doc2vec import Doc2Vec

training_iterations = 20
d2v = Doc2Vec(vector_size=200, workers=cpu_count(), alpha=0.025, min_alpha=0.00025, dm=0)
d2v.build_vocab(corpus)  # corpus is an iterable of TaggedDocument

for epoch in range(training_iterations):
    # each call makes d2v.iter (default 5) passes over the corpus
    d2v.train(corpus, total_examples=d2v.corpus_count, epochs=d2v.iter)
    d2v.alpha -= 0.0002        # manually decay the learning rate
    d2v.min_alpha = d2v.alpha  # pin min_alpha so alpha is fixed within a call

And it classifies documents quite well; the only drawback is that while it is training, the CPUs are utilized at 70%.

So here is the new approach:

from multiprocessing import cpu_count
from gensim.models.doc2vec import Doc2Vec
from gensim.utils import save_as_line_sentence

corpus_fname = "spped.data"
save_as_line_sentence(corpus, corpus_fname)

# Choose num of cores that you want to use (let's use all, models scale linearly now!)
num_cores = cpu_count()

# Train models using all cores
d2v_model = Doc2Vec(corpus_file=corpus_fname, workers=num_cores, dm=0, vector_size=200, epochs=50)

Now all CPUs are utilized at 100%, but the model is performing very poorly. According to the documentation, I should also not call the train() method; I should only set an epoch count rather than iterations, and the min_alpha and alpha values should not be touched.

The configuration of both Doc2Vec models looks the same to me, so is there an issue with my new setup or configuration, or is there something wrong with the new version of gensim?

P.S. I am using the same corpus in both cases. I also tried an epoch count of 100, as well as smaller numbers like 5-20, but I had no luck.

EDIT: The first model was doing 20 iterations of 5 epochs each while the second was doing 50 epochs, so having the second model do 100 epochs made it perform even better, since I was no longer managing the alpha myself.

About the second issue that popped up: when providing a file with one document per line, the doc ids did not always correspond to the lines. I didn't manage to figure out what could be causing this; it seems to work fine for a small corpus. If I find out what I am doing wrong I will update this answer.
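
As a sanity check, here is a minimal sketch (assuming the gensim 3.6 API, and that corpus_fname was written by save_as_line_sentence so each document is tagged with its 0-based line number): re-infer a vector from a given line's own words and verify that the closest stored doc-vector carries that same line's tag:

def check_line(model, corpus_fname, line_no):
    # fetch the tokens of the requested line (save_as_line_sentence writes
    # space-separated tokens, one document per line)
    with open(corpus_fname, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i == line_no:
                words = line.split()
                break
    inferred = model.infer_vector(words)
    # if ids match lines, the top hit should be line_no itself
    return model.docvecs.most_similar([inferred], topn=3)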

The final configuration for a corpus of size 4GB looks like this:

    d2v = Doc2Vec(vector_size=200, workers=cpu_count(), alpha=0.025, min_alpha=0.00025, dm=0)
    d2v.build_vocab(corpus)
    d2v.train(corpus, total_examples=d2v.corpus_count, epochs=100)  # one train() call, 100 epochs

Answer

Most users should not be calling train() more than once in their own loop, where they try to manage the alpha & iterations themselves. It is too easy to do it wrong.

Specifically, your code where you call train() in a loop is doing it wrong. Whatever online source or tutorial you modeled this code on, you should stop consulting it, as it's misleading or outdated. (The notebooks bundled with gensim are better examples on which to base any code.)
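
For contrast, a minimal sketch of the usually recommended pattern (parameter values carried over from the question for illustration): build the vocabulary once, then make a single train() call and let gensim manage the alpha decay internally:

from multiprocessing import cpu_count
from gensim.models.doc2vec import Doc2Vec

d2v = Doc2Vec(vector_size=200, workers=cpu_count(), dm=0, epochs=20)
d2v.build_vocab(corpus)
# one call; alpha decays smoothly from its default start down to min_alpha
d2v.train(corpus, total_examples=d2v.corpus_count, epochs=d2v.epochs)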

Even more specifically: your looping code is actually doing 100 passes over the data: 20 iterations of your outer loop, times the default d2v.iter of 5 epochs per train() call. And your first train() call is smoothly decaying the effective alpha from 0.025 to 0.00025, a 100x reduction. But then your next train() call uses a fixed alpha of 0.0248 for 5 passes. Then 0.0246, etc., until your last loop does 5 passes at alpha=0.0212 – still about 85% of the starting value. That is, the lowest alpha (0.00025) will already have been reached at the end of your very first train() call, early in the overall training.
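
That schedule can be reproduced with a few lines of plain arithmetic (illustrative only, mirroring the loop in the question):

alpha, min_alpha = 0.025, 0.00025
for loop in range(20):
    if loop == 0:
        # the first call still decays smoothly: 0.025 -> 0.00025 over 5 passes
        print("loop  0: 5 passes, alpha decaying from %.4f to %.5f" % (alpha, min_alpha))
    else:
        # afterwards min_alpha == alpha, so each call runs at one fixed rate
        print("loop %2d: 5 passes at fixed alpha=%.4f" % (loop, alpha))
    alpha -= 0.0002
# final line printed: "loop 19: 5 passes at fixed alpha=0.0212"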

Call the two options exactly the same, except for the way the corpus is specified: a corpus_file path instead of an iterable corpus.
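
For example, a sketch of matched configurations (hypothetical, assuming corpus is an iterable of TaggedDocument and corpus_fname was produced by save_as_line_sentence from the very same documents):

d2v_iterable   = Doc2Vec(corpus, vector_size=200, workers=cpu_count(), dm=0, epochs=50)
d2v_filestream = Doc2Vec(corpus_file=corpus_fname, vector_size=200, workers=cpu_count(), dm=0, epochs=50)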

You should get similar results from both corpus forms. (If you have a reproducible test case where the same corpus gets very different-quality results, and there isn't some other error, that could be worth reporting to gensim as a bug.)

If the results for both aren't as good as when you were managing train() and alpha wrongly, it would likely be because you aren't doing a comparable amount of total training.
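
Since the original loop made roughly 100 total passes (20 outer loops x 5 epochs each), a fair comparison against the file-stream mode would request the same total (a sketch, with parameters carried over from the question):

d2v_model = Doc2Vec(corpus_file=corpus_fname, workers=cpu_count(), dm=0, vector_size=200, epochs=100)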
