Word2vec fine-tuning

Problem description

I am new to working with word2vec, and I need to fine-tune my word2vec model.

I have 2 datasets, data1 and data2. What I have done so far is:

import gensim

model = gensim.models.Word2Vec(
        data1,
        size=size_v,
        window=size_w,
        min_count=min_c,
        workers=work)
model.train(data1, total_examples=len(data1), epochs=epochs)

model.train(data2, total_examples=len(data2), epochs=epochs)

Is this correct? Do I need to store the learned weights somewhere?

I checked this answer and this one, but I couldn't understand how it's done.

Can someone explain to me the steps to follow?

Thanks in advance.

Recommended answer

Note that you don't need to call train() with data1 if you already provided data1 at model instantiation. The model will have already done its own internal build_vocab() and train() on the supplied corpus, using the default number of epochs (5) if you didn't specify one in the instantiation.
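For illustration, here is a minimal sketch of the two equivalent ways to do that initial training. It reuses the variable names from the question and assumes the asker's gensim 3.x parameter names (in gensim 4.x, size became vector_size and the constructor keyword for epochs is epochs rather than iter):

import gensim

# One-shot: passing data1 here already runs build_vocab() and train()
# internally; iter= sets the number of epochs instead of the default 5.
model = gensim.models.Word2Vec(
        data1,
        size=size_v,
        window=size_w,
        min_count=min_c,
        workers=work,
        iter=epochs)

# Equivalent explicit form, separating the steps:
model = gensim.models.Word2Vec(
        size=size_v, window=size_w, min_count=min_c, workers=work)
model.build_vocab(data1)
model.train(data1, total_examples=model.corpus_count, epochs=epochs)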

"Fine-tuning" is not a simple process with reliable steps assured to improve the model; it's very error-prone.

In particular, if words in data2 aren't already known to the model, they'll be ignored. (There is an option to call build_vocab() with the parameter update=True to expand the known vocabulary, but such words aren't really on full equal footing with earlier words.)
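A hedged sketch of that vocabulary-expanding route, assuming model has already been trained on data1 as above and that data2 is an in-memory list of tokenized sentences:

# Add data2's new words to the existing vocabulary, then continue training.
model.build_vocab(data2, update=True)
model.train(data2, total_examples=len(data2), epochs=epochs)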

If data2 includes some words, but not others, only those in data2 get updated via the additional training – which may essentially pull those words out of comparable alignment with other words that only appeared in data1. (Only words trained together, in an interleaved shared training session, go through the "push-pull" that in the end leaves them in useful arrangements.)

The safest course for incremental training would be to shuffle data1 and data2 together and do the continued training on all the data, so that all words get new interleaved training together.
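A minimal sketch of that safest route, assuming data1 and data2 are in-memory lists of tokenized sentences (if they are streaming iterables, materialize them first) and again using the gensim 3.x parameter names from the question:

import random

import gensim

# Interleave the two corpora and train a fresh model on everything.
combined = list(data1) + list(data2)
random.shuffle(combined)

model = gensim.models.Word2Vec(
        combined,
        size=size_v,
        window=size_w,
        min_count=min_c,
        workers=work,
        iter=epochs)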
