Word2vec fine-tuning
Question
I need to fine-tune my word2vec model. I have two datasets, `data1` and `data2`.
What I have done so far is:
```python
import gensim

# size_v, size_w, min_c, work, and epochs are hyperparameter
# variables defined elsewhere in my script.
model = gensim.models.Word2Vec(
    data1,
    size=size_v,
    window=size_w,
    min_count=min_c,
    workers=work)
model.train(data1, total_examples=len(data1), epochs=epochs)
model.train(data2, total_examples=len(data2), epochs=epochs)
```
Is this correct? Do I need to store learned weights somewhere?
I checked this answer and this one but I couldn’t understand how it’s done.
Can someone explain to me the steps to follow?
Answer
Note that you don't need to call `train()` with `data1` if you already provided `data1` at the time of model instantiation. The model will have already done its own internal `build_vocab()` and `train()` on the supplied corpus, using the default number of `epochs` (5) if you haven't specified one in the instantiation.
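As a minimal sketch of that point (assuming the gensim 4.x API, where `size` was renamed `vector_size` and `iter` became `epochs`, and using a toy corpus in place of real data), the one-step and two-step forms below are equivalent:

```python
from gensim.models import Word2Vec

data1 = [["hello", "word2vec"], ["fine", "tuning", "example"]]  # toy corpus

# Passing a corpus at construction already runs build_vocab() and train():
model = Word2Vec(
    data1,
    vector_size=100,
    window=5,
    min_count=1,
    workers=4,
    epochs=5,  # 5 is the default; no extra train() call is needed after this
)

# Equivalent explicit two-step form:
model = Word2Vec(vector_size=100, window=5, min_count=1, workers=4)
model.build_vocab(data1)
model.train(data1, total_examples=model.corpus_count, epochs=model.epochs)
```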
"Fine-tuning" is not a simple process with reliable steps assured to improve the model. It's very error-prone.
In particular, if words in `data2` aren't already known to the model, they'll be ignored. (There's an option to call `build_vocab()` with the parameter `update=True` to expand the known vocabulary, but such words aren't really on full equal footing with earlier words.)
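A hedged sketch of that vocabulary-expansion option, continuing from the model above (`data2` is again a toy stand-in):

```python
data2 = [["brand", "new", "tokens"], ["fine", "tuning"]]  # toy second corpus

# Expand the existing vocabulary with data2's words, then continue training.
model.build_vocab(data2, update=True)
model.train(data2, total_examples=len(data2), epochs=model.epochs)

# Caveat from the answer: the newly added words only ever see data2's
# contexts, so their vectors aren't on equal footing with data1's words.
```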
If `data2` includes some words but not others, only those in `data2` get updated via the additional training, which may essentially pull those words out of comparable alignment with other words that only appeared in `data1`. (Only the words trained together, in an interleaved shared training session, will go through the "push-pull" that in the end leaves them in useful arrangements.)
The safest course for incremental training would be to shuffle `data1` and `data2` together, and do the continued training on all the data, so that all words get new interleaved training together.
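A sketch of that safest course under the same toy setup; `random.shuffle` is just one simple way to interleave the two corpora:

```python
import random

# Interleave both corpora so every word's contexts are mixed during training.
combined = list(data1) + list(data2)
random.shuffle(combined)

model.build_vocab(combined, update=True)  # ensure all words are in the vocab
model.train(combined, total_examples=len(combined), epochs=model.epochs)
```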