Word2vec fine-tuning

Problem description

I need to fine-tune my word2vec model. I have two datasets, data1 and data2.

What I have done so far is:

import gensim

# Instantiate on data1, then run additional training passes on data1 and data2.
model = gensim.models.Word2Vec(
        data1,
        size=size_v,
        window=size_w,
        min_count=min_c,
        workers=work)
model.train(data1, total_examples=len(data1), epochs=epochs)

model.train(data2, total_examples=len(data2), epochs=epochs)

Is this correct? Do I need to store learned weights somewhere?

I checked this answer and this one but I couldn’t understand how it’s done.

Can someone explain to me the steps to follow?

Recommended answer

Note you don't need to call train() with data1 if you already provided data1 at the time of model instantiation. The model will have already done its own internal build_vocab() and train() on the supplied corpus, using the default number of epochs (5) if you haven't specified one in the instantiation.
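
For illustration, a minimal sketch of that behaviour, assuming gensim 3.x (where the constructor takes size and iter; both were renamed in gensim 4.x) and an invented toy corpus standing in for data1:

import gensim

# Toy corpus standing in for data1: a list of tokenized sentences.
data1 = [["human", "interface", "computer"],
         ["survey", "user", "computer", "system"],
         ["graph", "trees", "minors"]]

# Passing the corpus to the constructor already runs build_vocab() and train()
# internally, so no separate model.train(data1, ...) call is needed afterwards.
model = gensim.models.Word2Vec(
        data1,
        size=100,      # vector dimensionality ("vector_size" in gensim 4.x)
        window=5,
        min_count=1,
        workers=4,
        iter=5)        # number of training epochs ("epochs" in gensim 4.x)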

"Fine-tuning" is not a simple process with reliable steps assured to improve the model. It's very error-prone.

In particular, if words in data2 aren't already known to the model, they'll be ignored. (There's an option to call build_vocab() with the parameter update=True to expand the known vocabulary, but such words aren't really on full equal footing with earlier words.)
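
A hedged sketch of that option, continuing the toy example above; the data2 sentences are invented and the epoch count is an arbitrary choice:

# Toy corpus standing in for data2, containing some previously unseen words.
data2 = [["graph", "paths", "minors"],
         ["user", "response", "time"]]

# update=True adds the new words to the existing vocabulary instead of
# rebuilding it from scratch (which would discard the trained vectors).
model.build_vocab(data2, update=True)

# Additional training passes over data2 only; note the caveat below about
# words that appear only in data1 being left untouched.
model.train(data2, total_examples=len(data2), epochs=5)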

If data2 includes some words, but not others, only those in data2 get updated via the additional training – which may essentially pull those words out of comparable alignment with other words that only appeared in data1. (Only the words trained together, in an interleaved shared training session, will go through the "push-pull" that in the end leaves them in useful arrangements.)

The safest course for incremental training would be to shuffle data1 and data2 together, and do the continued training on all the data: so that all words get new interleaved training together.
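
A sketch of that safer route with the same toy corpora; the shuffle and epoch count are illustrative rather than prescribed values:

import random

# Interleave the two corpora so old and new words get trained together and
# stay in comparable alignment.
combined = data1 + data2
random.shuffle(combined)

# Ensure every word from either corpus is in the vocabulary, then continue
# training the existing model on the full shuffled corpus.
model.build_vocab(combined, update=True)
model.train(combined, total_examples=len(combined), epochs=5)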
