When should I consider using pretrained word2vec model weights?

Problem description

Suppose my corpus is reasonably large - having tens of thousands of unique words. I can either use it to build a word2vec model directly (Approach #1 in the code below) or initialize a new word2vec model with pretrained model weights and fine-tune it on my own corpus (Approach #2). Is Approach #2 worth considering? If so, is there a rule of thumb for when I should consider a pretrained model?

# Approach #1: train word2vec from scratch on my own corpus
from gensim.models import Word2Vec
model = Word2Vec(my_corpus, vector_size=300, min_count=1)

# Approach #2: seed a new model with the pretrained GoogleNews vectors,
# then continue training on my own corpus
model = Word2Vec(vector_size=300, min_count=1)
model.build_vocab(my_corpus)
# Note: intersect_word2vec_format() is experimental; in Gensim 4.x it lives
# on model.wv rather than the model, and it has been absent or broken in
# some releases - check your installed version
model.wv.intersect_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True, lockf=1.0)
# Gensim 4.x requires an explicit epochs argument to train()
model.train(my_corpus, total_examples=len(my_corpus), epochs=model.epochs)

Answer

The general answer to this type of question is: you should try them both, and see which works better for your purposes.

No one without your exact data & project goals can be sure which will work better in your situation, and you'll need the same kind of ability to evaluate alternate choices in order to do all sorts of very basic, necessary tuning of your work.
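
For example, here is a minimal sketch of such a side-by-side evaluation, assuming the two trained models are bound to the hypothetical names model_1 and model_2. It uses the generic WordSim-353 word-similarity pairs bundled with Gensim's test data; a task-specific evaluation on your own data would be more meaningful, but the shape of the comparison is the same:

from gensim.test.utils import datapath

# model_1/model_2 are assumed to be the two trained Word2Vec models
# from the question (from-scratch vs. seeded-with-pretrained-vectors)
for name, m in [("from-scratch", model_1), ("fine-tuned", model_2)]:
    # evaluate_word_pairs returns (pearson, spearman, oov_ratio_percent);
    # pearson and spearman are (statistic, p-value) pairs
    pearson, spearman, oov = m.wv.evaluate_word_pairs(datapath("wordsim353.tsv"))
    print(f"{name}: spearman={spearman[0]:.3f}, oov={oov:.1f}%")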

Separately:

  • "fine-tuning" word2vec-vectors can mean many things, and can introduce a number of expert-leve thorny tradeoff-decisions - the sorts of tradeoffs that can only be navigated if you've got a robust way to test different choices against each other.
  • The specific simple tuning approach your code shows - which relies on an experimental method (intersect_word2vec_format()) that might not work in the latest Gensim – is pretty limited, and since it discards all the words in the outside vectors that aren't already in your own corpus, also discards one of the major reasons people often want to mix older vectors in - to cover more words not in their training data. (I doubt that approach will be useful in many cases, but as per above, to be sure you'd want to try it with respect to your data/goals.
  • It's almost always a bad idea to use min_count=1 with word2vec & similar algorithms. If such rare words are truly important, find more training examples so good vectors can be trained for them. But without enough training examples, they're usually better to ignore - keeping them even makes the vectors for surrounding words worse.
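
As a rough check of those points - a minimal sketch, assuming my_corpus is an iterable of token lists and the GoogleNews file is available locally - you can count how many corpus words the pretrained vectors cover, and how many a stricter min_count would keep:

from collections import Counter
from gensim.models import KeyedVectors

# Count word frequencies across the corpus
freq = Counter(tok for sent in my_corpus for tok in sent)

# Load only the pretrained vectors (no training state)
pretrained = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

covered = sum(1 for w in freq if w in pretrained.key_to_index)
kept = sum(1 for w, c in freq.items() if c >= 5)  # e.g. min_count=5
print(f"{covered}/{len(freq)} corpus words have a pretrained vector")
print(f"{kept}/{len(freq)} corpus words would survive min_count=5")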
