How to convert a gensim Word2Vec model to a FastText model?

Question

I have a Word2Vec model that was trained on a huge corpus. While using this model in a neural network application, I came across quite a few out-of-vocabulary (OOV) words. Now I need to find word embeddings for these OOV words. After some googling, I found that Facebook has recently released the FastText library for this. My question is: how can I convert my existing Word2Vec model or KeyedVectors to a FastText model?

Recommended answer

FastText is able to create vectors for subword fragments by including those fragments in the initial training, from the original corpus. Then, when it encounters an out-of-vocabulary (OOV) word, it constructs a vector for that word from the fragments it recognizes. For languages with recurring word-root/prefix/suffix patterns, this results in vectors that are better than random guesses for OOV words.
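To make the subword idea concrete, here is a minimal sketch of how a word can be split into character n-grams with boundary markers, in the style FastText uses. The function name `char_ngrams` is illustrative and is not the library's API. The real implementation also hashes n-grams into a fixed number of buckets, which is left out here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>' markers.

    The defaults mirror FastText's usual n-gram length range of 3 to 6;
    this helper is a hypothetical illustration, not a library function.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']
```

An OOV word that shares fragments such as `her` or `ere` with training words can then be assigned a vector built from those fragments' trained vectors.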

However, the FastText process does not extract these subword vectors from final full-word vectors. Thus there's no simple way to turn full-word vectors into a FastText model that also includes subword vectors.

There might be a workable way to approximate the same effect, for example by taking all known words that share a subword fragment and extracting some common average/vector component to assign to that subword. Or by modeling OOV words as some average of in-vocabulary words that are a short edit distance from the OOV word. But these techniques wouldn't quite be FastText, just vaguely analogous to it, and how well they work, or could be made to work with tweaking, would be an experimental question. So it's not a matter of grabbing an off-the-shelf library.
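As a rough illustration of the first idea, the sketch below gives each character n-gram the average of the vectors of all known words that contain it, then averages the recognized n-grams of an OOV word. This is an experimental approximation, not FastText. All function names and the toy two-dimensional vectors are made up for the example, and with a real model the `word_vectors` mapping would have to be filled from the existing full-word vectors.

```python
from collections import defaultdict


def char_ngrams(word, min_n=3, max_n=5):
    """Character n-grams of the word wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def build_ngram_vectors(word_vectors, min_n=3, max_n=5):
    """Assign each n-gram the mean vector of the known words containing it."""
    buckets = defaultdict(list)
    for word, vector in word_vectors.items():
        # set() so a word counts once per n-gram even if the n-gram repeats.
        for gram in set(char_ngrams(word, min_n, max_n)):
            buckets[gram].append(vector)
    return {gram: mean(vectors) for gram, vectors in buckets.items()}


def approximate_oov(word, ngram_vectors, min_n=3, max_n=5):
    """Average the vectors of the n-grams we recognize; None if there are none."""
    known = [ngram_vectors[gram]
             for gram in char_ngrams(word, min_n, max_n)
             if gram in ngram_vectors]
    return mean(known) if known else None


# Toy stand-in for real full-word vectors (hypothetical values).
word_vectors = {
    "walking": [1.0, 0.0],
    "talking": [0.0, 1.0],
}
ngram_vectors = build_ngram_vectors(word_vectors)

oov_vector = approximate_oov("stalking", ngram_vectors)
print(oov_vector)  # leans toward "talking", which shares more n-grams
print(approximate_oov("xyz", ngram_vectors))  # None
```

Whether vectors produced this way are useful downstream would need to be checked empirically, for instance by holding out some known words and comparing the approximations against their true vectors.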

See also this blog post by Sebastian Ruder.

If you need the FastText OOV functionality, the best-grounded approach would be to train FastText vectors from scratch on the same corpus as was used for your traditional full-word-vectors.
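A minimal sketch of that retraining route with gensim's own `FastText` class is shown below. It assumes gensim 4.x parameter names (`vector_size`, `epochs`; older 3.x releases used `size` and `iter`). The wrapper `train_fasttext`, the toy corpus, and the hyperparameter values are placeholders rather than recommendations.

```python
def train_fasttext(tokenized_sentences, vector_size=100):
    """Train FastText vectors from scratch (assumes gensim 4.x is installed)."""
    # Imported here so that the dependency is only needed when training runs.
    from gensim.models import FastText

    return FastText(
        sentences=tokenized_sentences,  # iterable of token lists
        vector_size=vector_size,
        window=5,
        min_count=1,   # keep every word; only sensible for a toy corpus
        min_n=3,       # shortest character n-gram
        max_n=6,       # longest character n-gram
        epochs=10,
    )


# Toy corpus standing in for the corpus the Word2Vec model was trained on.
corpus = [
    ["the", "dog", "is", "walking"],
    ["the", "cat", "is", "talking"],
]

if __name__ == "__main__":
    model = train_fasttext(corpus, vector_size=32)
    # "stalking" never appears in the corpus, so it is out of vocabulary...
    print("stalking" in model.wv.key_to_index)
    # ...but a vector can still be composed from its character n-grams.
    print(model.wv["stalking"].shape)
```

On a real corpus, `min_count`, `vector_size`, and `epochs` would normally be set to match the settings used for the original Word2Vec model so the two sets of vectors are comparable.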
