Combining/adding vectors from different word2vec models


Problem Description

I am using gensim to create Word2Vec models trained on large text corpora. I have some models based on StackExchange data dumps. I also have a model trained on a corpus derived from English Wikipedia.

Assume that a vocabulary term is in both models, and that the models were created with the same parameters to Word2Vec. Is there any way to combine or add the vectors from the two separate models to create a single new model that has the same word vectors that would have resulted if I had combined both corpora initially and trained on this data?

The reason I want to do this is that I want to be able to generate a model with a specific corpus, and then if I process a new corpus later, I want to be able to add this information to an existing model rather than having to combine corpora and retrain everything from scratch (i.e. I want to avoid reprocessing every corpus each time I want to add information to the model).

Are there built-in functions in gensim or elsewhere that will allow me to combine models like this, adding information to existing models instead of retraining?

Recommended Answer

Generally, only word vectors that were trained together are meaningfully comparable. (It's the interleaved tug-of-war during training that moves them to relative orientations that are meaningful, and there's enough randomness in the process that even models trained on the same corpus will vary in where they place individual words.)
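
As a quick illustration of that run-to-run randomness, here is a minimal sketch (assuming the gensim 4.x API; the toy corpus is purely illustrative):

```python
from gensim.models import Word2Vec

# Two runs on the *same* tiny corpus, differing only in the random seed.
sentences = [["human", "interface", "computer"],
             ["survey", "user", "computer", "system", "response"]]
m1 = Word2Vec(sentences, vector_size=20, min_count=1, seed=1, workers=1)
m2 = Word2Vec(sentences, vector_size=20, min_count=1, seed=2, workers=1)

# The raw coordinates for the same word differ between runs, so vectors
# from separately trained models cannot simply be mixed or added.
print(m1.wv["computer"][:5])
print(m2.wv["computer"][:5])
```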

Using words from both corpora as guideposts, it is possible to learn a transformation from one space A to the other space B that tries to move those known shared words to their corresponding positions in the other space. Then, applying that same transformation to the words in A that aren't in B, you can find B coordinates for those words, making them comparable to other native-B words.
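
A minimal sketch of that idea, using an ordinary least-squares linear map learned over the shared vocabulary (an orthogonal Procrustes solution is a common refinement; the model paths are hypothetical and gensim 4.x attribute names are assumed):

```python
import numpy as np
from gensim.models import KeyedVectors

wv_a = KeyedVectors.load("model_a.kv")  # hypothetical saved vector sets
wv_b = KeyedVectors.load("model_b.kv")

# Guidepost words present in both vocabularies.
shared = [w for w in wv_a.index_to_key if w in wv_b]

# Solve A @ W ≈ B for the linear map W over the paired guidepost vectors.
A = np.vstack([wv_a[w] for w in shared])
B = np.vstack([wv_b[w] for w in shared])
W, *_ = np.linalg.lstsq(A, B, rcond=None)

# Project an A-only word into B's space and compare it to native-B words.
only_in_a = next(w for w in wv_a.index_to_key if w not in wv_b)
print(wv_b.similar_by_vector(wv_a[only_in_a] @ W, topn=5))
```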

This technique has been used with some success in word2vec-driven language translation (where the guidepost pairs are known translations), or as a means of growing a limited word-vector set with word-vectors from elsewhere. Whether it would work well enough for your purposes, I don't know. I imagine it could go astray, especially where the two training corpora use shared tokens in wildly different senses.

There's a class, TranslationMatrix, that may be able to do this for you in the gensim library. See:

https://radimrehurek.com/gensim/models/translation_matrix.html

There's a demo notebook of its use at:

https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/translation_matrix.ipynb
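
A rough sketch of how that class might be applied here, treating identical words from the two same-language models as the known pairs (gensim 4.x assumed; the paths and the probe word are illustrative):

```python
from gensim.models import KeyedVectors
from gensim.models.translation_matrix import TranslationMatrix

wv_a = KeyedVectors.load("stackexchange.kv")  # hypothetical paths
wv_b = KeyedVectors.load("wikipedia.kv")

# For same-language models, the guidepost pairs are just shared words.
word_pairs = [(w, w) for w in wv_a.index_to_key if w in wv_b][:5000]

tm = TranslationMatrix(wv_a, wv_b, word_pairs=word_pairs)
tm.train(word_pairs)

# Map a source-space word into the target space; returns target-space candidates.
print(tm.translate(["computer"], topn=5))
```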

(Whenever practical, doing a full training on a mixed-together corpus, with all word examples, is likely to do better.)
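
For completeness, that baseline is just a single training run over the concatenated corpora, e.g. (toy stand-in corpora, gensim 4.x):

```python
from gensim.models import Word2Vec

# Stand-ins for the two tokenized corpora; real data would be far larger.
corpus_a = [["stack", "exchange", "question", "answer"]]
corpus_b = [["wikipedia", "article", "encyclopedia"]]

# One training pass over everything, so all words share a single vector space.
model = Word2Vec(corpus_a + corpus_b, vector_size=100, min_count=1, workers=4)
model.wv.save("combined.kv")
```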
