Semantic Similarity across multiple languages

Question

I am using word embeddings for finding similarity between two sentences. Using word2vec, I also get a similarity measure if one sentence is in English and the other one in Dutch (though not very good).

So I started wondering if it's possible to compute the similarity between two sentences in two different languages (without an explicit translation), especially if the languages have some similarities (e.g. English/Dutch)?

Recommended answer

Let's assume that your sentence-similarity scheme uses only word-vectors as an input – as in simple word-vector averaging schemes, or Word Mover's Distance.

It should be possible to do what you've suggested, provided that:

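  • you have good sets of word-vectors for each language's words
  • the coordinate spaces of the word-vectors are compatible, meaning that words for the very same things in both languages have almost the same coordinates (and other words with similar meanings have nearby coordinates)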

That second quality is not automatically assured. In fact, given the random initialization of word2vec models, and other randomization introduced by the algorithm/implementation, even subsequent training runs on the exact same data won't place words into the exact same places. So word-vectors trained on totally separate English/Dutch corpora are unlikely to place equivalent words at the same coordinates.
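
To make the point concrete, here is a minimal NumPy sketch. It does not train word2vec; it models a second training run as a random rotation of the first run's vectors, which is an idealisation chosen for illustration (real runs differ in messier ways). All names (`run_a`, `run_b`, `rotation`) are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(42)
dim = 50

# Three made-up "word vectors" from a first training run.
run_a = rng.normal(size=(3, dim))

# ASSUMPTION: a second run is modelled as the same vectors under a random
# orthogonal transform (Q factor of a random matrix).
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
run_b = run_a @ rotation

# Similarity between word 0 and word 1 is the same inside each space...
within_a = cosine(run_a[0], run_a[1])
within_b = cosine(run_b[0], run_b[1])

# ...but comparing word 0 in space A against word 0 in space B is meaningless.
across = cosine(run_a[0], run_b[0])

print(f"within A: {within_a:.3f}  within B: {within_b:.3f}  across: {across:.3f}")
```

Each space is internally consistent, yet the same word's vector in the two spaces is not comparable, which is why some alignment step is needed before cross-language comparison.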

But, you can learn an algebraic-transformation between two spaces, based on certain anchor/reference word-pairs (that you know should have similar vectors). You can then apply that transformation to all words in one of the two sets, which results in you having vectors for those 'foreign' words within the comparable coordinate-space of the 'canonical' word-set.
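
As a rough sketch of that step, the snippet below fits a linear map with ordinary least squares, which is one common way to learn such a transformation and not necessarily the exact procedure of any particular paper. The anchor vectors are synthetic, and the names (`learn_linear_map`, `foreign_anchors`, `canonical_anchors`) are hypothetical.

```python
import numpy as np

def learn_linear_map(src_anchors, tgt_anchors):
    """Return W minimising ||src_anchors @ W - tgt_anchors||^2.

    Row i of each array holds the vector of the i-th anchor word pair.
    """
    W, *_ = np.linalg.lstsq(src_anchors, tgt_anchors, rcond=None)
    return W

rng = np.random.default_rng(0)
dim, n_anchors = 5, 50

# Synthetic stand-ins: 'canonical' (say, English) anchor vectors, and the
# same words in a 'foreign' (say, Dutch) space that is a rotated copy.
canonical_anchors = rng.normal(size=(n_anchors, dim))
hidden_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
foreign_anchors = canonical_anchors @ hidden_rotation.T

# Learn the map from the anchor pairs only, then apply it.
W = learn_linear_map(foreign_anchors, canonical_anchors)
mapped = foreign_anchors @ W

max_error = float(np.abs(mapped - canonical_anchors).max())
print(f"largest coordinate error after mapping: {max_error:.2e}")
```

With real embeddings the fit will not be exact, so it is worth holding out some known word pairs to check how well the learned map generalises beyond the anchors.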

In fact this very idea was used in one of the first word2vec papers:

"Exploiting Similarities among Languages for Machine Translation"

If you were to apply a similar transformation on one of your language word-vector sets, then use those transformed vectors as inputs to your sentence-vector scheme, those sentence-vectors would likely have some useful comparability to sentence-vectors in the other language, bootstrapped from word-vectors in the same coordinate-space.
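
An end-to-end toy sketch of this, assuming simple word-vector averaging as the sentence scheme. The tiny 3-dimensional vectors are made up, the Dutch space is constructed as an axis-permuted copy of the English one so that an exact map exists, and every name here is hypothetical.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sentence_vector(tokens, vectors):
    """Average the word vectors of the tokens (simple averaging scheme)."""
    return np.mean([vectors[t] for t in tokens], axis=0)

# Made-up English vectors.
en = {
    "cat": np.array([1.0, 0.2, 0.0]),
    "dog": np.array([0.9, 0.3, 0.1]),
    "sleeps": np.array([0.0, 1.0, 0.2]),
    "eats": np.array([0.1, 0.8, 0.9]),
}
translations = {"kat": "cat", "hond": "dog", "slaapt": "sleeps", "eet": "eats"}

# ASSUMPTION: the Dutch space is the English one with its axes permuted.
permute = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
nl_raw = {nl_word: en[en_word] @ permute for nl_word, en_word in translations.items()}

# Learn the map from three anchor pairs; "eet"/"eats" is deliberately held out.
anchors = ["kat", "hond", "slaapt"]
src = np.stack([nl_raw[w] for w in anchors])
tgt = np.stack([en[translations[w]] for w in anchors])
W, *_ = np.linalg.lstsq(src, tgt, rcond=None)

# Apply the map to every Dutch word, anchors or not.
nl_mapped = {w: v @ W for w, v in nl_raw.items()}

en_sentence = sentence_vector(["dog", "eats"], en)
sim_before = cosine(en_sentence, sentence_vector(["hond", "eet"], nl_raw))
sim_after = cosine(en_sentence, sentence_vector(["hond", "eet"], nl_mapped))
print(f"before mapping: {sim_before:.3f}  after mapping: {sim_after:.3f}")
```

Because the toy spaces differ only by an exact linear map, the mapped similarity comes out essentially perfect; with real English and Dutch embeddings you should expect a noisier but still useful signal.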

Update: There's a very interesting recent paper that manages to train word-vectors in multiple languages simultaneously, using a corpus that includes both raw sentences in each single language, and a (smaller) set of aligned-sentences that are known to mean the same in both languages. Gensim doesn't yet support this mode, but there's discussion of supporting it in a future refactor.
