Word embeddings for the same word from two different texts


Problem description


If I calculate word2vec for the same word (say, "monkey"), one time on the basis of a large text from the year 1800 and another time on the basis of a large text from the year 2000, then the results are, from my point of view, not comparable. Am I right? And why is it so? I have the following idea: the text from the past may have a completely different vocabulary, which is the problem. But how can one then cure it (make the embeddings comparable)?

Thanks in advance.

Solution

There's no "right" position for any word in a Word2Vec model – just a position that works fairly well, in relation to other words and the training data, after a bunch of the pushes-and-pulls of the incremental training. Indeed, every model starts with word-vectors in low-magnitude random positions, and the training itself includes both designed-in randomness (such as via random choice of which words to use as negative contrastive examples) and execution-order randomness (as multiple threads make progress at slightly-different rates due to the operating system's somewhat-arbitrary CPU-scheduling choices).

So, your "sentences-from-1800" and "sentences-from-2000" models will differ because the training data is different – likely both because authors' usage varied, and because each corpus is just a tiny sample of all existing usage. But also: just training on the "sentences-from-1800" corpus twice in a row will result in different models! Each such model should be about as good as the other, in terms of the relative distances/positions of words with respect to other words in the same model. But the coordinates of individual words could be very different, and non-comparable.
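To see this concretely, here is a minimal sketch – assuming the gensim library (4.x API) and a hypothetical toy corpus – that trains the same configuration twice and contrasts the raw coordinates with the within-model neighborhoods:

```python
from gensim.models import Word2Vec
import numpy as np

# Hypothetical toy corpus standing in for "sentences-from-1800";
# in practice you would pass your real tokenized sentences.
corpus = [["the", "monkey", "climbed", "the", "tall", "tree"],
          ["a", "cat", "sat", "on", "the", "warm", "mat"],
          ["the", "monkey", "ate", "a", "ripe", "banana"]] * 200

# Two runs with identical settings (gensim 4.x API).
m1 = Word2Vec(corpus, vector_size=50, min_count=1, workers=4, epochs=5)
m2 = Word2Vec(corpus, vector_size=50, min_count=1, workers=4, epochs=5)

# Raw coordinates differ between runs: multi-threaded ordering alone
# is enough to perturb them, even with the same seed.
print(np.allclose(m1.wv["monkey"], m2.wv["monkey"]))  # almost certainly False

# ...but each model's *internal* neighborhoods are comparably sensible.
print(m1.wv.most_similar("monkey", topn=2))
print(m2.wv.most_similar("monkey", topn=2))
```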

In order for words to be "in the same coordinate space", extra steps must be taken. The most direct way to get words into the same space is to train them together in the same model, where they appear alternately in contrasting examples of usage, including alongside the same other common words.

So if, for example, you needed to compare 'calenture' (an old word for tropical fevers, which might not appear in your 2000s corpus) to 'penicillin' (which was discovered in the 20th century), your best bet would be to shuffle the two corpora together into a single corpus and train a single model. To the extent that each word appears near context words that occur in both eras with relatively stable meanings, their word-vectors might then be comparable.
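A minimal sketch of that combine-and-train approach, assuming gensim 4.x; the two corpora here are tiny hypothetical stand-ins for your real tokenized texts:

```python
import random
from gensim.models import Word2Vec

# Hypothetical stand-ins; in practice these are your full tokenized corpora.
corpus_1800 = [["the", "sailor", "raved", "with", "calenture", "at", "sea"]] * 100
corpus_2000 = [["the", "doctor", "prescribed", "penicillin", "for", "the", "fever"]] * 100

# Shuffle the eras together so training interleaves examples from both.
combined = corpus_1800 + corpus_2000
random.shuffle(combined)

model = Word2Vec(combined, vector_size=100, min_count=1, workers=4, epochs=5)

# Both words now live in one coordinate space, so a direct comparison is at
# least meaningful; how *good* it is depends on shared, stable context words.
print(model.wv.similarity("calenture", "penicillin"))
```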

If you only need one combined word-vector for 'monkey', this approach may be fine for your purposes as well. Yes, a word's meaning drifts over time. But even at any single point in time, words are polysemous: they have multiple meanings. And word-vectors for words with many meanings tend to move to coordinates between each of their alternate meanings. So even if 'monkey' has drifted in meaning, it is still the case that using a combined-eras corpus would probably give you a single vector for 'monkey' that reasonably represents its average meaning over all eras.

If you specifically wanted to model words' changes-in-meaning over time, then you might need other approaches:

  • You might want to build separate models for the eras, but learn translations between them, based on the idea that some words may change little while others change a lot. (There are ways to use certain "anchor words", assumed to have the same meaning across eras, to learn a transformation between separate Word2Vec models, then apply that same transformation to other words to project their coordinates into the other model's space. A minimal alignment sketch follows this list.)

  • Or, make a combined model, but probabilistically replace words whose changing meanings you'd like to track with era-specific alternate tokens. (For example, you might replace some proportion of 'monkey' occurrences with 'monkey@1800' and 'monkey@2000', as appropriate, so that in the end you get three word-vectors, for 'monkey', 'monkey@1800', and 'monkey@2000', allowing you to compare the different senses. A sketch of this preprocessing also follows the list.)
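For the first approach, a common technique (the one used by the HistWords project linked below) is orthogonal Procrustes alignment: learn a rotation from the anchor-word pairs and apply it to every other word. A minimal numpy sketch, where `m1800`, `m2000`, and `anchors` are hypothetical names for your two trained models and your anchor-word list:

```python
import numpy as np

def learn_alignment(src_model, tgt_model, anchors):
    """Learn an orthogonal matrix R such that R @ source_vector lands near the
    target model's coordinates, fit on anchor words assumed stable in meaning."""
    A = np.stack([src_model.wv[w] for w in anchors])  # anchor rows, source space
    B = np.stack([tgt_model.wv[w] for w in anchors])  # anchor rows, target space
    # Orthogonal Procrustes: the SVD of the cross-covariance gives the rotation
    # minimizing the total distance between mapped anchors and their targets.
    u, _, vt = np.linalg.svd(B.T @ A)
    return u @ vt

# Hypothetical usage with era-specific models m1800 and m2000:
# R = learn_alignment(m1800, m2000, anchors)
# projected = R @ m1800.wv["monkey"]               # 1800 'monkey' in 2000 space
# print(m2000.wv.similar_by_vector(projected, topn=5))
```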
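And for the second approach, a sketch of the probabilistic token replacement; the `monkey@1800`-style tags and the 50% replacement rate are illustrative choices, not fixed requirements:

```python
import random

def tag_era(sentences, word, era, replace_prob=0.5):
    """Replace a fraction of `word` occurrences with an era-tagged variant, so a
    combined model learns a shared vector plus era-specific ones."""
    tagged = []
    for sent in sentences:
        tagged.append([f"{word}@{era}" if w == word and random.random() < replace_prob
                       else w
                       for w in sent])
    return tagged

# Hypothetical usage: tag each era's corpus, then train one model on the union.
# combined = tag_era(corpus_1800, "monkey", "1800") + tag_era(corpus_2000, "monkey", "2000")
# model = Word2Vec(combined, vector_size=100, min_count=5, epochs=5)
# model.wv.similarity("monkey@1800", "monkey@2000")  # compare era-specific senses
```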

Some prior work on tracking meanings-over-time using word-vectors is the 'HistWords' project:

https://nlp.stanford.edu/projects/histwords/
