How to calculate the distance between 2 node2vec models


Problem description

I have 2 node2vec models from different timestamps, and I want to calculate the distance between the 2 models. Both models have the same vocab, and we update the models over time.

My models look like this:

model1:
    "1":0.1,0.5,...
    "2":0.3,-0.4,...
    "3":0.2,0.5,...
    .
    .
    .
model2:
    "1":0.15,0.54,...
    "2":0.24,-0.35,...
    "3":0.24,0.47,...
    .
    .
    .

Recommended answer

Assuming you've used a standard word2vec library to train your models, each run bootstraps a wholly separate model whose coordinates are not necessarily comparable to those of any other model.

(Due to some inherent randomness in the algorithm, or in the multi-threaded handling of training input, even running two training sessions on the exact same data will result in different models. They should each be about as useful for downstream applications, but individual tokens could be in arbitrarily different positions.)
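
To make that concrete, here is a small sketch in plain Python. It assumes a toy 2-D "model" with made-up values, and it uses a simple rotation as a stand-in for the difference between two training runs (real runs differ by more than a rotation, so treat this only as an illustration). The helper names `cosine` and `rotate_2d` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rotate_2d(v, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    x, y = v
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

# Toy 2-D "model" (made-up values, not real node2vec output).
model_a = {"1": (0.1, 0.5), "2": (0.3, -0.4), "3": (0.2, 0.5)}

# Same geometry, arbitrarily rotated: an assumed stand-in for a second run.
model_b = {tok: rotate_2d(vec, 2.0) for tok, vec in model_a.items()}

# Comparing raw coordinates of the same token across models is misleading...
same_token_across_models = cosine(model_a["1"], model_b["1"])

# ...while similarities measured inside each model agree with each other.
within_a = cosine(model_a["1"], model_a["3"])
within_b = cosine(model_b["1"], model_b["3"])

print(same_token_across_models, within_a, within_b)
```

Token "1" looks like it moved a lot if you compare coordinates directly, yet its relationship to token "3" is unchanged, which is why the measures suggested here are built from within-model similarities.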

That said, you could try to synthesize some measures of how much the two models differ. For example, you might:

  • Pick a bunch of random (or domain-significant) word-pairs. Check the similarity between each pair, in each model individually, then compare those values between models. (That is, compare model1.similarity(token_a, token_b) with model2.similarity(token_a, token_b).) Consider the difference between the models as some weighted combination of all the tested similarity-differences.

  • For some significant set of relevant tokens, collect the top-N most-similar tokens in each model. Compare these lists via some sort of rank-correlation measure, to see how much one model has changed the 'neighborhoods' of each token.
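
A minimal sketch of both ideas, assuming each model is just a dict mapping token to vector (as in the question) rather than a library object. The toy vectors are the first two components shown in the question, truncated for illustration. The function names are hypothetical, the pair weights are assumed equal, and the neighbor comparison uses a plain Jaccard set overlap as a simpler stand-in for a true rank-correlation measure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_shift(model1, model2, pairs):
    """Mean absolute change in pair similarity (equal weights assumed)."""
    diffs = [abs(cosine(model1[a], model1[b]) - cosine(model2[a], model2[b]))
             for a, b in pairs]
    return sum(diffs) / len(diffs)

def top_n(model, token, n):
    """The n tokens most similar to `token` within `model`, best first."""
    others = [t for t in model if t != token]
    others.sort(key=lambda t: cosine(model[token], model[t]), reverse=True)
    return others[:n]

def neighbor_overlap(model1, model2, token, n):
    """Jaccard overlap of the top-n neighbor sets (not a rank correlation)."""
    a = set(top_n(model1, token, n))
    b = set(top_n(model2, token, n))
    return len(a & b) / len(a | b)

# First two components from the question's example, truncated for illustration.
model1 = {"1": (0.1, 0.5), "2": (0.3, -0.4), "3": (0.2, 0.5)}
model2 = {"1": (0.15, 0.54), "2": (0.24, -0.35), "3": (0.24, 0.47)}

pairs = [("1", "2"), ("1", "3")]
shift = similarity_shift(model1, model2, pairs)
overlap = neighbor_overlap(model1, model2, "1", n=1)
print(shift, overlap)
```

If the order inside the top-N lists matters to you, swapping the Jaccard overlap for a real rank-correlation statistic would be closer to what is described above.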

For each of these, I'd suggest verifying their operation against a baseline case: the exact same training data, shuffled and/or trained with a different starting random seed. Do the measures show such models as being "nearly equivalent"? If not, you'd need to adjust the training parameters or the synthetic measure until you get the expected result - that models from the same data are judged as alike, even though their tokens have very different coordinates.

Another option might be to train one giant combined model from a synthetic corpus where:

  • all the original unmodified 'texts' from both eras appear once
  • texts from each separate era appear again, but with some random proportion of their tokens modified with an era-specific modifier. (For example, 'foo' sometimes becomes 'foo_1' in first-era texts, and sometimes becomes 'foo_2' in second-era texts. You don't want to convert all tokens in any one text to era-specific tokens, because only tokens that co-appear with each other influence each other. You thus want tokens from either era to sometimes appear with common/shared variants, but also often appear with era-specific variants.)
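
The corpus construction described in the two bullets could be sketched like this. It assumes each 'text' is a list of string tokens (for node2vec these would presumably be the random walks over the graph). The function names, the `_1`/`_2` suffixes as concrete modifiers, and the default `proportion` of 0.5 are all illustrative assumptions.

```python
import random

def tag_tokens(text, era_suffix, proportion, rng):
    """Copy `text`, appending `era_suffix` to a random share of its tokens."""
    return [tok + era_suffix if rng.random() < proportion else tok
            for tok in text]

def build_combined_corpus(era1_texts, era2_texts, proportion=0.5, seed=0):
    """Build the synthetic corpus: originals once, then era-tagged copies."""
    rng = random.Random(seed)
    corpus = []
    # 1) Every original text from both eras, unmodified, once.
    corpus.extend(list(t) for t in era1_texts)
    corpus.extend(list(t) for t in era2_texts)
    # 2) Every text again, with a random proportion of tokens era-tagged.
    corpus.extend(tag_tokens(t, "_1", proportion, rng) for t in era1_texts)
    corpus.extend(tag_tokens(t, "_2", proportion, rng) for t in era2_texts)
    return corpus

# Toy walks over nodes "1", "2", "3" (made-up data).
era1 = [["1", "2", "3"], ["2", "3", "1"]]
era2 = [["1", "3", "2"]]

corpus = build_combined_corpus(era1, era2, proportion=0.5, seed=42)
for text in corpus:
    print(text)
```

The resulting list of token lists can then be fed to whatever word2vec-style trainer you already use, in place of the per-era corpora.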

At the end, the original token 'foo' will get three vectors: 'foo', 'foo_1', and 'foo_2'. They should all be quite similar, but the era-specific variants will be relatively more influenced by the era-specific contexts. Thus the differences between those three (and relative movement in the now-common coordinate space) will be an indication of the magnitude and kinds of changes that happened between the two eras' data.
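
One way to read those differences off the combined model, as a rough sketch: score each token by the cosine distance between its two era-specific variants and rank tokens by that score. The `combined` dict below holds made-up vectors standing in for a trained combined model, and `era_drift` is a hypothetical helper name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def era_drift(vectors, token):
    """Cosine distance between the '_1' and '_2' variants of `token`."""
    return 1.0 - cosine(vectors[token + "_1"], vectors[token + "_2"])

# Made-up vectors standing in for a trained combined model.
combined = {
    "foo": (0.5, 0.5), "foo_1": (0.6, 0.4), "foo_2": (0.4, 0.6),
    "bar": (0.9, 0.1), "bar_1": (0.9, 0.1), "bar_2": (0.1, 0.9),
}

# Tokens whose era variants sit furthest apart come first.
ranked = sorted(["foo", "bar"], key=lambda t: era_drift(combined, t), reverse=True)
print(ranked)
```

Because all three variants live in one coordinate space here, comparing their vectors directly is meaningful, unlike comparing vectors across two separately trained models.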
