Getting different results from deeplearning4j and word2vec


Problem description

I trained a word embedding model using Google's word2vec. The output is a file that contains a word and its vector.

I loaded this trained model in deeplearning4j.

    // load the vectors produced by word2vec (text format) into DL4J
    WordVectors vec = WordVectorSerializer.loadTxtVectors(new File("vector.txt"));
    // query the 10 words nearest to "someWord"
    Collection<String> lst = vec.wordsNearest("someWord", 10);

But the two lists of similar words obtained from deeplearning4j's package and word2vec's distance function are totally different, although I used the same vector file.

Does anyone have a good understanding of how things work in deeplearning4j and where these differences are coming from?

Recommended answer

Are the lists similar at all? Does either set seem more reasonable as similar words?

By my understanding, the lists should match almost exactly - they should be implementing the same calculation on the same input vectors. If they don't, and especially if the original word2vec.c similar-list looks more reasonable, then I would suspect a bug in DL4J.
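For reference, the calculation both tools should be performing is a ranking of the vocabulary by cosine similarity to the query word's vector. A minimal sketch in plain Java (made-up vectors, no DL4J or nd4j dependency; the class and method names are illustrative):

```java
import java.util.*;

public class CosineCheck {
    // cosine similarity: dot(a, b) / (|a| * |b|)
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // rank all vocabulary words by descending cosine similarity to the query
    static List<String> nearest(double[] query, Map<String, double[]> vocab, int n) {
        List<String> words = new ArrayList<>(vocab.keySet());
        words.sort((w1, w2) -> Double.compare(
                cosine(query, vocab.get(w2)), cosine(query, vocab.get(w1))));
        return words.subList(0, Math.min(n, words.size()));
    }
}
```

Running a check like this over the same vector.txt contents gives a neutral third ranking: if the word2vec.c distance output matches it and DL4J's does not, that points at the DL4J side.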

Looking at the calculation method – https://github.com/deeplearning4j/deeplearning4j/blob/f943ea879ab362f66b57b00754b71fb2ff3677a1/deeplearningp/deeplearning/deeplearning/deeplearning4j/models/embeddings/wordvectors/WordVectorsImpl.java#L385 :

  • the code for the if (lookupTable() instanceof InMemoryLookupTable) {...} branch may be correct – I'm not familiar with the nd4j API – but almost seems too ornate for the calculation of ranked cosine-similarity values;
  • the fallback case that follows does not appear to use unit-vector normalized vector values (as would be usual) – it uses getWordVectorMatrix() instead of getWordVectorMatrixNormalized()
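To illustrate why the second point matters: ranking by raw dot product on unnormalized vectors can prefer a long vector over a better-aligned short one, whereas cosine similarity (equivalently, dot product on unit-normalized vectors) does not. A toy demonstration with made-up numbers, not DL4J code:

```java
public class NormalizationEffect {
    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    // cosine similarity = dot product after normalizing both vectors to unit length
    static double cosine(double[] a, double[] b) {
        return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
    }

    public static void main(String[] args) {
        double[] query   = {1.0, 0.0};
        double[] aligned = {0.9, 0.1};  // almost parallel to the query, small magnitude
        double[] longer  = {2.0, 2.0};  // 45 degrees away, but much larger magnitude
        // raw dot product ranks the longer vector first...
        System.out.println(dot(query, longer) > dot(query, aligned));       // true
        // ...while cosine similarity ranks the aligned vector first
        System.out.println(cosine(query, aligned) > cosine(query, longer)); // true
    }
}
```

So if one code path normalizes and the other does not, the two nearest-word lists can legitimately diverge even on identical input vectors.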
