Pairwise Earth Mover Distance across all documents (word2vec representations)


Question

Is there a library that will take a list of documents and en masse compute the n×n matrix of distances - where the word2vec model is supplied? I can see that gensim allows you to do this between two documents - but I need a fast comparison across all docs, like sklearn's cosine_similarity.
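(For reference, sklearn.metrics.pairwise.cosine_similarity produces the kind of all-pairs matrix asked about here in a single call; the TF-IDF features in this sketch are only a stand-in representation for illustration.)

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus and TF-IDF features, purely to illustrate the desired interface.
texts = ["the cat sat on the mat",
         "a dog lay on the rug",
         "stocks fell sharply today"]
X = TfidfVectorizer().fit_transform(texts)  # n_docs x n_features (sparse)
S = cosine_similarity(X)                    # n_docs x n_docs similarity matrix
print(np.round(S, 2))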

Answer

The "Word Mover's Distance" (earth-mover's distance applied to groups of word-vectors) is a fairly involved optimization calculation dependent on every word in each document.

I'm not aware of any tricks that would help it go faster when calculating many at once – even many distances to the same document.

So the only thing needed to calculate pairwise distances is a pair of nested loops that considers each (order-ignoring, unique) pairing.

For example, assuming your list of documents (each a list of words) is docs, your gensim word-vector model is model, and numpy is imported as np, you could calculate the array of pairwise distances D with:

D = np.zeros((len(docs), len(docs)))
for i in range(len(docs)):
    for j in range(len(docs)):
        if i == j:
            continue  # self-distance is 0.0
        if i > j:
            D[i, j] = D[j, i]  # re-use earlier calc: WMD is symmetric
            continue           # skip the redundant recomputation
        D[i, j] = model.wmdistance(docs[i], docs[j])

It may take a while, but you'll then have all pairwise distances in array D.
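As a small usage sketch, plain numpy is then enough to query the matrix, for example to find each document's nearest neighbour by WMD:

import numpy as np

# Mask the zero self-distances on the diagonal before looking for neighbours.
masked = D + np.diag(np.full(len(D), np.inf))
nearest = masked.argmin(axis=1)  # index of the closest other document for each doc
for i, j in enumerate(nearest):
    print(f"doc {i} is closest to doc {j} (WMD = {D[i, j]:.3f})")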

