Similarity between two lists of documents

Question

I need to compute the similarity between two lists of short texts in Python. Each text is 1-4 words long, and each list can contain up to 10K items. I couldn't find an efficient way to do this in spaCy. Maybe another package can do it? I assume the words are represented by 300-dimensional vectors, but any other option is fine too. The task can be done in a loop, but surely there is a more efficient way. It seems suited to TensorFlow, PyTorch, and similar packages, but I'm not familiar with the details of those packages.

Recommended answer

I think your question is ambiguous: you might mean a single similarity score comparing the average of list 1 with the average of list 2. I'm assuming instead that you want a similarity score for every pair of items drawn from the two lists. For 10K items per list, that produces 10K × 10K = 100M similarity scores.

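import spacy

# The original answer used spacy.load('en'); spaCy 3 removed that shortcut.
# 'en_core_web_md' is an assumed substitute that ships with word vectors.
spacyModel = spacy.load('en_core_web_md')

list1 = ["hello, example 1", "right, second example"]
list2 = ["hello, example 1 in the second list", "And now for something completely different"]

# Parse each text once so its vector is reused across all comparisons
list1SpacyDocs = [spacyModel(x) for x in list1]
list2SpacyDocs = [spacyModel(x) for x in list2]

# The outer loop runs over list2, so similarityMatrix[j][i] compares list2[j] with list1[i]
similarityMatrix = [[x.similarity(y) for x in list1SpacyDocs] for y in list2SpacyDocs]

print(similarityMatrix)
# Output reported in the original answer (exact values depend on the model and version):
# [[0.8537950408055295, 0.8852732956832498], [0.5802435148988874, 0.7643245611465626]]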
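
The question also asks for something faster than a Python loop. One possible approach, not part of the original answer, is to stack the document vectors into two NumPy arrays and compute every cosine similarity with a single matrix product. The sketch below is a minimal illustration under that assumption: `cosine_similarity_matrix` is a hypothetical helper name, and the toy 3-d vectors stand in for real 300-d vectors, which with spaCy could presumably be collected as `np.array([doc.vector for doc in docs])`.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Return a (len(a), len(b)) matrix of cosine similarities between rows of a and b."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    # Row lengths; keepdims keeps the shape (n, 1) so the division broadcasts per row
    a_norm = np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = np.linalg.norm(b, axis=1, keepdims=True)
    # Guard against all-zero vectors (e.g. out-of-vocabulary texts) to avoid dividing by zero
    a_unit = a / np.where(a_norm == 0, 1, a_norm)
    b_unit = b / np.where(b_norm == 0, 1, b_norm)
    # The dot product of unit vectors is the cosine similarity
    return a_unit @ b_unit.T


# Toy 3-d vectors standing in for 300-d document vectors (illustrative values only)
vectors1 = np.array([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
vectors2 = np.array([[2.0, 0.0, 0.0], [0.0, 0.0, 5.0], [0.0, 3.0, 0.0]])

sim = cosine_similarity_matrix(vectors1, vectors2)
print(sim.shape)  # (2, 3)
```

Here `sim[i][j]` compares item `i` of the first list with item `j` of the second, which is the transpose of the nested list comprehension above. For two lists of 10K items, the full matrix holds 100M float32 values, roughly 400 MB, so processing the first list in slices may be preferable when memory is tight.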
