Calculating similarity measure between millions of documents


Problem description

I have millions of documents (close to 100 million), and each document has fields such as skills, hobbies, certification and education. I want to find the similarity between every pair of documents, along with a score.

Below is a sample of the data.

skills  hobbies        certification    education
Java    fishing        PMP              MS
Python  reading novel  SCM              BS
C#      video game     PMP              B.Tech.
C++     fishing        PMP              MS

What I want is the similarity between the first row and all other rows, between the second row and all other rows, and so on. In other words, every document should be compared against every other document to get a similarity score.

The purpose is that I query my database to get people based on skills. In addition to that, I now also want people who, even though they do not have the skill, are a close match to the people who do. For example, if I wanted to get data for people who have the JAVA skill, the first row would appear, and the last row would appear as well, since it is similar to the first row based on the similarity score.

Challenge: My primary challenge is to compute a similarity score for each document against every other document, as in the pseudocode below. How can I do this faster? Is there a different way to do this than this pseudocode, or is there some other computational (hardware/algorithm) approach to speed it up?

# Naive approach: every document is compared with every other document, in both directions.
document = all_document_in_db
for i in document:
    for j in document:
        if i != j:
            compute_similarity(i, j)

Recommended answer

One way to speed this up is to make sure you don't calculate each similarity both ways. Your current pseudocode compares i to j and also j to i. Instead of iterating j over the whole document set, iterate over document[i+1:], i.e. only the entries after i. This cuts your calls to compute_similarity in half.
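A minimal sketch of that change in Python, assuming document is an indexable sequence and compute_similarity is your existing scoring function:

# Iterate j only over entries after i, so each unordered pair is scored exactly once.
document = all_document_in_db
for i in range(len(document)):
    for j in range(i + 1, len(document)):
        compute_similarity(document[i], document[j])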

The most suitable data structure for this kind of comparison is an adjacency matrix. This is an n * n matrix (n is the number of members in your data set), where matrix[i][j] is the similarity between members i and j. You can populate this matrix fully while still iterating j over only half the range, by assigning matrix[i][j] and matrix[j][i] simultaneously with a single call to compute_similarity.
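A sketch of populating such a matrix with NumPy, under the same assumptions about document and compute_similarity:

import numpy as np

n = len(document)
matrix = np.zeros((n, n))      # adjacency matrix of pairwise similarity scores

for i in range(n):
    matrix[i][i] = 1.0         # assumed: a document is maximally similar to itself
    for j in range(i + 1, n):
        score = compute_similarity(document[i], document[j])
        matrix[i][j] = score   # one call fills both halves of the symmetric matrix
        matrix[j][i] = score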

Beyond this, I can't think of a way to speed up the process itself; you will need at least n * (n - 1) / 2 calls to compute_similarity. Think of it as a handshake problem: if every member must be compared with ('shake hands with') every other member at least once, the lower bound is n * (n - 1) / 2. But I welcome other input!
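For a sense of scale, the lower bound can be computed directly from the 100-million figure given in the question:

n = 100_000_000            # number of documents, per the question
calls = n * (n - 1) // 2   # handshake lower bound on compute_similarity calls
print(calls)               # 4999999950000000, i.e. roughly 5e15 comparisons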
