Calculating similarity measure between millions of documents


Problem description

I have millions of documents (close to 100 million), and each document has fields such as skills, hobbies, certification and education. I want to find the similarity between every pair of documents, along with a score.

Below is a sample of the data.

skills  hobbies        certification    education
Java    fishing        PMP              MS
Python  reading novel  SCM              BS
C#      video game     PMP              B.Tech.
C++     fishing        PMP              MS

What I want is the similarity between the first row and all other rows, between the second row and all other rows, and so on. In other words, every document should be compared against every other document to get a similarity score.

The purpose is that I query my database to get people based on skills. In addition to that, I now also want people who, even though they do not have the skill, are a close match to the people who do. For example, if I wanted to get data for people who have the JAVA skill, the first row would appear, and the last row would appear as well, since it is similar to the first row based on the similarity score.

Challenge: My primary challenge is to compute a similarity score for each document against every other document, as in the pseudocode below. How can I do this faster? Is there a different way to do this than this pseudocode, or is there some other computational (hardware/algorithm) approach to speed it up?

# Naive approach: every document is compared with every other document, in both directions.
document = all_document_in_db
for i in document:
    for j in document:
        if i != j:
            compute_similarity(i, j)

Recommended answer

One way to speed this up is to make sure you don't calculate each similarity both ways. Your current pseudocode compares i to j and also j to i. Instead of iterating j over the whole document set, iterate over document[i+1:], i.e. only the entries after i. This cuts your calls to compute_similarity in half.
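A minimal sketch of that change in Python, assuming document is an indexable sequence and compute_similarity is your existing scoring function:

# Iterate j only over entries after i, so each unordered pair is scored exactly once.
document = all_document_in_db
for i in range(len(document)):
    for j in range(i + 1, len(document)):
        compute_similarity(document[i], document[j])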

The most suitable data structure for this kind of comparison is an adjacency matrix. This is an n * n matrix (n is the number of members in your data set), where matrix[i][j] is the similarity between members i and j. You can populate this matrix fully while still iterating j over only half the range, by assigning matrix[i][j] and matrix[j][i] simultaneously with a single call to compute_similarity.
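A sketch of populating such a matrix with NumPy, under the same assumptions about document and compute_similarity:

import numpy as np

n = len(document)
matrix = np.zeros((n, n))      # adjacency matrix of pairwise similarity scores

for i in range(n):
    matrix[i][i] = 1.0         # assumed: a document is maximally similar to itself
    for j in range(i + 1, n):
        score = compute_similarity(document[i], document[j])
        matrix[i][j] = score   # one call fills both halves of the symmetric matrix
        matrix[j][i] = score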

Beyond this, I can't think of a way to speed up the process itself; you will need at least n * (n - 1) / 2 calls to compute_similarity. Think of it as a handshake problem: if every member must be compared with ('shake hands with') every other member at least once, the lower bound is n * (n - 1) / 2. But I welcome other input!
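For a sense of scale, the lower bound can be computed directly from the 100-million figure given in the question:

n = 100_000_000            # number of documents, per the question
calls = n * (n - 1) // 2   # handshake lower bound on compute_similarity calls
print(calls)               # 4999999950000000, i.e. roughly 5e15 comparisons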
