Fuzzy duplicate search with ElasticSearch


Question

I have a rather big dataset of N documents, with fewer than 1% of them being near-duplicates, which I want to identify. I have many numeric fields and a few text fields. I consider two documents in the data set close if...

  1. all but one, two, or three data fields are fully identical, or
  2. the corresponding text fields of the two documents are only a few edits apart (that is the Levenshtein distance used by ElasticSearch; a minimal query sketch follows this list).
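For criterion (2), ElasticSearch expresses a bounded edit distance through the fuzziness parameter of a match query. A minimal sketch, assuming the 8.x elasticsearch-py client and hypothetical index and field names (messydata, text_field):

# Minimal sketch: fuzzy match on one text field. The index name
# "messydata" and the field "text_field" are hypothetical placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="messydata",
    query={
        "match": {
            "text_field": {
                "query": "text of a probe document",
                # "AUTO" lets ElasticSearch pick a Levenshtein edit
                # distance of 0-2 depending on term length
                "fuzziness": "AUTO",
            }
        }
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_id"])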

How would you approach this challenge of identifying fuzzy duplicates with ElasticSearch?

I already struggle to write a (general) ElasticSearch query for part (1) that does not explicitly use the field names. Do I really have to build a huge query of the following pattern, or is there a smarter way?

( SELECT * FROM MessyData AS T1
  JOIN MessyData AS T2
  WHERE T1.F1 != T2.F1 AND T1.F2 = T2.F2 AND T1.F3 = T2.F3 AND ... )
UNION ALL
( SELECT * FROM MessyData AS T1
  JOIN MessyData AS T2
  WHERE T1.F1 = T2.F1 AND T1.F2 != T2.F2 AND T1.F3 = T2.F3 AND ... )
UNION ALL
( SELECT * FROM MessyData AS T1
  JOIN MessyData AS T2
  WHERE T1.F1 = T2.F1 AND T1.F2 = T2.F2 AND T1.F3 != T2.F3 AND ... )
UNION ALL
( ... )

Note: I used SQL pseudocode to show what I mean for the case where all fields except one are identical. F stands for field and T for table; in ElasticSearch it would be an index.
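One way to avoid that combinatorial union is a single bool query: take the field values of one probe document as should clauses and require all but a few of them to match via minimum_should_match. A minimal sketch, assuming keyword-mapped data fields and hypothetical field names f1..f4:

# Sketch: "all but at most 3 data fields are identical" expressed as
# one bool query. The probe document and its field names are
# hypothetical placeholders.
probe = {"f1": 42, "f2": "red", "f3": 7, "f4": "2019-01-01"}
max_mismatches = 3  # tolerate up to 3 differing data fields

query = {
    "bool": {
        "should": [
            {"term": {field: value}} for field, value in probe.items()
        ],
        # at least (number of fields - max_mismatches) clauses must match
        "minimum_should_match": len(probe) - max_mismatches,
    }
}

This still means one query per probe document, but it collapses the union of per-field variants into a single request.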

Calculating dendrograms, or using another similarity measure that compares each document with every other one, gives me a computational effort of N·(N-1) and is therefore not feasible.

The approach I am considering for part (2) is to probe my data set with m test documents (where m is much smaller than N) and add up ElasticSearch's scores over all m queries. That would give me O(m·N) computational effort, but I would still have to sort all N score sums, at least partially or on the fly.
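A sketch of that probing loop, assuming the 8.x elasticsearch-py client and using a more_like_this query per probe document (the index, field, and document IDs are hypothetical); scores are accumulated per candidate document:

# Sketch of the O(m*N) probing idea: one query per probe document,
# with scores summed per candidate. All names are hypothetical.
from collections import defaultdict
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
probe_ids = ["doc-1", "doc-2", "doc-3"]  # the m probe documents
score_sums = defaultdict(float)

for pid in probe_ids:
    resp = es.search(
        index="messydata",
        query={
            "more_like_this": {
                "fields": ["text_field"],
                "like": [{"_index": "messydata", "_id": pid}],
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        },
        size=100,  # keep only the best candidates per probe
    )
    for hit in resp["hits"]["hits"]:
        score_sums[hit["_id"]] += hit["_score"]

# Inspect the highest aggregate scores as duplicate candidates.
for doc_id, total in sorted(score_sums.items(), key=lambda kv: -kv[1])[:20]:
    print(doc_id, total)

Keeping only the top hits per probe (size=100 here) bounds the sorting work to the candidates that matter, rather than all N score sums.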

Are there algorithms other than More Like This or Fuzzy Query for this problem? Links to scientific papers are also appreciated!

  • https://en.wikipedia.org/wiki/Data_deduplication - just as an introduction
  • https://discuss.elastic.co/t/finding-documents--almost--the-same/66089/2
  • https://discuss.elastic.co/t/using-fuzzy-query-to-find-near-duplicates/39075 - a question on the forum without any answer
  • https://www.compose.com/articles/how-scoring-works-in-elasticsearch/
  • https://betterexplained.com/articles/sorting-algorithms/ - for the order of the different standard search algorithms

Answer

I would suggest the quick and dirty approach of dividing your fields into 4 groups. Compute a hash of each group of fields. Unless two documents have identical hashes on at least one of these four measures, they cannot be near-duplicates.

With luck, this trick means that you only need to compare any given document against a relatively small number of others that are an exact match on a quarter of the fields.

If a clump of "matches on the same hash" is too big, you can repeat the trick with the fields that weren't part of the clump, in the hope of reducing how much work needs to be done.
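A minimal sketch of the hash-bucketing idea in plain Python, assuming for illustration 12 data fields split evenly into 4 groups (the grouping and field names are hypothetical). By the pigeonhole principle, two documents differing in at most 3 fields must agree exactly on at least one group, so every near-duplicate pair shares at least one bucket:

import hashlib
from collections import defaultdict
from itertools import combinations

FIELD_GROUPS = [  # hypothetical grouping of 12 fields into 4 groups
    ("f1", "f2", "f3"),
    ("f4", "f5", "f6"),
    ("f7", "f8", "f9"),
    ("f10", "f11", "f12"),
]

def group_hash(doc, group):
    # join the group's values with a separator unlikely to occur in data
    payload = "\x1f".join(str(doc.get(f, "")) for f in group)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()

def candidate_pairs(docs):
    # docs: {doc_id: {field: value}}; bucket documents by (group, hash)
    buckets = defaultdict(list)
    for doc_id, doc in docs.items():
        for gi, group in enumerate(FIELD_GROUPS):
            buckets[(gi, group_hash(doc, group))].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))  # only these pairs need a detailed comparison
    return pairs

Only the pairs returned here need the expensive field-by-field or edit-distance comparison; oversized buckets are the "clumps" to which the trick can be reapplied.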
