Fuzzy duplicate search with ElasticSearch


Question

I have a rather big dataset of N documents, with fewer than 1% of them being near-duplicates, which I want to identify. I have many numeric fields and a few text fields. I consider two documents in the data set close if...

  1. all but one, two, or three data fields are fully identical, or
  2. the corresponding text fields of the two documents are only a few edits apart (that is the Levenshtein distance used by ElasticSearch; a minimal query sketch follows this list).
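For criterion (2), ElasticSearch expresses a bounded edit distance through the fuzziness parameter of a match query. A minimal sketch, assuming the 8.x elasticsearch-py client and hypothetical index and field names (messydata, text_field):

# Minimal sketch: fuzzy match on one text field. The index name
# "messydata" and the field "text_field" are hypothetical placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="messydata",
    query={
        "match": {
            "text_field": {
                "query": "text of a probe document",
                # "AUTO" lets ElasticSearch pick a Levenshtein edit
                # distance of 0-2 depending on term length
                "fuzziness": "AUTO",
            }
        }
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_id"])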

How would you approach this challenge of identifying fuzzy duplicates with ElasticSearch?

I already struggle to write a (general) ElasticSearch query for part (1) that does not explicitly use the field names. Do I really have to build a huge query of the following pattern, or is there a smarter way?

( SELECT * FROM MessyData AS T1
  JOIN MessyData AS T2
  WHERE T1.F1 != T2.F1 AND T1.F2 = T2.F2 AND T1.F3 = T2.F3 AND ... )
UNION ALL
( SELECT * FROM MessyData AS T1
  JOIN MessyData AS T2
  WHERE T1.F1 = T2.F1 AND T1.F2 != T2.F2 AND T1.F3 = T2.F3 AND ... )
UNION ALL
( SELECT * FROM MessyData AS T1
  JOIN MessyData AS T2
  WHERE T1.F1 = T2.F1 AND T1.F2 = T2.F2 AND T1.F3 != T2.F3 AND ... )
UNION ALL
( ... )

Note: I used SQL pseudocode to show what I mean for the case where all fields except one are identical. F stands for field and T for table; in ElasticSearch it would be an index.
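One way to avoid that combinatorial union is a single bool query: take the field values of one probe document as should clauses and require all but a few of them to match via minimum_should_match. A minimal sketch, assuming keyword-mapped data fields and hypothetical field names f1..f4:

# Sketch: "all but at most 3 data fields are identical" expressed as
# one bool query. The probe document and its field names are
# hypothetical placeholders.
probe = {"f1": 42, "f2": "red", "f3": 7, "f4": "2019-01-01"}
max_mismatches = 3  # tolerate up to 3 differing data fields

query = {
    "bool": {
        "should": [
            {"term": {field: value}} for field, value in probe.items()
        ],
        # at least (number of fields - max_mismatches) clauses must match
        "minimum_should_match": len(probe) - max_mismatches,
    }
}

This still means one query per probe document, but it collapses the union of per-field variants into a single request.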

Calculating dendrograms, or using another similarity measure that compares each document with every other one, gives me a computational effort of N·(N-1) and is therefore not feasible.

The approach I am considering for part (2) is to probe my data set with m test documents (where m is much smaller than N) and add up ElasticSearch's scores over all m queries. That would give me O(m·N) computational effort, but I would still have to sort all N score sums, at least partially or on the fly.
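A sketch of that probing loop, assuming the 8.x elasticsearch-py client and using a more_like_this query per probe document (the index, field, and document IDs are hypothetical); scores are accumulated per candidate document:

# Sketch of the O(m*N) probing idea: one query per probe document,
# with scores summed per candidate. All names are hypothetical.
from collections import defaultdict
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
probe_ids = ["doc-1", "doc-2", "doc-3"]  # the m probe documents
score_sums = defaultdict(float)

for pid in probe_ids:
    resp = es.search(
        index="messydata",
        query={
            "more_like_this": {
                "fields": ["text_field"],
                "like": [{"_index": "messydata", "_id": pid}],
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        },
        size=100,  # keep only the best candidates per probe
    )
    for hit in resp["hits"]["hits"]:
        score_sums[hit["_id"]] += hit["_score"]

# Inspect the highest aggregate scores as duplicate candidates.
for doc_id, total in sorted(score_sums.items(), key=lambda kv: -kv[1])[:20]:
    print(doc_id, total)

Keeping only the top hits per probe (size=100 here) bounds the sorting work to the candidates that matter, rather than all N score sums.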

Are there algorithms other than More Like This or Fuzzy Query for this problem? Links to scientific papers are also appreciated!

  • https://en.wikipedia.org/wiki/Data_deduplication - just as an introduction
  • https://discuss.elastic.co/t/finding-documents--almost--the-same/66089/2
  • https://discuss.elastic.co/t/using-fuzzy-query-to-find-near-duplicates/39075 - a question on the forum without any answer
  • https://www.compose.com/articles/how-scoring-works-in-elasticsearch/
  • https://betterexplained.com/articles/sorting-algorithms/ - for the order of the different standard search algorithms

Answer

I would suggest the quick and dirty approach of dividing your fields into 4 groups. Compute a hash of each group of fields. Unless two documents have identical hashes on at least one of these four measures, they cannot be near-duplicates.

With luck, this trick means that you only need to compare any given document against a relatively small number of others that are an exact match on a quarter of the fields.

If a clump of "matches on the same hash" is too big, you can repeat the trick with the fields that weren't part of the clump, in the hope of reducing how much work needs to be done.
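A minimal sketch of the hash-bucketing idea in plain Python, assuming for illustration 12 data fields split evenly into 4 groups (the grouping and field names are hypothetical). By the pigeonhole principle, two documents differing in at most 3 fields must agree exactly on at least one group, so every near-duplicate pair shares at least one bucket:

import hashlib
from collections import defaultdict
from itertools import combinations

FIELD_GROUPS = [  # hypothetical grouping of 12 fields into 4 groups
    ("f1", "f2", "f3"),
    ("f4", "f5", "f6"),
    ("f7", "f8", "f9"),
    ("f10", "f11", "f12"),
]

def group_hash(doc, group):
    # join the group's values with a separator unlikely to occur in data
    payload = "\x1f".join(str(doc.get(f, "")) for f in group)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()

def candidate_pairs(docs):
    # docs: {doc_id: {field: value}}; bucket documents by (group, hash)
    buckets = defaultdict(list)
    for doc_id, doc in docs.items():
        for gi, group in enumerate(FIELD_GROUPS):
            buckets[(gi, group_hash(doc, group))].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))  # only these pairs need a detailed comparison
    return pairs

Only the pairs returned here need the expensive field-by-field or edit-distance comparison; oversized buckets are the "clumps" to which the trick can be reapplied.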
