Elasticsearch: Remove duplicates from index


Question

I have an index with multiple duplicate entries. They have different IDs, but all other fields have identical content.

For example:

{id: 1, content: 'content1'}
{id: 2, content: 'content1'}
{id: 3, content: 'content2'}
{id: 4, content: 'content2'}

After removing duplicates:

{id: 1, content: 'content1'}
{id: 3, content: 'content2'}

Is there a way to delete all duplicates and keep only one distinct entry, without manually comparing all entries?

Recommended answer

I use Rails, and when necessary I re-import everything with the FORCE=y command, which removes and re-indexes everything for that index and type. I'm not sure what environment you are running ES in, though. The only issue I can see is if the data source you are importing from (i.e. the database) itself contains duplicate records. I would first check whether the data source can be fixed; if that is feasible, fix it and re-index everything. Otherwise, you could write a custom import method that indexes only one of the duplicate items for each record.
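To make the "custom import method" idea a bit more concrete, here is a minimal Python sketch of the deduplication step on its own. It is not tied to Rails or to any Elasticsearch client; the helper names `content_fingerprint` and `dedupe_records` are hypothetical, and it assumes that two records count as duplicates when every field except `id` is equal.

```python
import hashlib
import json


def content_fingerprint(record, ignore_fields=("id",)):
    """Return a stable hash of every field except those in ignore_fields."""
    payload = {k: v for k, v in record.items() if k not in ignore_fields}
    # sort_keys makes the JSON text (and so the hash) independent of key order
    encoded = json.dumps(payload, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha1(encoded).hexdigest()


def dedupe_records(records, ignore_fields=("id",)):
    """Keep only the first record seen for each distinct fingerprint."""
    seen = set()
    unique = []
    for record in records:
        fingerprint = content_fingerprint(record, ignore_fields)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(record)
    return unique


# The example data from the question
records = [
    {"id": 1, "content": "content1"},
    {"id": 2, "content": "content1"},
    {"id": 3, "content": "content2"},
    {"id": 4, "content": "content2"},
]

unique_records = dedupe_records(records)
print(unique_records)  # keeps the records with id 1 and id 3
```

One possible variation: use the fingerprint itself as the document ID when indexing. Indexing a document under an ID that already exists replaces the old document, so a re-import of the same content would not create a second copy.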

Furthermore, and I know this isn't the same as removing the duplicate entries, you could customize your search so that it returns only one of the duplicates, either by picking the most recent "timestamp" or by indexing deduplicated data and grouping by your content field; see if this post helps. This would still leave the duplicate records in your index, but at least they won't come up in the search results.
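As a rough illustration of the search-side option, the sketch below builds a search request body that uses Elasticsearch's field collapsing (the `collapse` search option, available in reasonably recent versions as far as I know) so that only the top-sorted hit per distinct content value is returned. The function name `build_collapsed_search` is hypothetical, and the field names are assumptions: it presumes a `timestamp` field exists and that `content` has a `keyword` sub-field, since collapsing needs a keyword or numeric field.

```python
import json


def build_collapsed_search(query_text,
                           content_field="content",
                           collapse_field="content.keyword",
                           timestamp_field="timestamp"):
    """Build a search body that returns one hit per distinct content value."""
    return {
        # Full-text match on the (assumed) content field
        "query": {"match": {content_field: query_text}},
        # Return only the top-sorted document for each distinct value
        "collapse": {"field": collapse_field},
        # Newest first, so the hit kept per group is the most recent one
        "sort": [{timestamp_field: {"order": "desc"}}],
    }


body = build_collapsed_search("content1")
print(json.dumps(body, indent=2))
```

The resulting dictionary would be sent as the body of a search request with whatever client you use. The duplicates stay in the index; they are just hidden from the results.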

I also found this: Elasticsearch delete duplicates

I tried to think of as many scenarios as possible so you can see whether any of these options works, or could at least serve as a temporary fix.
