Duplicate documents in Elasticsearch index with the same _uid

Question

We've discovered some duplicate documents in one of our Elasticsearch indices and we haven't been able to work out the cause. There are two copies of each of the affected documents, and they have exactly the same _id, _type and _uid fields.

A GET request to /index-name/document-type/document-id just returns one copy, but searching for the document with a query like this returns two results, which is quite surprising:

POST /index-name/document-type/_search
{
  "filter": {
    "term": {
      "_id": "document-id"
    }
  }
}
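
For comparison, this is the single-document lookup described above, which returns only one copy:

GET /index-name/document-type/document-id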

Aggregating on the _uid field also identifies the duplicate documents:

POST /index-name/_search
{
  "size": 0,
  "aggs": {
    "duplicates": {
      "terms": {
        "field": "_uid",
        "min_doc_count": 2
      }
    }
  }
}

The duplicates are all on different shards. For example, a document might have one copy on primary shard 0 and one copy on primary shard 1. We've verified this by running the aggregate query above on each shard in turn using the preference parameter: it does not find any duplicates within a single shard.
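
For reference, the per-shard check looks like this, using the preference parameter to restrict the search to a single shard (shard 0 as an example):

POST /index-name/_search?preference=_shards:0
{
  "size": 0,
  "aggs": {
    "duplicates": {
      "terms": {
        "field": "_uid",
        "min_doc_count": 2
      }
    }
  }
}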

Our best guess is that something has gone wrong with the routing, but we don't understand how the copies could have been routed to different shards. According to the routing documentation, the default routing is based on the document ID, and should consistently route a document to the same shard.
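
In pseudocode, default routing in ES 2.x works roughly like this (the routing value defaults to the document's _id, and the hash is Murmur3 unless configured otherwise):

shard_num = hash(routing_value) % number_of_primary_shards

So two copies of the same document should only land on different shards if the routing value, the hash function, or the primary shard count changes.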

We are not using custom routing parameters that would override the default routing. We've double-checked this by making sure that the duplicate documents don't have a _routing field.
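
One quick way to check on ES 2.x is to request the _routing field explicitly on a document GET (a sketch using this version's fields parameter):

GET /index-name/document-type/document-id?fields=_routing

If the document was indexed without custom routing, no _routing value comes back.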

We also don't define any parent/child relationships, which would also affect routing. (See this question in the Elasticsearch forum, for example, which has the same symptoms as our problem. We don't think the cause is the same because we're not setting any document parents.)

We fixed the immediate problem by reindexing into a new index, which squashed the duplicate documents. We still have the old index around for debugging.
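
Since the cluster is on 2.4, the Reindex API (available from ES 2.3 onwards) can do this server-side. A minimal sketch, with new-index-name as a placeholder for the replacement index:

POST /_reindex
{
  "source": { "index": "index-name" },
  "dest": { "index": "new-index-name" }
}

Because every document is routed afresh in the new index, each _uid ends up on exactly one shard, which is what squashes the duplicates.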

We haven't found a way of replicating the problem. The new index is indexing documents correctly, and we've tried rerunning an overnight processing job which also updates documents, but it hasn't created any more duplicates.

The cluster has 3 nodes, 3 primary shards and 1 replica (i.e. 3 replica shards). minimum_master_nodes is set to 2, which should prevent the split-brain issue. We're running Elasticsearch 2.4 (which we know is old - we're planning to upgrade soon).
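
For reference, the quorum for 3 master-eligible nodes is (3 / 2) + 1 = 2, set in elasticsearch.yml on each node:

discovery.zen.minimum_master_nodes: 2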

Does anyone know what might cause these duplicates? Do you have any suggestions for ways to debug it?

Answer

We found the answer! The problem was that the index had unexpectedly switched the hashing algorithm it used for routing, and this caused some updated documents to be stored on different shards to their original versions.

GET /index-name/_settings showed the following:

"version": {
  "created": "1070599",
  "upgraded": "2040699"
},
"legacy": {
  "routing": {
    "use_type": "false",
    "hash": {
      "type": "org.elasticsearch.cluster.routing.DjbHashFunction"
    }
  }
}

"1070599" refers to Elasticsearch 1.7, and "2040699" is ES 2.4.

It looks like the index tried to upgrade itself from 1.7 to 2.4, despite the fact that it was already running 2.4. This is the issue described here: https://github.com/elastic/elasticsearch/issues/18459#issuecomment-220313383

We think this is what happened to trigger the change:


  1. Back when we upgraded the index from ES 1.7 to 2.4, we decided not to upgrade Elasticsearch in-place, since that would cause downtime. Instead, we created a separate ES 2.4 cluster.

  2. We loaded data into the new cluster using a tool that copied over all the index settings as well as the data, including the version setting, which you should not set in ES 2.4.

  3. While dealing with a recent issue, we happened to close and reopen the index. This normally preserves all the data, but because of the incorrect version setting, it caused Elasticsearch to think that an upgrade was in progress.

  4. ES automatically set the legacy.routing.hash.type setting because of the false upgrade. This meant that any data indexed after this point used the old DjbHashFunction instead of the default Murmur3HashFunction, which had been used to route the data originally.

This means that reindexing the data into a new index was the right thing to do to fix the issue. The new index has the correct version setting and no legacy hash function settings:

"version": {
  "created": "2040699"
}
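
As a final check, re-running the _uid aggregation from the question against the new index (new-index-name is a placeholder) should return no buckets:

POST /new-index-name/_search
{
  "size": 0,
  "aggs": {
    "duplicates": {
      "terms": {
        "field": "_uid",
        "min_doc_count": 2
      }
    }
  }
}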
