elasticsearch allows duplicate ID with different body data

I'm currently attempting to migrate our elasticsearch data to be 2.0 compatible (i.e. no dots in field names) in preparation for an upgrade from 1.x to 2.x.

I've written a program that runs through the data (in batches) sitting in a one-node cluster, renames the fields, and re-indexes the documents using the Bulk API.

At some point it all goes wrong: the total number of documents coming back from my query (for documents still to be "upgraded") doesn't change, even though it should be counting down.

Initially I thought the migration simply wasn't working, but when I pick a single document and query for it to see whether it's changing, I can see that it is.

However, when I query for a specific field within that document, I get two results with the same ID. One of the results has the upgraded field; the other does not.

On further inspection I can see that they come from different shards:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 19.059433,
    "hits" : [ {
      "_shard" : 0,
      "_node" : "FxbpjCyQRzKfA9QvBbSsmA",
      "_index" : "status",
      "_type" : "status",
      "_id" : "http://static.photosite.com/80018335.jpg",
      "_version" : 2,
      "_score" : 19.059433,
      "_source":{"url":"http://static.photosite.com/80018335.jpg","metadata":{"url.path":["http://www.photosite.com/80018335"],"source":["http://www.photosite.com/80018335"],"longitude":["104.507755"],"latitude":["21.601669"]}},
      ...
    }, {
      "_shard" : 3,
      "_node" : "FxbpjCyQRzKfA9QvBbSsmA",
      "_index" : "status",
      "_type" : "status",
      "_id" : "http://static.photosite.com/80018335.jpg",
      "_version" : 27,
      "_score" : 17.607681,
      "_source":{"url":"http://static.photosite.com/80018335.jpg","metadata":{"url_path":["http://www.photosite.com/80018335"],"source":["http://www.photosite.com/80018335"],"longitude":["104.507755"],"latitude":["21.601669"]}},
      ...      
    } ]
  }
}

How can I prevent this from happening?

elasticsearch version: 1.7.3

query:

{
  "bool" : {
    "must" : {
      "wildcard" : {
        "metadata.url.path" : "*"
      }
    },
    "must_not" : {
      "wildcard" : {
        "metadata.url_path" : "*"
      }
    }
  }
}
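
For reference, an equivalent of that query through the 1.x Java API would look something like the sketch below (it uses the standard QueryBuilders helpers; the client variable and the size are assumptions, not the program's exact code):

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.index.query.QueryBuilders;

// Documents that still have the old dotted field but not the renamed one
SearchResponse searchResponse = client.prepareSearch("status")
        .setTypes("status")
        .setQuery(QueryBuilders.boolQuery()
                .must(QueryBuilders.wildcardQuery("metadata.url.path", "*"))
                .mustNot(QueryBuilders.wildcardQuery("metadata.url_path", "*")))
        .setSize(5000)
        .execute().actionGet();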

Code to write the document:

        BulkRequestBuilder bulkRequest = destinationConnection.getClient().prepareBulk();
        for(Map<String, Object> doc : batch.getDocs()){
            XContentBuilder builder;
            try {
                // Copy every field except "id" into the new document body
                builder = XContentFactory.jsonBuilder().startObject();
                for(Map.Entry<String, Object> mapEntry : doc.entrySet()){
                    if(!mapEntry.getKey().equals("id")){
                        builder.field(mapEntry.getKey(), mapEntry.getValue());
                    }
                }
                builder.endObject();
            } catch (IOException e) {
                throw new DocumentBuilderException("Error building request to move items to new parent!", e);
            }

            // Re-index under the same ID (note: no routing is set here)
            bulkRequest.add(destinationConnection.getClient().prepareIndex(destinationIndex, destinationType, (String) doc.get("id")).setSource(builder).request());

        }
        // Tried with and without setRefresh
        BulkResponse response = bulkRequest.setRefresh(true).execute().actionGet();
        for(BulkItemResponse itemResponse : response.getItems()){
            if(itemResponse.isFailed()){
                LOG.error("Updating item: {} failed: {}", itemResponse.getFailure().getId(), itemResponse.getFailureMessage());
            }
        }

Update
Could it be refresh/query speed?

The program is set to process 5000 document-batches, and is not using a scroll query, so I'd be expecting the total number of results coming back from that query to be reduced by 5000 every iteration.
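
(Since there's no scroll involved, each batch comes from a fresh search. For reference, a scan/scroll iteration in the 1.x Java API would look roughly like this sketch, where client is the transport Client and query is the bool/wildcard query above; both names are assumptions:)

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.common.unit.TimeValue;

// Open a scan/scroll over everything the query matches
SearchResponse scrollResp = client.prepareSearch("status")
        .setSearchType(SearchType.SCAN)
        .setScroll(new TimeValue(60000))
        .setQuery(query)
        .setSize(1000)   // per shard, per scroll page
        .execute().actionGet();

while (true) {
    scrollResp = client.prepareSearchScroll(scrollResp.getScrollId())
            .setScroll(new TimeValue(60000))
            .execute().actionGet();
    if (scrollResp.getHits().getHits().length == 0) {
        break;   // scroll exhausted
    }
    // ... rename fields and bulk re-index scrollResp.getHits() here ...
}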

In actual fact this is not happening. The number of documents removed from the total result set shrinks with every iteration, until eventually the total is barely moving at all:

10:43:42.220  INFO : Fetching another batch
10:43:51.701  INFO : Found 9260992 matching documents. Processing 5000...
10:43:51.794  INFO : Total remaining: 9260992
10:43:51.813  INFO : Writing batch of 5000 items
10:43:57.261  INFO : Fetching another batch
10:44:06.136  INFO : Found 9258661 matching documents. Processing 5000...
10:44:06.154  INFO : Total remaining: 9258661
10:44:06.158  INFO : Writing batch of 5000 items
10:44:11.369  INFO : Fetching another batch
10:44:19.790  INFO : Found 9256813 matching documents. Processing 5000...
10:44:19.804  INFO : Total remaining: 9256813
10:44:19.807  INFO : Writing batch of 5000 items
10:44:22.684  INFO : Fetching another batch
10:44:31.182  INFO : Found 9255697 matching documents. Processing 5000...
10:44:31.193  INFO : Total remaining: 9255697
10:44:31.196  INFO : Writing batch of 5000 items
10:44:33.852  INFO : Fetching another batch
10:44:42.394  INFO : Found 9255115 matching documents. Processing 5000...
10:44:42.406  INFO : Total remaining: 9255115
10:44:42.409  INFO : Writing batch of 5000 items
10:44:45.152  INFO : Fetching another batch
10:44:51.473  INFO : Found 9254744 matching documents. Processing 5000...
10:44:51.483  INFO : Total remaining: 9254744
10:44:51.486  INFO : Writing batch of 5000 items
10:44:53.853  INFO : Fetching another batch
10:44:59.966  INFO : Found 9254551 matching documents. Processing 5000...
10:44:59.978  INFO : Total remaining: 9254551
10:44:59.981  INFO : Writing batch of 5000 items
10:45:02.446  INFO : Fetching another batch
10:45:07.773  INFO : Found 9254445 matching documents. Processing 5000...
10:45:07.787  INFO : Total remaining: 9254445
10:45:07.791  INFO : Writing batch of 5000 items
10:45:10.237  INFO : Fetching another batch
10:45:15.679  INFO : Found 9254384 matching documents. Processing 5000...
10:45:15.703  INFO : Total remaining: 9254384
10:45:15.712  INFO : Writing batch of 5000 items
10:45:18.078  INFO : Fetching another batch
10:45:23.660  INFO : Found 9254359 matching documents. Processing 5000...
10:45:23.712  INFO : Total remaining: 9254359
10:45:23.725  INFO : Writing batch of 5000 items
10:45:26.520  INFO : Fetching another batch
10:45:31.895  INFO : Found 9254343 matching documents. Processing 5000...
10:45:31.905  INFO : Total remaining: 9254343
10:45:31.908  INFO : Writing batch of 5000 items
10:45:34.279  INFO : Fetching another batch
10:45:40.121  INFO : Found 9254333 matching documents. Processing 5000...
10:45:40.136  INFO : Total remaining: 9254333
10:45:40.139  INFO : Writing batch of 5000 items
10:45:42.381  INFO : Fetching another batch
10:45:47.798  INFO : Found 9254325 matching documents. Processing 5000...
10:45:47.823  INFO : Total remaining: 9254325
10:45:47.833  INFO : Writing batch of 5000 items
10:45:50.370  INFO : Fetching another batch
10:45:57.105  INFO : Found 9254321 matching documents. Processing 5000...
10:45:57.117  INFO : Total remaining: 9254321
10:45:57.121  INFO : Writing batch of 5000 items
10:45:59.459  INFO : Fetching another batch

It looks as though document duplication is rife from the outset.

I've just tried a two-node cluster with cluster health status: green, and the same thing happens.

I'm going to try a single node with no replication next.

Update:
Here is an example of a single item as seen by the bulk processor listener, before and after the request:

Before:

Item( id=http://static.photosite.com/20160123_093502.jpg, index=status, type=status, op_type=INDEX, version=-3, parent=null, routing=null )

After (BulkResponse indicated no failures):

Item( id=http://static.photosite.com/20160123_093502.jpg, index=status, type=status, op_type=index, version=22)

Things of note:

  1. No parent
  2. No routing
  3. Massive jump in document version

What this snippet doesn't show is that each item in the beforeBulk request is represented as a successful IndexRequest in the afterBulk request details (i.e. none are missing).

Update 2

I think that the initial negative version might have something to do with it: https://discuss.elastic.co/t/negative-version-number-on-snapshot-restore-from-s3-bucket/56642

Update 3

I've just discovered that when I query the documents using curl the versions are positive, i.e.:

  1. Restore the snapshot.
  2. Query for the document using curl: the version is 2.
  3. Query for the document using the Java API: the version is -1 (a sketch of this check follows the list).
  4. Re-indexing the document causes a duplicate (a new document with the same ID written to a different shard) with a version of 1.
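
For reference, the Java-side version check in step 3 can be done with a get by ID (a sketch; docId is hypothetical and stands in for the document URL used as the _id):

import org.elasticsearch.action.get.GetResponse;

// Fetch the document by ID and print the version the Java API reports;
// compare it with the _version that curl shows for the same document
GetResponse getResponse = client.prepareGet("status", "status", docId)
        .execute().actionGet();
System.out.println(getResponse.getVersion());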

What's happening here?

Solution

Executive Summary:
I am an idiot.

Details:
I started today by learning how elasticsearch routes documents to shards.

It turns out that it uses the following formula: shard = hash(routing) % number_of_primary_shards

By default, routing is the _id of the document, unless you override that when indexing.
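
To make that concrete, here is a toy sketch of the idea. The hash below is a stand-in (Elasticsearch's real hash function is internal and version-specific), and the routing value is hypothetical; the point is simply that the same _id can land on different shards when indexed with different routing values:

// Toy sketch only: NOT Elasticsearch's real hash function.
public class RoutingDemo {

    static int shardFor(String routing, int numberOfPrimaryShards) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(routing.hashCode(), numberOfPrimaryShards);
    }

    public static void main(String[] args) {
        String id = "http://static.photosite.com/80018335.jpg";
        int withExplicitRouting = shardFor("some-routing-key", 5); // hypothetical routing value
        int withDefaultRouting  = shardFor(id, 5);                 // default: routing == _id
        System.out.println(withExplicitRouting + " vs " + withDefaultRouting);
    }
}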

Everyone kept suggesting that I was using routing, but I was adamant that I was not. And that was the problem!!!

I had restored a snapshot of data. The data in the index I was attempting to upgrade was originally written by a program called stormcrawler.

stormcrawler does use routing to index these documents, but because I wasn't using routing to re-index them, it was creating apparent duplicates on different shards.
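
So the fix is to carry the original routing value through the re-index. A minimal sketch of that with the 1.x Java API, assuming the search request asks for the _routing field back via addField("_routing") (the renamedSource helper is hypothetical):

import org.elasticsearch.action.index.IndexRequestBuilder;
import org.elasticsearch.search.SearchHit;

for (SearchHit hit : searchResponse.getHits()) {
    String routing = hit.field("_routing") != null
            ? (String) hit.field("_routing").getValue()
            : null;

    IndexRequestBuilder indexRequest = destinationConnection.getClient()
            .prepareIndex(destinationIndex, destinationType, hit.getId())
            .setSource(renamedSource(hit));   // hypothetical field-renaming helper
    if (routing != null) {
        // Preserve the original routing so the re-indexed document
        // lands on the same shard as the original
        indexRequest.setRouting(routing);
    }
    bulkRequest.add(indexRequest.request());
}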

Once again, elasticsearch rules and I suck.

Sorry for everyone whose time I wasted on this. I'm now going to lie down in a dark room and cry.
