Update nested field for millions of documents

Problem description

I use bulk updates with a script to update a nested field, but this is very slow:

POST index/type/_bulk

{"update":{"_id":"1"}}
{"script":{"inline":"ctx._source.nestedfield.add(params.nestedfield)","params":{"nestedfield":{"field1":"1","field2":"2"}}}}
{"update":{"_id":"2"}}
{"script":{"inline":"ctx._source.nestedfield.add(params.nestedfield)","params":{"nestedfield":{"field1":"3","field2":"4"}}}}

... [many more, split into several batches]

Do you know of another way that could be faster?

It seems possible to store the script so that it is not repeated for each update, but I couldn't find a way to keep the params "dynamic".

Solution

As is often the case with performance optimization questions, there is no single answer, since poor performance can have many possible causes.

In your case, you are making bulk update requests. When an update is performed, the document is actually re-indexed:

... to update a document is to retrieve it, change it, and then reindex the whole document.

Hence it makes sense to take a look at indexing performance tuning tips. The first few things I would consider in your case are selecting the right bulk size, using several threads for bulk requests, and increasing or disabling the index refresh interval.
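As a rough illustration of the refresh-interval tip, the sketch below disables refresh for the duration of a bulk load and restores it afterwards. It assumes an already-constructed Python Elasticsearch client whose `indices.put_settings` accepts `index` and `body` arguments; the helper name `paused_refresh` and the `"1s"` restore value are hypothetical choices for this example, not part of the original answer.

```python
from contextlib import contextmanager


@contextmanager
def paused_refresh(client, index, restore_to="1s"):
    """Disable index refresh while the body of the `with` block runs.

    `client` is assumed to be an Elasticsearch client instance; `restore_to`
    is a hypothetical default and should match the index's usual setting.
    """
    # "-1" turns periodic refresh off, so newly indexed documents are not
    # made searchable until refresh is enabled again.
    client.indices.put_settings(
        index=index, body={"index": {"refresh_interval": "-1"}}
    )
    try:
        yield
    finally:
        # Always restore the interval, even if the bulk load fails midway.
        client.indices.put_settings(
            index=index, body={"index": {"refresh_interval": restore_to}}
        )
```

The bulk requests would then run inside `with paused_refresh(client, "index"):`. Whether this helps depends on how much of the time is actually spent on refreshes, so it is worth measuring before and after.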

You might also consider using a ready-made client that supports parallel bulk requests, as the Python elasticsearch client does.
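A minimal sketch of what this could look like with the `parallel_bulk` helper from the Python client, reusing the script from the question. The function names `build_update_actions` and `run_parallel_updates`, and the `thread_count` and `chunk_size` values, are illustrative assumptions to be tuned against your own cluster. The `_type` key is only relevant on older Elasticsearch versions that still use mapping types.

```python
def build_update_actions(index, updates, doc_type=None):
    """Yield one scripted bulk 'update' action per (doc_id, nested_item) pair."""
    for doc_id, nested_item in updates:
        action = {
            "_op_type": "update",
            "_index": index,
            "_id": doc_id,
            "script": {
                # "inline" matches the question; newer Elasticsearch
                # versions name this key "source" instead.
                "inline": "ctx._source.nestedfield.add(params.nestedfield)",
                "params": {"nestedfield": nested_item},
            },
        }
        if doc_type is not None:
            action["_type"] = doc_type
        yield action


def run_parallel_updates(client, actions, thread_count=4, chunk_size=500):
    """Send the actions through several threads; return the success count."""
    # Imported here so the action builder above works without the package.
    from elasticsearch.helpers import parallel_bulk

    succeeded = 0
    # parallel_bulk is a generator and must be consumed for anything to be sent.
    for ok, _item in parallel_bulk(
        client, actions, thread_count=thread_count, chunk_size=chunk_size
    ):
        if ok:
            succeeded += 1
    return succeeded
```

For example, `run_parallel_updates(client, build_update_actions("index", pairs, doc_type="type"))` would send the updates in chunks of 500 across four threads, where `pairs` is any iterable of `(id, nested_item)` tuples.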

Ideally, you would monitor Elasticsearch performance metrics to understand where the bottleneck is and whether your performance tweaks yield an actual gain. An overview blog post about Elasticsearch performance metrics is a good place to start.
