Update nested field for millions of documents
I use a scripted bulk update to update a nested field, but it is very slow:
POST index/type/_bulk
{"update":{"_id":"1"}}
{"script":{"inline":"ctx._source.nestedfield.add(params.nestedfield)","params":{"nestedfield":{"field1":"1","field2":"2"}}}}
{"update":{"_id":"2"}}
{"script":{"inline":"ctx._source.nestedfield.add(params.nestedfield)","params":{"nestedfield":{"field1":"3","field2":"4"}}}}
... [a lot more, split into several batches]
Do you know of another way that could be faster?
It seems possible to store the script so that it is not repeated for each update, but I couldn't find a way to keep "dynamic" params.
As is often the case with performance optimization questions, there is no single answer, since there are many possible causes of poor performance.
In your case you are making bulk update requests. When an update is performed, the document is actually being re-indexed:
... to update a document is to retrieve it, change it, and then reindex the whole document.
Hence it makes sense to take a look at indexing performance tuning tips. The first few things I would consider in your case are selecting the right bulk size, using several threads for bulk requests, and increasing or disabling the index refresh interval.
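As a rough illustration of the refresh-interval tip, here is a minimal Python sketch that disables refresh while the bulk updates run and restores it afterwards. It assumes a client object exposing indices.put_settings(index=..., body=...) in the style of the Python elasticsearch client (newer client releases may prefer a different keyword than body). The helper name refresh_paused and the restore value of "1s" (the Elasticsearch default) are my own assumptions, not something from your setup.

```python
from contextlib import contextmanager


@contextmanager
def refresh_paused(client, index, restore="1s"):
    """Disable index refresh for the duration of the block, then restore it.

    `client` is assumed to expose `indices.put_settings(index=..., body=...)`.
    `restore` is assumed to be the interval the index used before ("1s" is
    the Elasticsearch default); pass your own value if you changed it.
    """
    # "-1" turns periodic refresh off, so segments are not rebuilt mid-load.
    client.indices.put_settings(
        index=index, body={"index": {"refresh_interval": "-1"}}
    )
    try:
        yield
    finally:
        # Always put the interval back, even if the bulk load raised.
        client.indices.put_settings(
            index=index, body={"index": {"refresh_interval": restore}}
        )
```

You would wrap the whole bulk load in "with refresh_paused(es, "index"):" so that the setting is restored even when a batch fails. Updated documents only become searchable once refresh is enabled again.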
You might also consider using a ready-made client that supports parallel bulk requests, as the Python elasticsearch client does.
Ideally, you would monitor Elasticsearch performance metrics to understand where the bottleneck is and whether your performance tweaks are giving an actual gain. Here is an overview blog post about Elasticsearch performance metrics.
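To make the parallel bulk idea concrete, here is a hedged sketch using parallel_bulk from elasticsearch.helpers. The script text is taken from your question. The function names update_actions and run_parallel_updates, the rows input format (pairs of document id and nested object), and the default thread count and chunk size are assumptions for illustration. The script key is kept as "inline" to match your request; newer Elasticsearch releases name it "source". If your cluster still uses mapping types, you would also add a "_type" entry to each action.

```python
SCRIPT = "ctx._source.nestedfield.add(params.nestedfield)"


def update_actions(index, rows):
    """Yield one scripted-update action per (doc_id, nested_object) pair."""
    for doc_id, nested in rows:
        yield {
            "_op_type": "update",
            "_index": index,
            "_id": doc_id,
            # Same script for every document; only the params differ.
            "script": {"inline": SCRIPT, "params": {"nestedfield": nested}},
        }


def run_parallel_updates(es_url, index, rows, threads=4, chunk_size=1000):
    """Send the updates from several threads and return the failure count."""
    # Imported here so that update_actions can be used without the client.
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import parallel_bulk

    es = Elasticsearch(es_url)
    failed = 0
    # parallel_bulk is lazy: the loop must consume it for requests to be sent.
    for ok, _item in parallel_bulk(
        es,
        update_actions(index, rows),
        thread_count=threads,
        chunk_size=chunk_size,
        raise_on_error=False,
    ):
        if not ok:
            failed += 1
    return failed
```

A call might look like run_parallel_updates("http://localhost:9200", "index", rows), where the URL is a placeholder. The thread count and chunk size are starting points to tune against your own measurements, not recommended values.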