Applying "tag" to millions of documents, using bulk/update methods


Problem Description




We have about 55,000,000 documents in our ElasticSearch instance. We have a CSV file with user_ids; the biggest CSV has 9M entries. Our documents are keyed by user_id, so this is convenient.

I am posting this question because I want to discuss the best option to get this done, as there are different ways to address the problem. We need to add a new "label" to a user's document if it doesn't have it yet, e.g. tagging the user with "stackoverflow" or "github".

  1. There is the classic partial update endpoint. This sounds slow, as we would need to iterate over 9M user_ids and issue an API call for each of them.
  2. There is the bulk request, which offers better performance but is limited to roughly 1,000-5,000 documents per call, and knowing when a batch is too large is something we would have to learn on the go.
  3. Then there is the official open issue for the /update_by_query endpoint, which has lots of traffic but no confirmation that it was implemented in a standard release.
  4. On that open issue there is a mention of an update_by_query plugin, which should provide better handling, but there are old and still-open issues where users complain of performance problems and memory issues.
  5. I am not sure whether it's doable in ElasticSearch, but I thought I could load all the CSV entries into a separate index, somehow join the two indexes, and apply a script that adds the tag if it doesn't exist yet.
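Option 2 above can be sketched without a client library: each `_bulk` call is an NDJSON body of action/partial-document line pairs, sent in fixed-size chunks. A minimal sketch of building those bodies, assuming a hypothetical index `users`, type `user`, and `label` field (none of these names come from the question):

```python
import json

def bulk_update_body(user_ids, tag, index="users", doc_type="user"):
    """Build the NDJSON body for one _bulk call that adds `tag` to each user.

    Each document contributes two lines: an action line identifying the doc,
    and a partial document that is merged into the existing source.
    """
    lines = []
    for uid in user_ids:
        lines.append(json.dumps({"update": {"_index": index, "_type": doc_type, "_id": uid}}))
        lines.append(json.dumps({"doc": {"label": tag}}))
    # The _bulk endpoint requires the body to end with a newline.
    return "\n".join(lines) + "\n"

def chunked(seq, size=1000):
    """Yield fixed-size chunks so each call stays inside the 1,000-5,000 range."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]
```

Each chunk's body would then be POSTed to `localhost:9200/_bulk`; the right chunk size still has to be found empirically, as the question notes.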

So the question remains: what's the best way to do this? If some of you have done this in the past, please share your numbers/performance and what you would do differently this time.

Solution

Using the aforementioned update-by-query plugin, you would simply call:

curl -XPOST localhost:9200/index/type/_update_by_query -d '{
    "query": {"filtered": {"filter":{
        "not": {"term": {"tag": "github"}}
    }}},
    "script": "ctx._source.label = \"github\""
}'

The update-by-query plugin only accepts a script, not partial documents.
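Since the question is about tagging only the user_ids from a CSV batch rather than the whole index, the query above could presumably be scoped with a `terms` filter on `_id` combined with the same `not` filter. A sketch building that payload in Python; the `tag`/`label` field names are taken from the answer above, while the `and` + `terms` filter combination is my assumption about the old filtered-query syntax:

```python
import json

def update_by_query_payload(user_ids, tag):
    """Payload matching users from this batch that don't have the tag yet;
    the script then adds it, mirroring the curl example above."""
    return json.dumps({
        "query": {"filtered": {"filter": {"and": [
            {"terms": {"_id": user_ids}},      # restrict to this CSV chunk
            {"not": {"term": {"tag": tag}}},   # skip users already tagged
        ]}}},
        "script": 'ctx._source.label = "%s"' % tag,
    })
```

One such payload per CSV chunk would keep the matched set small, at the cost of one `_update_by_query` call per chunk.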

As for performance and memory issues, I guess the best thing is to give it a try.

