ElasticSearch how to query longest task


Question

I have the data in the following format in Elastic Search:

POST slots/slot/1
{
    "taskId": 1,
    "datetime": "2020-05-10T08:45:44",
    "status": "START"
}

POST slots/slot/2
{
    "taskId": 1,
    "datetime": "2020-05-10T08:49:54",
    "status": "STOP"
}
...

and want to find a way to retrieve the top 3 longest-running tasks (meaning tasks for which both START and STOP JSON objects exist, and the difference between their START/STOP times is the longest) - I want to retrieve taskId and runningTime (= how long the task was running).

Is it possible to achieve this in ElasticSearch? Is ElasticSearch appropriate for such kinds of tasks?

Please be lenient, I am really new to ElasticSearch technology.

Answer

This one's tricky. Let's assume that you'll have precisely 2 docs for each unique taskId, one of which will be START and the other STOP. In that case we can do the following:

GET slots/_search
{
  "size": 0,
  "aggs": {
    "by_ids": {
      "terms": {
        "field": "taskId",
        "size": 10000,
        "min_doc_count": 2
      },
      "aggs": {
        "start_bucket": {
          "filter": {
            "term": {
              "status.keyword": "START"
            }
          },
          "aggs": {
            "datetime_term": {
              "max": {
                "field": "datetime"
              }
            }
          }
        },
        "stop_bucket": {
          "filter": {
            "term": {
              "status.keyword": "STOP"
            }
          },
          "aggs": {
            "datetime_term": {
              "max": {
                "field": "datetime"
              }
            }
          }
        },
        "diff_in_millis": {
          "bucket_script": {
            "buckets_path": {
              "start": "start_bucket.datetime_term",
              "stop": "stop_bucket.datetime_term"
            },
            "script": "return params.stop - params.start"
          }
        },
        "final_sort": {
          "bucket_sort": {
            "sort": [
              {
                "diff_in_millis": {
                  "order": "desc"
                }
              }
            ],
            "size": 3
          }
        }
      }
    }
  }
}
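The aggregation's logic can be sanity-checked client-side. Below is a small Python sketch (not part of the original answer) that replicates it over hypothetical sample docs: pair each taskId's START and STOP timestamps, keep only tasks that have both (the `min_doc_count: 2` idea), and sort by duration in milliseconds.

```python
from datetime import datetime

# Hypothetical sample docs mirroring the slots index; in practice these
# would be the _source of the indexed documents.
docs = [
    {"taskId": 1, "datetime": "2020-05-10T08:45:44", "status": "START"},
    {"taskId": 1, "datetime": "2020-05-10T08:49:54", "status": "STOP"},
    {"taskId": 2, "datetime": "2020-05-10T09:00:00", "status": "START"},
    {"taskId": 2, "datetime": "2020-05-10T10:04:10", "status": "STOP"},
    {"taskId": 3, "datetime": "2020-05-10T11:00:00", "status": "START"},  # still running
]

def top_longest(docs, n=3):
    starts, stops = {}, {}
    for d in docs:
        bucket = starts if d["status"] == "START" else stops
        bucket[d["taskId"]] = datetime.fromisoformat(d["datetime"])
    # Only tasks with both a START and a STOP count, like min_doc_count: 2
    durations = {
        tid: (stops[tid] - starts[tid]).total_seconds() * 1000
        for tid in starts.keys() & stops.keys()
    }
    # Sort by duration descending and take the top n, like bucket_sort
    return sorted(durations.items(), key=lambda kv: kv[1], reverse=True)[:n]

print(top_longest(docs))  # [(2, 3850000.0), (1, 250000.0)]
```

Task 3 is dropped because it has no STOP document, and the remaining durations match the `diff_in_millis` values shown in the sample response further below.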

According to this discussion, the caveat is that this performs sorting on the final list of buckets. So if a term isn't in the list, it won't get sorted. That's in contrast to sorting on the terms agg itself, which changes the contents of the list.

In other words, we need to set the top-level size arbitrarily high so that all our taskIDs get aggregated. And/or pre-filter the context with, say, a date filter covering only the year 2020 or the last month, etc., so there's less ground to cover and we save some CPU crunch time.
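As a sketch of that pre-filtering idea (my addition, not the original answer), a `range` query on the `datetime` field can be added alongside the aggregation so only 2020 documents enter the bucketing. The request body would look like this, with the `by_ids` aggregation from above carried over unchanged:

```python
# Placeholder for the full "by_ids" terms aggregation shown in the answer
by_ids_aggs = {}

# Search request body: same aggs, but the query narrows the context
# to the year 2020 before any buckets are built.
request_body = {
    "size": 0,
    "query": {
        "range": {
            "datetime": {
                "gte": "2020-01-01T00:00:00",
                "lt": "2021-01-01T00:00:00"
            }
        }
    },
    "aggs": by_ids_aggs
}
```

Fewer documents in the aggregation context also means the top-level terms `size` doesn't have to be quite as generous.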

If everything goes right and your status has a .keyword field (more on this here) we can filter on, you'll end up with all the information you need:

{
  ...
  "aggregations":{
    "by_ids":{
      "doc_count_error_upper_bound":0,
      "sum_other_doc_count":0,
      "buckets":[
        {
          "key":2,            <-- taskID (this one was added by myself)
          "doc_count":2,
          "start_bucket":{
            ...
          },
          "stop_bucket":{
            ...
          },
          "diff_in_millis":{
            "value":3850000.0        <-- duration in millis
          }
        },
        {
          "key":1,                  <-- task from the question
          "doc_count":2,
          "start_bucket":{
            ...
          },
          "stop_bucket":{
           ...
          },
          "diff_in_millis":{
            "value":250000.0        <-- duration in millis
          }
        }
      ]
    }
  }
}
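Pulling the (taskId, runningTime) pairs out of that response is a one-liner once the JSON body is parsed. A Python sketch (the `resp` dict below is a trimmed, hypothetical version of the response above):

```python
# Trimmed version of the search response shown above
resp = {
    "aggregations": {
        "by_ids": {
            "buckets": [
                {"key": 2, "doc_count": 2, "diff_in_millis": {"value": 3850000.0}},
                {"key": 1, "doc_count": 2, "diff_in_millis": {"value": 250000.0}},
            ]
        }
    }
}

# Each bucket key is the taskId; diff_in_millis.value is the duration
results = [
    (b["key"], b["diff_in_millis"]["value"])
    for b in resp["aggregations"]["by_ids"]["buckets"]
]
print(results)  # [(2, 3850000.0), (1, 250000.0)]
```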

Edit/correction:

"min_doc_count":2 b/c我们只对实际完成的任务感兴趣.如果要包括已经运行但尚未完成的任务,请创建另一个赏金任务;)

"min_doc_count": 2 is needed b/c we're only interested in tasks that actually finished. If you want to include those that have been running and are not finished yet, create another bounty task ;)
