Returning logs with the latest timestamp out of groups with unique IDs


Problem description

We have groups of logs in our Elasticsearch, each group contains 1-7 logs that share a unique ID (named transactionId). Every log in every group has a unique timestamp (eventTimestamp).

For example:

{
  "transactionId": "id111",
  "eventTimestamp": "1505864112047",
  "otherfieldA": "fieldAvalue",
  "otherfieldB": "fieldBvalue"
}

{
  "transactionId": "id111",
  "eventTimestamp": "1505864112051",
  "otherfieldA": "fieldAvalue",
  "otherfieldB": "fieldBvalue"
}

{
  "transactionId": "id222",
  "eventTimestamp": "1505863719467",
  "otherfieldA": "fieldAvalue",
  "otherfieldB": "fieldBvalue"
}

{
  "transactionId": "id222",
  "eventTimestamp": "1505863719478",
  "otherfieldA": "fieldAvalue",
  "otherfieldB": "fieldBvalue"
}

I need to write a query that returns, for every transactionId in a certain date range, the log with the latest timestamp.

Continuing with my simplistic example, the result of the query should return these logs:

{
  "transactionId": "id111",
  "eventTimestamp": "1505864112051",
  "otherfieldA": "fieldAvalue",
  "otherfieldB": "fieldBvalue"
}

{
  "transactionId": "id222",
  "eventTimestamp": "1505863719478",
  "otherfieldA": "fieldAvalue",
  "otherfieldB": "fieldBvalue"
}

Any ideas on how to build a query that accomplishes this?

Recommended answer

You can get the desired result not with a query itself but with a combination of a terms aggregation and a nested top hits aggregation.

The terms aggregation is responsible for building buckets where all items with the same term are in the same bucket. This is what can generate your groups per transactionId. The top hits aggregation then is a metric aggregation that can be configured to return the x top hits of a bucket according to a given sort order. This allows you to retrieve the log event with the largest timestamp of each bucket.
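To make the two steps concrete, here is a small plain-Python sketch of what the two aggregations compute conceptually. It does not talk to Elasticsearch; the helper name latest_per_transaction is made up for illustration, and the documents are the sample logs from the question with the other fields left out.

```python
from collections import defaultdict

# Sample logs from the question (otherfieldA/otherfieldB omitted for brevity).
logs = [
    {"transactionId": "id111", "eventTimestamp": "1505864112047"},
    {"transactionId": "id111", "eventTimestamp": "1505864112051"},
    {"transactionId": "id222", "eventTimestamp": "1505863719467"},
    {"transactionId": "id222", "eventTimestamp": "1505863719478"},
]


def latest_per_transaction(docs):
    """Hypothetical helper: mimic terms + top_hits (size 1, descending sort)."""
    # Step 1 (terms aggregation): one bucket per distinct transactionId.
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc["transactionId"]].append(doc)
    # Step 2 (top_hits): keep the document with the largest timestamp per bucket.
    # Timestamps are compared as integers here; the sample stores them as strings.
    return {
        key: max(bucket, key=lambda d: int(d["eventTimestamp"]))
        for key, bucket in buckets.items()
    }


result = latest_per_transaction(logs)
print(result["id111"]["eventTimestamp"])  # 1505864112051
print(result["id222"]["eventTimestamp"])  # 1505863719478
```

Elasticsearch performs the same grouping and selection across shards, so this is only a mental model and not how the aggregation is implemented.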

Assuming the default mapping of your sample data (where each string field thekey is indexed both as thekey (analyzed text) and as thekey.keyword (non-analyzed keyword)), this query:

GET so-logs/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "eventTimestamp.keyword": {
              "gte": 1500000000000,
              "lte": 1507000000000
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "by_transaction_id": {
      "terms": {
        "field": "transactionId.keyword",
        "size": 10
      },
      "aggs": {
        "latest": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "eventTimestamp.keyword": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}

will produce the following output:

{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "by_transaction_id": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "id111",
          "doc_count": 2,
          "latest": {
            "hits": {
              "total": 2,
              "max_score": null,
              "hits": [
                {
                  "_index": "so-logs",
                  "_type": "entry",
                  "_id": "AV6z9Yj4QYbhNp_FoXa1",
                  "_score": null,
                  "_source": {
                    "transactionId": "id111",
                    "eventTimestamp": "1505864112051",
                    "otherfieldA": "fieldAvalue",
                    "otherfieldB": "fieldBvalue"
                  },
                  "sort": [
                    "1505864112051"
                  ]
                }
              ]
            }
          }
        },
        {
          "key": "id222",
          "doc_count": 2,
          "latest": {
            "hits": {
              "total": 2,
              "max_score": null,
              "hits": [
                {
                  "_index": "so-logs",
                  "_type": "entry",
                  "_id": "AV6z9ZlOQYbhNp_FoXa4",
                  "_score": null,
                  "_source": {
                    "transactionId": "id222",
                    "eventTimestamp": "1505863719478",
                    "otherfieldA": "fieldAvalue",
                    "otherfieldB": "fieldBvalue"
                  },
                  "sort": [
                    "1505863719478"
                  ]
                }
              ]
            }
          }
        }
      ]
    }
  }
}

where you can find the desired results inside of the aggregation results by_transaction_id.latest according to the aggregation names defined in the query.
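If the response is consumed from client code, pulling the logs out of that nested structure could look like the following sketch. The helper name extract_latest is hypothetical, and the response dict is a trimmed copy of the output above, parsed from JSON; only the keys that the helper reads are kept.

```python
def extract_latest(response):
    """Hypothetical helper: map each bucket key to the _source of its top hit."""
    buckets = response["aggregations"]["by_transaction_id"]["buckets"]
    latest = {}
    for bucket in buckets:
        hits = bucket["latest"]["hits"]["hits"]
        if hits:  # top_hits with size 1 yields at most one hit per bucket
            latest[bucket["key"]] = hits[0]["_source"]
    return latest


# Trimmed copy of the response shown above.
response = {
    "aggregations": {
        "by_transaction_id": {
            "buckets": [
                {
                    "key": "id111",
                    "doc_count": 2,
                    "latest": {"hits": {"hits": [{"_source": {
                        "transactionId": "id111",
                        "eventTimestamp": "1505864112051",
                    }}]}},
                },
                {
                    "key": "id222",
                    "doc_count": 2,
                    "latest": {"hits": {"hits": [{"_source": {
                        "transactionId": "id222",
                        "eventTimestamp": "1505863719478",
                    }}]}},
                },
            ]
        }
    }
}

latest_logs = extract_latest(response)
for transaction_id, log in latest_logs.items():
    print(transaction_id, log["eventTimestamp"])
```

The names by_transaction_id and latest come from the query, so they change if the aggregations are named differently.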

Please be aware that the terms aggregation has a limit on how many buckets are returned, and setting this to, say, more than 10,000 is probably not a clever idea from a performance perspective. For details, see the section on size in the terms aggregation documentation. If you want to deal with a huge number of different transaction ids, I would suggest storing the "top" entry per transaction id redundantly.
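One way to read the redundant-storage suggestion is to maintain, at write time, a second store that holds only the newest log per transaction id, so that no aggregation is needed at read time. The sketch below models that store as a plain Python dict. In Elasticsearch it might be a separate index keyed by transactionId, but that layout and the name record_log are assumptions for illustration, not something the answer specifies.

```python
def record_log(latest_store, log):
    """Hypothetical write-path hook: keep only the newest log per transaction."""
    key = log["transactionId"]
    current = latest_store.get(key)
    # Compare numerically; the sample timestamps are epoch milliseconds as strings.
    if current is None or int(log["eventTimestamp"]) > int(current["eventTimestamp"]):
        latest_store[key] = log
    return latest_store


latest_store = {}
incoming = [
    # Deliberately out of order: the newer id111 log arrives first.
    {"transactionId": "id111", "eventTimestamp": "1505864112051"},
    {"transactionId": "id111", "eventTimestamp": "1505864112047"},
    {"transactionId": "id222", "eventTimestamp": "1505863719467"},
    {"transactionId": "id222", "eventTimestamp": "1505863719478"},
]
for log in incoming:
    record_log(latest_store, log)

print(latest_store["id111"]["eventTimestamp"])  # 1505864112051
print(latest_store["id222"]["eventTimestamp"])  # 1505863719478
```

The trade-off is extra work and storage on every write in exchange for a cheap lookup when reading the latest entry per transaction.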

In addition, you should probably switch the eventTimestamp field to date for better performance and a wider set of query possibilities.
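One reason the switch matters: sorting and range filtering on a keyword field compare values as strings, which matches chronological order only while every timestamp has the same number of digits. A short Python illustration of the pitfall follows; the 12-digit value is made up for the demonstration, and the other two come from the sample data.

```python
timestamps = ["1505864112051", "999999999999", "1505863719478"]

# String comparison, which is how a keyword field orders its values.
latest_as_string = max(timestamps)

# Numeric comparison, which is what is intended for epoch milliseconds.
latest_as_number = max(timestamps, key=int)

print(latest_as_string)  # 999999999999 (12 digits, actually the oldest value)
print(latest_as_number)  # 1505864112051
```

With the sample data all timestamps have 13 digits, which is why the keyword-based query above still returns the right logs.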
