How to return the count of unique documents using an Elasticsearch aggregation

Question

I ran into a problem: Elasticsearch does not return the count of unique documents when I use only a terms aggregation on a nested field.

Here is an example of our model:

{
    ...,
    "location" : [
        {"city" : "new york", "state" : "ny"},
        {"city" : "woodbury", "state" : "ny"},
        ...
    ],
    ...
}

I want to aggregate on the state field, but this document will be counted twice in the 'ny' bucket because 'ny' appears twice in the document.

So I'm wondering if there is a way to get the count of distinct documents.
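To make the double counting concrete, here is a minimal Python sketch. It is not Elasticsearch code; it only mimics the difference between counting nested location objects per state and counting distinct root documents per state. The sample documents and the function names `nested_counts` and `root_counts` are made up for illustration.

```python
from collections import Counter

# Hypothetical sample data: each root document holds a list of nested locations.
people = [
    {"last_name": "smith", "location": [
        {"city": "new york", "state": "ny"},
        {"city": "woodbury", "state": "ny"},
    ]},
    {"last_name": "smith", "location": [
        {"city": "albany", "state": "ny"},
        {"city": "los angeles", "state": "ca"},
    ]},
]

def nested_counts(docs):
    """Count every nested location object per state."""
    return Counter(loc["state"] for doc in docs for loc in doc["location"])

def root_counts(docs):
    """Count each root document at most once per state."""
    counts = Counter()
    for doc in docs:
        # A set removes repeated states within the same root document.
        counts.update({loc["state"] for loc in doc["location"]})
    return counts

print(nested_counts(people)["ny"])  # 3 nested objects have state "ny"
print(root_counts(people)["ny"])    # 2 root documents have state "ny"
```

The first number is the kind of count the question is getting; the second is the kind of count it is asking for.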

Mapping:

people = {
  :properties => {
    :location => {
      :type => 'nested',
      :properties => {
        :city => {
          :type => 'string',
          :index => 'not_analyzed',
        },
        :state => {
          :type => 'string',
          :index => 'not_analyzed',
        },
      }
    },
    :last_name => {
      :type => 'string',
      :index => 'not_analyzed'
    }
  }
}

The query is pretty simple:

curl -XGET 'http://localhost:9200/people/_search?pretty&search_type=count' -d '{
  "query" : {
    "bool" : {
      "must" : [
        {"term" : {"last_name" : "smith"}}
      ]
    }
  },
  "aggs" : {
    "location" : {
      "nested" : {
        "path" : "location"
      },
      "aggs" : {
        "state" : {
          "terms" : {"field" : "location.state", "size" : 10}
        }
      }
    }
  }
}'

Response:

{
  "took" : 104,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1248513,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "location" : {
      "doc_count" : 2107012,
      "state" : {
        "buckets" : [ {
          "key" : 6,
          "key_as_string" : "6",
          "doc_count" : 214754
        }, {
          "key" : 12,
          "key_as_string" : "12",
          "doc_count" : 168887
        }, {
          "key" : 48,
          "key_as_string" : "48",
          "doc_count" : 101333
        } ]
      }
    }
  }
}

The nested doc_count (2107012) is much larger than the total number of hits (1248513), so there must be duplicates.

Thanks!

Recommended answer

I think you need a reverse_nested aggregation, because you want to aggregate on a nested value but count the ROOT documents, not the nested ones:

{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "last_name": "smith"
          }
        }
      ]
    }
  },
  "aggs": {
    "location": {
      "nested": {
        "path": "location"
      },
      "aggs": {
        "state": {
          "terms": {
            "field": "location.state",
            "size": 10
          },
          "aggs": {
            "top_reverse_nested": {
              "reverse_nested": {}
            }
          }
        }
      }
    }
  }
}
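If the request is assembled in code rather than written by hand, the same body can be produced by a small helper. The following is a rough Python sketch: the function name `build_root_count_query` and its parameters are hypothetical, and the `"size": 0` entry is an assumption standing in for the `search_type=count` URL parameter used in the question, which later Elasticsearch versions no longer accept.

```python
import json

def build_root_count_query(last_name, path="location",
                           field="location.state", size=10):
    """Build a search body that counts root documents per nested term."""
    return {
        # Assumption: return no hits, only aggregations
        # (replacement for search_type=count on newer versions).
        "size": 0,
        "query": {"bool": {"must": [{"term": {"last_name": last_name}}]}},
        "aggs": {
            "location": {
                # Step into the nested objects under `path`.
                "nested": {"path": path},
                "aggs": {
                    "state": {
                        "terms": {"field": field, "size": size},
                        # Step back out to the root documents per bucket.
                        "aggs": {"top_reverse_nested": {"reverse_nested": {}}},
                    }
                },
            }
        },
    }

body = build_root_count_query("smith")
print(json.dumps(body, indent=2))
```

The printed JSON can be passed as the `-d` payload of the curl command from the question.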

As a result, you would see something like this:

"aggregations": {
      "location": {
         "doc_count": 6,
         "state": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
               {
                  "key": "ny",
                  "doc_count": 4,
                  "top_reverse_nested": {
                     "doc_count": 2
                  }
               },
               {
                  "key": "ca",
                  "doc_count": 2,
                  "top_reverse_nested": {
                     "doc_count": 2
                  }
               }
            ]
         }
      }
   }

What you are looking for is under the top_reverse_nested part. One point here: if I'm not mistaken, "doc_count": 6 is the NESTED document count, and the same goes for each bucket's own doc_count (4 for ny). So don't be misled into thinking these numbers count root documents; they count the nested ones. For a document with three matching nested objects, the count would be 3, not 1.
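To read those numbers programmatically, something along these lines should work. It is a minimal Python sketch that assumes the response has already been parsed into a dict and that the aggregation names are `location`, `state`, and `top_reverse_nested`, as in the query above. The function name `root_doc_counts` is hypothetical, and `sample` simply repeats the example response.

```python
def root_doc_counts(response):
    """Map each state bucket key to its root (not nested) document count."""
    buckets = response["aggregations"]["location"]["state"]["buckets"]
    return {b["key"]: b["top_reverse_nested"]["doc_count"] for b in buckets}

# The example response from this answer, as a Python dict.
sample = {
    "aggregations": {
        "location": {
            "doc_count": 6,
            "state": {
                "doc_count_error_upper_bound": 0,
                "sum_other_doc_count": 0,
                "buckets": [
                    {"key": "ny", "doc_count": 4,
                     "top_reverse_nested": {"doc_count": 2}},
                    {"key": "ca", "doc_count": 2,
                     "top_reverse_nested": {"doc_count": 2}},
                ],
            },
        }
    }
}

print(root_doc_counts(sample))  # {'ny': 2, 'ca': 2}
```

The bucket-level `doc_count` values (4 and 2) are deliberately ignored here, since they count nested objects.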
