How to return the count of unique documents using an Elasticsearch aggregation

Problem description

I ran into a problem: Elasticsearch does not return the count of unique documents when I use only a terms aggregation on a nested field.

Here is an example of our model:

{
    ...,
    "location" : [
        {"city" : "new york", "state" : "ny"},
        {"city" : "woodbury", "state" : "ny"},
        ...
    ],
    ...
}

I want to aggregate on the state field, but this document will be counted twice in the 'ny' bucket, since 'ny' appears twice in the document.

So I'm wondering if there is a way to get the count of distinct documents.
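To make the difference concrete, here is a minimal plain-Ruby sketch that does not talk to Elasticsearch at all. The two `people` documents are hypothetical sample data shaped like the model above; the sketch only contrasts counting nested location objects with counting distinct root documents per state.

```ruby
# Hypothetical in-memory documents shaped like the model above.
people = [
  { "last_name" => "smith",
    "location" => [
      { "city" => "new york", "state" => "ny" },
      { "city" => "woodbury", "state" => "ny" }
    ] },
  { "last_name" => "smith",
    "location" => [
      { "city" => "albany", "state" => "ny" },
      { "city" => "los angeles", "state" => "ca" }
    ] }
]

# Count every nested location object per state. This mirrors what a
# terms aggregation inside a nested aggregation counts.
nested_counts = Hash.new(0)
people.each do |person|
  person["location"].each { |loc| nested_counts[loc["state"]] += 1 }
end

# Count each root document at most once per state. This is the
# number the question is actually after.
root_counts = Hash.new(0)
people.each do |person|
  person["location"].map { |loc| loc["state"] }.uniq.each do |state|
    root_counts[state] += 1
  end
end

puts nested_counts.inspect
puts root_counts.inspect
```

With this sample data, 'ny' is counted three times as a nested object but appears in only two distinct documents.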

Mapping:

people = {
  :properties => {
    :location => {
      :type => 'nested',
      :properties => {
        :city => {
          :type => 'string',
          :index => 'not_analyzed',
        },
        :state => {
          :type => 'string',
          :index => 'not_analyzed',
        },
      }
    },
    :last_name => {
      :type => 'string',
      :index => 'not_analyzed'
    }
  }
}

The query is pretty simple:

curl -XGET 'http://localhost:9200/people/_search?pretty&search_type=count' -d '{
  "query" : {
    "bool" : {
      "must" : [
        {"term" : {"last_name" : "smith"}}
      ]
    }
  },
  "aggs" : {
    "location" : {
      "nested" : {
        "path" : "location"
      },
      "aggs" : {
        "state" : {
          "terms" : {"field" : "location.state", "size" : 10}
        }
      }
    }
  }
}'

Response:

{
  "took" : 104,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1248513,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "location" : {
      "doc_count" : 2107012,
      "state" : {
        "buckets" : [ {
          "key" : 6,
          "key_as_string" : "6",
          "doc_count" : 214754
        }, {
          "key" : 12,
          "key_as_string" : "12",
          "doc_count" : 168887
        }, {
          "key" : 48,
          "key_as_string" : "48",
          "doc_count" : 101333
        } ]
      }
    }
  }
}

The doc_count of the nested aggregation (2107012) is much larger than the total number of hits (1248513), so there must be duplicates.

Thanks!

Recommended answer

I think you need a reverse_nested aggregation, because you want to aggregate on a nested value but count the ROOT documents, not the nested ones:

{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "last_name": "smith"
          }
        }
      ]
    }
  },
  "aggs": {
    "location": {
      "nested": {
        "path": "location"
      },
      "aggs": {
        "state": {
          "terms": {
            "field": "location.state",
            "size": 10
          },
          "aggs": {
            "top_reverse_nested": {
              "reverse_nested": {}
            }
          }
        }
      }
    }
  }
}
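If the request is assembled from Ruby, as the mapping in the question suggests, the same body can be built as a plain hash and serialized with the standard `json` library. This is only a sketch: the helper name `distinct_root_count_aggs` and its parameters are made up for illustration, and sending the body to the cluster is left to whatever HTTP or Elasticsearch client is in use.

```ruby
require "json"

# Hypothetical helper: builds the nested -> terms -> reverse_nested
# aggregation shown above for a given nested path and field.
def distinct_root_count_aggs(path, field, size: 10)
  {
    "location" => {
      "nested" => { "path" => path },
      "aggs" => {
        "state" => {
          "terms" => { "field" => field, "size" => size },
          "aggs" => {
            # Steps back from the nested objects to their root documents.
            "top_reverse_nested" => { "reverse_nested" => {} }
          }
        }
      }
    }
  }
end

body = {
  "query" => {
    "bool" => { "must" => [{ "term" => { "last_name" => "smith" } }] }
  },
  "aggs" => distinct_root_count_aggs("location", "location.state")
}

# Print the JSON request body that would be sent to _search.
puts JSON.pretty_generate(body)
```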

As a result, you would see something like this:

"aggregations": {
      "location": {
         "doc_count": 6,
         "state": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
               {
                  "key": "ny",
                  "doc_count": 4,
                  "top_reverse_nested": {
                     "doc_count": 2
                  }
               },
               {
                  "key": "ca",
                  "doc_count": 2,
                  "top_reverse_nested": {
                     "doc_count": 2
                  }
               }
            ]
         }
      }
   }

What you are looking for is under the top_reverse_nested part. One point here: if I'm not mistaken, "doc_count": 6 is the NESTED document count, so don't confuse these numbers with root document counts; the count is on the nested ones. So, for a document with three matching nested objects, the count would be 3, not 1.
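As a rough illustration of reading those numbers out of the response, the sketch below parses the sample aggregations from above (wrapped in braces so that they form a standalone JSON object, with the error-bound fields left out) and collects both counts per state. It uses only Ruby's standard `json` library; the variable names are arbitrary.

```ruby
require "json"

# The "aggregations" part of the sample response, wrapped in braces
# so that it parses as a standalone JSON object.
response = JSON.parse(<<~JSON)
  {
    "aggregations": {
      "location": {
        "doc_count": 6,
        "state": {
          "buckets": [
            { "key": "ny", "doc_count": 4, "top_reverse_nested": { "doc_count": 2 } },
            { "key": "ca", "doc_count": 2, "top_reverse_nested": { "doc_count": 2 } }
          ]
        }
      }
    }
  }
JSON

buckets = response.dig("aggregations", "location", "state", "buckets")

# Root (distinct) document count per state, from reverse_nested.
root_doc_counts = buckets.to_h do |bucket|
  [bucket["key"], bucket.dig("top_reverse_nested", "doc_count")]
end

# Nested object count per state, from the terms bucket itself.
nested_doc_counts = buckets.to_h { |bucket| [bucket["key"], bucket["doc_count"]] }

puts root_doc_counts.inspect
puts nested_doc_counts.inspect
```

For the 'ny' bucket this gives 2 root documents against 4 nested objects, which is the distinction described above.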
