How to count a number of unique documents by a nested field in Elasticsearch?

Question

I'm trying to count documents with a unique nested field value (and, next, to fetch the documents themselves). Fetching the unique documents seems to work, but when I execute a count request I get the following error:


Suppressed: org.elasticsearch.client.ResponseException: method [POST], host [http://localhost:9200], URI [/package/_count?ignore_throttled=true&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true], status line [HTTP/1.1 400 Bad Request]
{"error":{"root_cause":[{"type":"parsing_exception","reason":"request does not support [collapse]","line":1,"col":216}],"type":"parsing_exception","reason":"request does not support [collapse]","line":1,"col":216},"status":400}

Code:

        BoolQueryBuilder innerTemplNestedBuilder = QueryBuilders.boolQuery();
        NestedQueryBuilder templatesNestedQuery = QueryBuilders.nestedQuery("attachment", innerTemplNestedBuilder, ScoreMode.None);
        BoolQueryBuilder mainQueryBuilder = QueryBuilders.boolQuery().must(templatesNestedQuery);
        if (!isEmpty(templateName)) {
            innerTemplNestedBuilder.filter(QueryBuilders.termQuery("attachment.name", templateName));
        }
        SearchSourceBuilder searchSourceBuilder = SearchSourceBuilder.searchSource()
                    .collapse(new CollapseBuilder("attachment.uuid"))
                    .query(mainQueryBuilder);
        // The next line causes the error:
        long count = client.count(new CountRequest("package").source(searchSourceBuilder), RequestOptions.DEFAULT).getCount();
        // This works:
        SearchResponse searchResponse = client.search(
                    new SearchRequest(
                            new String[] {"package"},
                            searchSourceBuilder.timeout(new TimeValue(20, TimeUnit.SECONDS)).from(offset).size(limit)
                    ).indices("package").searchType(SearchType.DFS_QUERY_THEN_FETCH),
                    RequestOptions.DEFAULT
        );
        return ....;

The overall intention of this approach is to get a page of documents together with the total number of such documents. Maybe another approach for this need already exists. When I try to get the count with a cardinality aggregation, I get zero as the result, and it looks like it doesn't work on nested fields.

Count request:

{
    "query": {
        "bool": {
            "must": [
                {
                    "nested": {
                        "query": {
                            "bool": {
                                "adjust_pure_negative": true,
                                "boost": 1.0
                            }
                        },
                        "path": "attachment",
                        "ignore_unmapped": false,
                        "score_mode": "none",
                        "boost": 1.0
                    }
                }
            ],
            "adjust_pure_negative": true,
            "boost": 1.0
        }
    },
    "collapse": {
        "field": "attachment.uuid"
    }
}

How the mapping is created:

curl -X DELETE "localhost:9200/package?pretty"
curl -X PUT    "localhost:9200/package?include_type_name=true&pretty" -H 'Content-Type: application/json' -d '{
    "settings" :  {
        "number_of_shards" : 1,
        "number_of_replicas" : 1
    }}'
curl -X PUT    "localhost:9200/package/_mappings?pretty" -H 'Content-Type: application/json' -d'
{
      "dynamic": false,
      "properties" : {
        "attachment": {
            "type": "nested",
            "properties": {
                "uuid" : { "type" : "keyword" },
                "name" : { "type" : "text" }
            }
        },
        "uuid" : {
          "type" : "keyword"
        }
      }
}
'

The query generated by the code should look something like this:

curl -X POST "localhost:9200/package/_count?&pretty" -H 'Content-Type: application/json' -d' { "query" :
    {
        "bool": {
            "must": [
                {
                    "nested": {
                        "query": {
                            "bool": {
                                "adjust_pure_negative": true,
                                "boost": 1.0
                            }
                        },
                        "path": "attachment",
                        "ignore_unmapped": false,
                        "score_mode": "none",
                        "boost": 1.0
                    }
                }
            ],
            "adjust_pure_negative": true,
            "boost": 1.0
        }
    },
    "collapse": {
        "field": "attachment.uuid"
    }
}'


Recommended answer

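Collapsing can only be used in the _search context, not in _count. Even in a _search response, the total hit count refers to the matching documents before collapsing, so it does not give the number of distinct groups either.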

Secondly, what does your query even do? It carries a lot of redundant parameters, such as boost: 1. You might as well write:

POST /package/_count?&pretty
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "attachment",
            "query": {
              "match_all": {}
            }
          }
        }
      ]
    }
  }
}

which matches every document that has at least one attachment, so it does not really filter anything :)

Let's imagine 3 documents, 2 of which have the same attachment.uuid value:

[
  {
    "attachment":{
      "uuid":"04144e14-62c3-11ea-bc55-0242ac130003"
    }
  },
  {
    "attachment":{
      "uuid":"04144e14-62c3-11ea-bc55-0242ac130003"
    }
  },
  {
    "attachment":{
      "uuid":"100b9632-62c3-11ea-bc55-0242ac130003"
    }
  }
]

To get the terms breakdown of the UUIDs, run:

GET package/_search
{
  "size": 0,
  "aggs": {
    "nested_uniques": {
      "nested": {
        "path": "attachment"
      },
      "aggs": {
        "subagg": {
          "terms": {
            "field": "attachment.uuid"
          }
        }
      }
    }
  }
}

which yields:

...
{
  "aggregations":{
    "nested_uniques":{
      "doc_count":3,
      "subagg":{
        "doc_count_error_upper_bound":0,
        "sum_other_doc_count":0,
        "buckets":[
          {
            "key":"04144e14-62c3-11ea-bc55-0242ac130003",
            "doc_count":2
          },
          {
            "key":"100b9632-62c3-11ea-bc55-0242ac130003",
            "doc_count":1
          }
        ]
      }
    }
  }
}






To get the parent doc count of unique nested fields, we're going to have to get slightly more clever:

GET package/_search
{
  "size": 0,
  "aggs": {
    "nested_uniques": {
      "nested": {
        "path": "attachment"
      },
      "aggs": {
        "scripted_uniques": {
          "scripted_metric": {
            "init_script": "state.my_map = [:];",
            "map_script": """
              if (doc.containsKey('attachment.uuid')) {
                state.my_map[doc['attachment.uuid'].value.toString()] = 1;
              }
            """,
            "combine_script": """
              def sum = 0;
              for (c in state.my_map.entrySet()) {
                sum += 1
              }
              return sum
            """,
            "reduce_script": """
              def sum = 0;
              for (agg in states) {
                sum += agg;
              }
              return sum;
            """
          }
        }
      }
    }
  }
}

which returns:

...
{
  "aggregations":{
    "nested_uniques":{
      "doc_count":3,
      "scripted_uniques":{
        "value":2
      }
    }
  }
}

and this scripted_uniques: 2 is exactly what you're after.
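
To make the map/combine/reduce flow of the scripted metric easier to follow, here is a minimal in-memory sketch in plain Java. It does not call Elasticsearch; the class and method names are hypothetical, and the "shards" are just lists of attachment.uuid values. It suggests that summing the per-shard counts in reduce_script is exact on a single shard (as in the question's settings, number_of_shards: 1) but could over-count if the same uuid lived on several shards. The reduceByMergingKeys variant is an assumption of mine, not part of the answer above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ScriptedUniquesSketch {

    // Mirrors init_script + map_script + combine_script for one shard:
    // remember each uuid seen on the shard, then return how many keys exist.
    static int combineOnShard(List<String> nestedUuidsOnShard) {
        Map<String, Integer> myMap = new HashMap<>();   // state.my_map = [:]
        for (String uuid : nestedUuidsOnShard) {
            myMap.put(uuid, 1);                         // state.my_map[uuid] = 1
        }
        return myMap.size();                            // one per map entry
    }

    // Mirrors reduce_script as written in the answer: add up per-shard counts.
    static int reduceBySumming(List<List<String>> shards) {
        int sum = 0;
        for (List<String> shard : shards) {
            sum += combineOnShard(shard);
        }
        return sum;
    }

    // Hypothetical variant: merge the keys of all shards before counting.
    static int reduceByMergingKeys(List<List<String>> shards) {
        Set<String> uniques = new HashSet<>();
        for (List<String> shard : shards) {
            uniques.addAll(shard);
        }
        return uniques.size();
    }

    public static void main(String[] args) {
        String a = "04144e14-62c3-11ea-bc55-0242ac130003";
        String b = "100b9632-62c3-11ea-bc55-0242ac130003";

        List<List<String>> oneShard = List.of(List.of(a, a, b));
        List<List<String>> twoShards = List.of(List.of(a, b), List.of(a));

        System.out.println(reduceBySumming(oneShard));      // 2
        System.out.println(reduceBySumming(twoShards));     // 3 (uuid a counted twice)
        System.out.println(reduceByMergingKeys(twoShards)); // 2
    }
}
```

On a multi-shard index, the same idea would mean returning the keys from combine_script and merging them in reduce_script instead of summing numbers.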

Note: I solved this use case with a nested scripted metric aggregation, but if you know of a cleaner approach, I'm more than happy to learn it!
