Elasticsearch 对排序的聚合结果进行分页 [英] Elasticsearch paginating a sorted, aggregated result

查看:37
本文介绍了Elasticsearch 对排序的聚合结果进行分页的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

据我所知,没有办法在 Elasticsearch 中执行以下操作:

SELECT * FROM myindexGROUP BY agg_field1, agg_field2, agg_field3//聚合ORDER BY order_field1, order_field2, order_field3//排序LIMIT 1000, 5000//分页 -- 获取大小为 1000 条记录的第 6 页

这里有一些相关的文档:

  • 我要执行的查询如下(在 sql 中):

    SELECT * FROM testindexGROUP BY store_url、状态、标题ORDER BY 标题 ASC,状态 DESC限制 100 偏移 6000

    换句话说,我希望对聚合结果(具有多个聚合)进行排序并获得偏移量.

    解决方案

    compositeaggregation 在这里可能会有所帮助,因为它允许您按多个字段进行分组,然后对结果进行分页.它不允许您做的唯一一件事是在给定的偏移量处跳转,但您可以在必要时通过从客户端代码进行迭代来做到这一点.

    这里是一个示例查询来做到这一点:

    POST testindex6/_search{大小":0,aggs":{我的桶":{合成的": {大小":100,来源":[{店铺": {条款":{字段":store_url"}}},{地位": {条款":{"字段": "状态",顺序":降序"}}},{标题": {条款":{"字段": "标题",顺序":升序"}}}]},aggs":{命中":{top_hits":{大小":100}}}}}}

    在响应中你会看到 after_key 结构:

     "after_key": {"商店": "http://google.com1087","状态": "OK1087","title": "泰坦尼克号1087"},

    这是您需要在后续查询中使用的某种游标,如下所示:

    <代码>{大小":0,aggs":{我的桶":{合成的": {大小":100,来源":[{店铺": {条款":{字段":store_url"}}},{地位": {条款":{"字段": "状态",顺序":降序"}}},{标题": {条款":{"字段": "标题",顺序":升序"}}}],后": {"商店": "http://google.com1087","状态": "OK1087","title": "泰坦尼克号1087"}},aggs":{命中":{top_hits":{大小":100}}}}}}

    它会给你接下来的 100 个桶.希望这会有所帮助.

    更新:

    如果您想知道总共有多少个存储桶,composite 聚合不会给您这个数字.但是,由于 composite 聚合只不过是其源中所有字段的笛卡尔积,因此您还可以通过返回 ]cardinality](https://www.elastic.co/composite 聚合中使用的每个字段的指南/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html) 并将它们相乘.

     "aggs": {我的桶":{合成的": {...}},商店基数":{基数":{字段":store_url"}},状态基数":{基数":{字段":状态"}},title_cardinality":{基数":{字段":标题"}}}

    然后我们可以通过将我们在 store_cardinalitystatus_cardinalitytitle_cardinality 中得到的数字相乘来得到桶的总数,或者至少很好的近似(它在高基数字段上效果不佳,但在低基数字段上效果很好).

    As far as I'm aware, there isn't a way to do something like the following in Elasticsearch:

    SELECT * FROM myindex
    GROUP BY agg_field1, agg_field2, agg_field3 // aggregation
    ORDER BY order_field1, order_field2, order_field3 // sort
    LIMIT 1000, 5000 // paginate -- get page 6 of size 1000 records
    

    Here are some related documents regarding this:

    Is there a way to do the above in Elasticsearch? The one limitation we have is we will never have more than 10M records, so we (hopefully) shouldn't run into memory errors. My thinking was to do it as follows:

    • Do an aggregation query
    • Get the number of results from it
    • Split it into N segments based on the results and page size we want
    • Rerun the query with the above segments

    What would be the best way to accomplish this? In your answer/suggestion, could you please post some sample code relating to how the above SQL query could be done in ES?


    As an update to this question, here is a public index to test with:

    # 5.6
    e=Elasticsearch('https://search-testinges-fekocjpedql2f3rneuagyukvy4.us-west-1.es.amazonaws.com')
    e.search('testindex')
    
    # 6.4 (same data as above)
    e = Elasticsearch('https://search-testinges6-fycj5kjd7l5uyo6npycuashch4.us-west-1.es.amazonaws.com')
    e.search('testindex6')
    

    It has 10,000 records. Feel free to test with it:

    The query that I'm looking to do is as follows (in sql):

    SELECT * FROM testindex
    GROUP BY store_url, status, title
    ORDER BY title ASC, status DESC
    LIMIT 100 OFFSET 6000
    

    In other words, I'm looking to sort an aggregated result (with multiple aggregations) and get an offset.

    解决方案

    The composite aggregation might help here as it allows you to group by multiple fields and then paginate over the results. The only thing that it doesn't let you do is to jump at a given offset, but you can do that by iterating from your client code if at all necessary.

    So here is a sample query to do that:

    POST testindex6/_search
    {
      "size": 0,
      "aggs": {
        "my_buckets": {
          "composite": {
            "size": 100,
            "sources": [
              {
                "store": {
                  "terms": {
                    "field": "store_url"
                  }
                }
              },
              {
                "status": {
                  "terms": {
                    "field": "status",
                    "order": "desc"
                  }
                }
              },
              {
                "title": {
                  "terms": {
                    "field": "title",
                    "order": "asc"
                  }
                }
              }
            ]
          },
          "aggs": {
            "hits": {
              "top_hits": {
                "size": 100
              }
            }
          }
        }
      }
    }
    

    In the response you'll see and after_key structure:

      "after_key": {
        "store": "http://google.com1087",
        "status": "OK1087",
        "title": "Titanic1087"
      },
    

    It's some kind of cursor that you need to use in your subsequent queries, like this:

    {
      "size": 0,
      "aggs": {
        "my_buckets": {
          "composite": {
            "size": 100,
            "sources": [
              {
                "store": {
                  "terms": {
                    "field": "store_url"
                  }
                }
              },
              {
                "status": {
                  "terms": {
                    "field": "status",
                    "order": "desc"
                  }
                }
              },
              {
                "title": {
                  "terms": {
                    "field": "title",
                    "order": "asc"
                  }
                }
              }
            ],
            "after": {
              "store": "http://google.com1087",
              "status": "OK1087",
              "title": "Titanic1087"
            }
          },
          "aggs": {
            "hits": {
              "top_hits": {
                "size": 100
              }
            }
          }
        }
      }
    }
    

    And it will give you the next 100 buckets. Hopefully this helps.

    UPDATE:

    If you want to know how many buckets in total there is going to be, the composite aggregation won't give you that number. However, since the composite aggregation is nothing else than a cartesian product of all the fields in its sources, you can get a good approximation of that total number by also returning the ]cardinality](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html) of each field used in the composite aggregation and multiplying them together.

      "aggs": {
        "my_buckets": {
          "composite": {
            ...
          }
        },
        "store_cardinality": {
          "cardinality": {
            "field": "store_url"
          }
        },
        "status_cardinality": {
          "cardinality": {
            "field": "status"
          }
        },
        "title_cardinality": {
          "cardinality": {
            "field": "title"
          }
        }
      }
    

    We can then get the total number of buckets by multiplying the figure we get in store_cardinality, status_cardinality and title_cardinality together, or at least a good approximation thereof (it won't work well on high-cardinality fields, but pretty well on low-cardinality ones).

    这篇关于Elasticsearch 对排序的聚合结果进行分页的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆