Elasticsearch: paginating a sorted, aggregated result


Question

As far as I'm aware, there is no way to do something like the following in Elasticsearch:

SELECT * FROM myindex
GROUP BY agg_field1, agg_field2, agg_field3 -- aggregation
ORDER BY order_field1, order_field2, order_field3 -- sort
LIMIT 5000, 1000 -- paginate: page 6 with 1000 records per page

Here are some related documents:

  • https://www.elastic.co/guide/en/elasticsearch/reference/master/search-aggregations-bucket-terms-aggregation.html
  • https://discuss.elastic.co/t/elasticsearch-aggregation-order-by-top-hit-score-with-partitions/102228
  • https://github.com/elastic/elasticsearch/issues/21487

Is there a way to do the above in Elasticsearch? One constraint works in our favor: we will never have more than 10M records, so we (hopefully) shouldn't run into memory errors. My thinking was to do it as follows:


  • Do an aggregation query
  • Get the number of results from it
  • Split it into N segments based on the number of results and the page size we want
  • Rerun the query with the above segments

What would be the best way to accomplish this? In your answer/suggestion, could you please post some sample code showing how the above SQL query could be done in ES?

As an update to this question, here is a public index to test with:

from elasticsearch import Elasticsearch

# Elasticsearch 5.6
e = Elasticsearch('https://search-testinges-fekocjpedql2f3rneuagyukvy4.us-west-1.es.amazonaws.com')
e.search(index='testindex')

# Elasticsearch 6.4 (same data as above)
e = Elasticsearch('https://search-testinges6-fycj5kjd7l5uyo6npycuashch4.us-west-1.es.amazonaws.com')
e.search(index='testindex6')

It has 10,000 records. Feel free to test with it:

[Image: https://i.stack.imgur.com/oLhTU.png]

The query that I'm looking to run is as follows (in SQL):

SELECT * FROM testindex
GROUP BY store_url, status, title
ORDER BY title ASC, status DESC
LIMIT 100 OFFSET 6000

In other words, I'm looking to sort an aggregated result (with multiple aggregations) and fetch it from a given offset.

Answer

The composite aggregation might help here, as it allows you to group by multiple fields and then paginate over the results. The only thing it doesn't let you do is jump to a given offset, but you can achieve that by iterating from your client code if necessary.

Here is a sample query that does this:

POST testindex6/_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "store": {
              "terms": {
                "field": "store_url"
              }
            }
          },
          {
            "status": {
              "terms": {
                "field": "status",
                "order": "desc"
              }
            }
          },
          {
            "title": {
              "terms": {
                "field": "title",
                "order": "asc"
              }
            }
          }
        ]
      },
      "aggs": {
        "hits": {
          "top_hits": {
            "size": 100
          }
        }
      }
    }
  }
}

In the response you'll see an after_key structure:

  "after_key": {
    "store": "http://google.com1087",
    "status": "OK1087",
    "title": "Titanic1087"
  },

It's a kind of cursor that you need to pass in your subsequent queries, like this:

POST testindex6/_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "store": {
              "terms": {
                "field": "store_url"
              }
            }
          },
          {
            "status": {
              "terms": {
                "field": "status",
                "order": "desc"
              }
            }
          },
          {
            "title": {
              "terms": {
                "field": "title",
                "order": "asc"
              }
            }
          }
        ],
        "after": {
          "store": "http://google.com1087",
          "status": "OK1087",
          "title": "Titanic1087"
        }
      },
      "aggs": {
        "hits": {
          "top_hits": {
            "size": 100
          }
        }
      }
    }
  }
}

This will give you the next 100 buckets. Hopefully this helps.
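The client-side iteration mentioned earlier could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not code from the answer: `build_body`, `fetch_buckets`, and the `search` callable are hypothetical names, and `search` stands for anything that takes a request body and returns the parsed response. The `top_hits` sub-aggregation is left out to keep the sketch short. The sources are listed in the same order as in the query above; composite buckets are sorted by their sources in the order listed, so mirroring `ORDER BY title ASC, status DESC` would presumably mean listing `title` first.

```python
import copy

# Same sources as in the sample query; the order of this list determines the sort order.
SOURCES = [
    {"store": {"terms": {"field": "store_url"}}},
    {"status": {"terms": {"field": "status", "order": "desc"}}},
    {"title": {"terms": {"field": "title", "order": "asc"}}},
]


def build_body(page_size, after=None):
    """Build a composite-aggregation request body, optionally resuming after a key."""
    composite = {"size": page_size, "sources": copy.deepcopy(SOURCES)}
    if after is not None:
        composite["after"] = after
    return {"size": 0, "aggs": {"my_buckets": {"composite": composite}}}


def fetch_buckets(search, offset, limit, page_size=100):
    """Return up to `limit` buckets, starting at 0-based bucket number `offset`.

    `search` is a hypothetical callable: request body in, parsed response out.
    Buckets before `offset` are fetched and discarded, because the composite
    aggregation cannot jump directly to an offset.
    """
    collected, skipped, after = [], 0, None
    while len(collected) < limit:
        agg = search(build_body(page_size, after))["aggregations"]["my_buckets"]
        buckets = agg["buckets"]
        if not buckets:
            break  # no more buckets: the offset lies beyond the last one
        for bucket in buckets:
            if skipped < offset:
                skipped += 1
            elif len(collected) < limit:
                collected.append(bucket)
        # Fall back to the last bucket's key if the response carries no after_key.
        after = agg.get("after_key") or buckets[-1]["key"]
    return collected
```

With the Python client this might be called as `fetch_buckets(lambda body: e.search(index='testindex6', body=body), offset=6000, limit=100)`, assuming a client version whose `search` accepts a `body` argument. Note that reaching offset 6000 with a page size of 100 costs 61 round trips, so a larger `page_size` may be worth trying for deep offsets.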

Update

If you want to know how many buckets there will be in total, the composite aggregation won't give you that number. However, since the composite aggregation is nothing other than a cartesian product of all the fields in its sources, you can get a good approximation of that total by also returning the cardinality (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html) of each field used in the composite aggregation and multiplying them together.

  "aggs": {
    "my_buckets": {
      "composite": {
        ...
      }
    },
    "store_cardinality": {
      "cardinality": {
        "field": "store_url"
      }
    },
    "status_cardinality": {
      "cardinality": {
        "field": "status"
      }
    },
    "title_cardinality": {
      "cardinality": {
        "field": "title"
      }
    }
  }

We can then get the total number of buckets by multiplying the values returned in store_cardinality, status_cardinality and title_cardinality together, or at least a good approximation of it (this won't work well on high-cardinality fields, but works pretty well on low-cardinality ones).
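The multiplication itself is simple arithmetic. A minimal Python sketch, assuming the `aggregations` part of the response has been parsed into a dict and that each cardinality result exposes its count under `value`, might look like the following. The function names are hypothetical. Keep in mind that the product is best read as an upper bound: the composite aggregation only emits buckets for combinations that actually occur in the documents, and the cardinality aggregation is itself approximate.

```python
import math

CARDINALITY_NAMES = ("store_cardinality", "status_cardinality", "title_cardinality")


def estimate_total_buckets(aggregations, names=CARDINALITY_NAMES):
    """Multiply the per-field cardinalities to estimate the number of composite buckets."""
    total = 1
    for name in names:
        total *= aggregations[name]["value"]
    return total


def estimate_page_count(total_buckets, page_size):
    """Number of pages needed to cover `total_buckets` at `page_size` buckets per page."""
    return math.ceil(total_buckets / page_size)
```

For instance, cardinalities of 10, 3 and 200 would give an estimate of 6000 buckets, that is, 60 pages of 100 buckets.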
