Understanding "Resources exceeded during query execution" with GROUP EACH BY in BigQuery


Problem Description

I'm writing a background job to automatically process A/B test data in BigQuery, and I'm finding that I'm hitting "Resources exceeded during query execution" when doing large GROUP EACH BY statements. I saw from Resources Exceeded during query execution that reducing the number of groups can make queries succeed, so I split up my data into smaller pieces, but I'm still hitting errors (although less frequently). It would be nice to get a better intuition about what actually causes this error. In particular:

• Does "resources exceeded" always mean that a shard ran out of memory, or could it also mean that the task ran out of time?
• What's the right way to approximate the memory usage and the total memory I have available? Am I correct in assuming each shard tracks about 1/n of the groups and keeps the group key and all aggregates for each group, or is there another way that I should be thinking about it?
• How is the number of shards determined? In particular, do I get fewer shards/resources if I'm querying over a smaller dataset?

The problematic query looks like this (in practice, it's used as a subquery, and the outer query aggregates the results):

      SELECT
          alternative,
          snapshot_time,
          SUM(column_1),
          ...
          SUM(column_139)
      FROM
              my_table
          CROSS JOIN
              [table containing 24 unix timestamps] timestamps
      WHERE last_updated_time < timestamps.snapshot_time
      GROUP EACH BY alternative, user_id, snapshot_time
      

(Here's an example failed job: 124072386181:job_XF6MksqoItHNX94Z6FaKpuktGh4)

I realize this query may be asking for trouble, but in this case, the table is only 22MB and the query results in under a million groups and it's still failing with "resources exceeded". Reducing the number of timestamps to process at once fixes the error, but I'm worried that I'll eventually hit a data scale large enough that this approach as a whole will stop working.

Solution

As you've guessed, BigQuery chooses a number of parallel workers (shards) for GROUP EACH and JOIN EACH queries based on the size of the tables being operated upon. It is a rough heuristic, but in practice, it works pretty well.

What is interesting about your query is that the GROUP EACH is being done over a larger table than the original table because of the expansion in the CROSS JOIN. Because of this, we choose a number of shards that is too small for your query.
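(For a rough sense of the scale involved, using the numbers from the question: crossing the 22 MB table against 24 timestamp rows can expand the data the GROUP EACH BY has to process to as much as 24 × 22 MB ≈ 530 MB, even though the shard count was derived from the original 22 MB.)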

To answer your specific questions:

• Resources exceeded almost always means that a worker ran out of memory. This could be a shard or a mixer, in Dremel terms (mixers are the nodes in the computation tree that aggregate results; GROUP EACH BY pushes aggregation down to the shards, which are the leaves of the computation tree).

• There isn't a good way to approximate the amount of resources available. This changes over time, with the goal that more of your queries should just work.

• The number of shards is determined by the total bytes processed in the query. As you've noticed, this heuristic doesn't work well with joins that expand the underlying data sets. That said, there is active work underway to be smarter about how we pick the number of shards. To give you an idea of scale, your query got scheduled on only 20 shards, which is a tiny fraction of what a larger table would get.

As a workaround, you could save the intermediate result of the CROSS JOIN as a table and run the GROUP EACH BY over that temporary table. That should let BigQuery use the expanded size when picking the number of shards. (If that doesn't work, please let me know; it is possible that we need to tweak our assignment thresholds.)
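
A rough sketch of that two-step workaround, assuming the intermediate result is written to a hypothetical table mydataset.expanded (for example by setting it as the destination table of the first query job with allowLargeResults enabled; the table name and job settings are illustrative, not part of the original answer):

    -- Step 1: materialize the CROSS JOIN expansion. The job is assumed to be
    -- configured with a destination table (here called mydataset.expanded)
    -- and allowLargeResults, since the output can exceed the normal limit.
    SELECT
        alternative,
        user_id,
        timestamps.snapshot_time AS snapshot_time,
        column_1,
        ...
        column_139
    FROM
            my_table
        CROSS JOIN
            [table containing 24 unix timestamps] timestamps
    WHERE last_updated_time < timestamps.snapshot_time

    -- Step 2: aggregate over the materialized table. Because this query scans
    -- the expanded data, the shard count should reflect its full size.
    SELECT
        alternative,
        snapshot_time,
        SUM(column_1),
        ...
        SUM(column_139)
    FROM
        [mydataset.expanded]
    GROUP EACH BY alternative, user_id, snapshot_time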

