Understanding "Resources exceeded during query execution" with GROUP EACH BY in BigQuery

Question

I'm writing a background job to automatically process A/B test data in BigQuery, and I'm finding that I'm hitting "Resources exceeded during query execution" when doing large GROUP EACH BY statements. I saw from Resources Exceeded during query execution that reducing the number of groups can make queries succeed, so I split up my data into smaller pieces, but I'm still hitting errors (although less frequently). It would be nice to get a better intuition about what actually causes this error. In particular:

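  • Does "resources exceeded" always mean that a shard ran out of memory, or could it also mean that the task ran out of time?
  • What is the right way to approximate the memory usage and the total memory available? Am I correct in assuming that each shard tracks about 1/n of the groups and keeps the group key and all aggregates for each of those groups, or should I be thinking about it another way?
  • How is the number of shards determined? In particular, do I get fewer shards/resources if I'm querying a smaller dataset?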

The problematic query looks like this (in practice, it's used as a subquery, and the outer query aggregates the results):

SELECT
    alternative,
    snapshot_time,
    SUM(column_1),
    ...
    SUM(column_139)
FROM
        my_table
    CROSS JOIN
        [table containing 24 unix timestamps] timestamps
WHERE last_updated_time < timestamps.snapshot_time
GROUP EACH BY alternative, user_id, snapshot_time

(Here's an example failed job: 124072386181:job_XF6MksqoItHNX94Z6FaKpuktGh4 )

I realize this query may be asking for trouble, but in this case, the table is only 22MB and the query results in under a million groups and it's still failing with "resources exceeded". Reducing the number of timestamps to process at once fixes the error, but I'm worried that I'll eventually hit a data scale large enough that this approach as a whole will stop working.
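The stopgap described above (processing fewer timestamps per query) can be sketched as a small batching helper. This is only an illustration: the function name `batch_timestamps`, the batch size of 6, and the sample timestamp values are assumptions, not part of the original job.

```python
from typing import Iterator, List, Sequence


def batch_timestamps(timestamps: Sequence[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of `timestamps`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(timestamps), batch_size):
        yield list(timestamps[start:start + batch_size])


# Hypothetical data: 24 hourly unix timestamps (values are made up).
snapshot_times = [1400000000 + hour * 3600 for hour in range(24)]

# Each batch would drive one smaller CROSS JOIN + GROUP EACH BY query.
batches = list(batch_timestamps(snapshot_times, batch_size=6))
print(len(batches), [len(b) for b in batches])
```

Smaller batches shrink the join output per query, but as noted, this only postpones the problem as the base table grows.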

Recommended answer

As you've guessed, BigQuery chooses a number of parallel workers (shards) for GROUP EACH and JOIN EACH queries based on the size of the tables being operated upon. It is a rough heuristic, but in practice, it works pretty well.

What is interesting about your query is that the GROUP EACH is being done over a larger table than the original table because of the expansion in the CROSS JOIN. Because of this, we choose a number of shards that is too small for your query.
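A rough back-of-the-envelope sketch of that expansion follows. The 22 MB table size and the 24 timestamps come from the question; the `match_fraction` parameter is an assumption added for illustration, since the WHERE clause discards some row/timestamp pairs and the true fraction is unknown. The estimate also ignores the extra snapshot_time column carried by each joined row.

```python
def expanded_size_mb(table_mb: float, n_timestamps: int, match_fraction: float = 1.0) -> float:
    """Estimate the size of the CROSS JOIN output that GROUP EACH BY must process.

    match_fraction is the assumed share of (row, timestamp) pairs that survive
    the WHERE filter; 1.0 gives the worst case.
    """
    if not 0.0 <= match_fraction <= 1.0:
        raise ValueError("match_fraction must be between 0 and 1")
    return table_mb * n_timestamps * match_fraction


worst_case = expanded_size_mb(22, 24)
print(f"Shard count is sized for ~22 MB, but grouping sees up to ~{worst_case:.0f} MB")
```

In the worst case the grouping stage sees roughly 24 times the data that the shard-count heuristic accounted for.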

To answer your specific questions:

  • Resources exceeded almost always means that a worker ran out of memory. This could be a shard or a mixer, in Dremel terms (mixers are the nodes in the computation tree that aggregate results. GROUP EACH BY pushes aggregation down to the shards, which are the leaves of the computation tree).

  • There isn't a good way to approximate the amount of resources available. This changes over time, with the goal that more of your queries should just work.

  • The number of shards is determined by the total bytes processed in the query. As you've noticed, this heuristic doesn't work well with joins that expand the underlying data sets. That said, there is active work underway to be smarter about how we pick the number of shards. To give you an idea of scale, your query got scheduled on only 20 shards, which is a tiny fraction of what a larger table would get.

As a workaround, you could save the intermediate result of the CROSS JOIN as a table, and then run the GROUP EACH BY over that temporary table. That should let BigQuery use the expanded size when picking the number of shards. (If that doesn't work, please let me know; it is possible that we need to tweak our assignment thresholds.)
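A minimal sketch of that two-step workaround, written as a helper that only builds the two legacy SQL statements. Everything named here is hypothetical: the function `build_two_stage_queries`, the table names, and the `sum_` aliases are assumptions for illustration. Running the first statement with a destination table set to the temporary table is left to whatever client or tool you use.

```python
from typing import List, Tuple


def build_two_stage_queries(
    source_table: str,
    timestamps_table: str,
    temp_table: str,
    sum_columns: List[str],
) -> Tuple[str, str]:
    """Return (materialize_sql, aggregate_sql) in BigQuery legacy SQL."""
    # Step 1: materialize the CROSS JOIN so its expanded size is visible
    # to BigQuery before any grouping happens.
    materialize_sql = (
        "SELECT alternative, user_id, timestamps.snapshot_time AS snapshot_time, "
        + ", ".join(sum_columns) + "\n"
        + f"FROM [{source_table}] CROSS JOIN [{timestamps_table}] timestamps\n"
        + "WHERE last_updated_time < timestamps.snapshot_time"
    )
    # Step 2: group over the saved table, with no join in the same query.
    sums = ", ".join(f"SUM({col}) AS sum_{col}" for col in sum_columns)
    aggregate_sql = (
        f"SELECT alternative, snapshot_time, {sums}\n"
        f"FROM [{temp_table}]\n"
        "GROUP EACH BY alternative, user_id, snapshot_time"
    )
    return materialize_sql, aggregate_sql


# Hypothetical table names; only two of the SUM columns are shown.
step1, step2 = build_two_stage_queries(
    "mydataset.my_table", "mydataset.timestamps", "mydataset.tmp_joined",
    ["column_1", "column_2"],
)
print(step1)
print(step2)
```

Splitting the work this way means the second query's input is the already-expanded table, which is the size the answer says the shard heuristic should then use.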
