What causes "resources exceeded" in BigQuery?

Question

My query failed with the error "resources exceeded". What causes this error, and how can I fix it?

Recommended answer

Update (2016-03-16): For most queries, EACH is no longer required, and it may actually increase the likelihood of seeing this error. If you omit the EACH keyword from every JOIN and GROUP BY in your query, the query engine will now dynamically optimize your query to eliminate this error.

There are still corner cases where specifying the EACH keyword can make a query run (or run faster), but generally speaking the BigQuery team recommends that you try your query without EACH first. Pretty soon, the EACH keyword will become a complete no-op.
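As a rough illustration of the advice above, here is a minimal Python sketch that removes EACH from a legacy SQL query string before you submit it. The helper name `strip_each` and the table names are hypothetical, and the regular expression is a naive text rewrite that does not skip string literals or comments, so treat it as a starting point rather than a safe rewriter.

```python
import re

# Matches "JOIN EACH" and "GROUP EACH BY" in legacy BigQuery SQL.
# Naive text rewrite: it does not skip string literals or comments.
_EACH = re.compile(r"\b(JOIN|GROUP)\s+EACH\b", re.IGNORECASE)


def strip_each(query: str) -> str:
    """Return the query with EACH removed from JOIN and GROUP BY (hypothetical helper)."""
    return _EACH.sub(r"\1", query)


# Hypothetical table names, used only for illustration.
legacy_query = (
    "SELECT a.user_id, COUNT(*) AS n "
    "FROM [my_dataset.events] AS a "
    "JOIN EACH [my_dataset.users] AS b ON a.user_id = b.user_id "
    "GROUP EACH BY a.user_id"
)

print(strip_each(legacy_query))
```

If the query still fails without EACH, the original answer below still applies.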

Original answer: When you use the EACH keyword in JOIN EACH or GROUP EACH BY, or when you use a PARTITION BY clause, BigQuery partitions ("shuffles") your data on the fly according to the join keys or group keys, which allows each worker task to perform its portion of the join or aggregation locally.

The resources exceeded error occurs when one such worker gets too much data and runs over its limit. Generally speaking, the reasons for this error fall into two categories:

  1. Skew: The data is heavily skewed toward one key value (say, a "guest" user ID or a null key), which means that one worker gets all the records for that key and gets overloaded.

  2. Mismatch in data size and worker count: You have too much data for the number of workers that BigQuery assigned to your query.
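To make the skew case concrete, the following Python sketch simulates routing rows to workers by a hash of the key. It is only a toy model: `zlib.crc32` stands in for whatever hash the engine really uses, and the row counts and worker counts are made-up numbers.

```python
from collections import Counter
import zlib


def shuffle_counts(keys, num_workers):
    """Count how many rows each worker receives when rows are routed by hash(key)."""
    counts = Counter()
    for key in keys:
        # crc32 is an assumption; it stands in for the engine's real hash function.
        worker = zlib.crc32(str(key).encode("utf-8")) % num_workers
        counts[worker] += 1
    return counts


# 1,000 distinct users with one row each, plus 9,000 rows for a single "guest" key.
keys = [f"user_{i}" for i in range(1000)] + ["guest"] * 9000

counts = shuffle_counts(keys, num_workers=10)
busiest_worker, busiest_rows = counts.most_common(1)[0]
print(f"busiest worker holds {busiest_rows} of {len(keys)} rows")

# Adding workers does not help: every "guest" row still lands on one worker.
more_workers = shuffle_counts(keys, num_workers=100)
print(f"with 100 workers the busiest still holds {max(more_workers.values())} rows")
```

Because all rows with the same key must go to the same worker, the worker that owns the skewed key keeps at least 9,000 rows no matter how many workers are available.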

We are working on a number of improvements to help us cope with both scenarios so that you don't need to worry about these issues. For now, though, you can work around the problem with one of the following approaches:

  1. Filter out the skewed keys. If your data is skewed because half of your join key values are actually null, you can filter those rows out by adding WHERE key IS NOT NULL before the join.

  2. Reduce the amount of data processed. Filter each side of the join with WHERE ABS(HASH(key)) % 5 == 0 to apply the join to only 1/5 of the data (or whatever fraction you want), and then do the same for == 1, == 2, == 3, == 4 in separate queries. You are manually sharding the data into smaller chunks to make the query go through, but note that you pay 5x as much because you query the same data 5 times.

  3. Revisit your query. Maybe you can build your query in a completely different way, or compute some intermediate results, to get the answer you want.
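The first two workarounds can be sketched as a small Python helper that builds one WHERE clause per shard. The function name `sharded_filters` and the column name `user_id` are hypothetical, and combining the null filter with the hash filter in a single clause is a choice made for this sketch, not something the answer prescribes.

```python
from typing import List


def sharded_filters(key: str, num_shards: int) -> List[str]:
    """Build one legacy-SQL WHERE clause per shard of the key space (hypothetical helper)."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    return [
        # Drop null keys, then keep only the rows whose hashed key falls in this shard.
        f"WHERE {key} IS NOT NULL AND ABS(HASH({key})) % {num_shards} == {shard}"
        for shard in range(num_shards)
    ]


# One clause per query; apply the same clause to both sides of the join.
for clause in sharded_filters("user_id", 5):
    print(clause)
```

Each generated clause belongs in its own query, so five shards mean five queries over the same tables, which is where the 5x cost mentioned above comes from.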
