What causes "resources exceeded" in BigQuery?


Question

    My query failed with the error "resources exceeded". What causes this error, and how can I fix it?

    Answer

    Update (2016-03-16): For most queries, EACH is no longer required, and may actually increase the likelihood of seeing this error. If you omit the EACH keyword from every JOIN and GROUP BY in your query, the query engine will now dynamically optimize your query to eliminate this error.

    There are still corner cases where specifying the EACH keyword can make a query run (or run faster), but generally speaking the BigQuery team recommends that you try your query without EACH first. Pretty soon, the EACH keyword will become a complete no-op.


    Original answer: When you use the EACH keyword in JOIN EACH or GROUP EACH BY, or when you use a PARTITION BY clause, BigQuery partitions ("shuffles") your data on the fly according to the join keys or group keys, which allows each worker task to perform its portion of the join or aggregation locally.

    The resources exceeded error occurs when one such worker gets too much data and runs over its limit. Generally speaking, the reasons for this error fall into two categories:

    1. Skew: The data is heavily skewed toward one key value (say, a "guest" user ID or a null key), which means that one worker gets all the records for that key and gets overloaded.

    2. Mismatch in data size and worker count: You have too much data for the number of workers that BigQuery assigned your query.
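Both failure modes can be illustrated with a toy model of the shuffle. This is only a sketch, not BigQuery's actual partitioning: it assumes integer keys, uses `key % num_workers` as a stand-in for the real hash, sends all null keys to one worker, and invents a per-worker `row_limit`. All function names are hypothetical.

```python
from collections import Counter


def assign_worker(key, num_workers):
    """Pick a worker for a key. `key % num_workers` stands in for a real hash."""
    if key is None:
        return 0  # assumption: every null key lands on the same worker
    return key % num_workers


def worker_loads(keys, num_workers):
    """Count how many rows each worker receives after partitioning by key."""
    loads = Counter({worker: 0 for worker in range(num_workers)})
    for key in keys:
        loads[assign_worker(key, num_workers)] += 1
    return loads


def overloaded_workers(keys, num_workers, row_limit):
    """Return the workers whose row count exceeds the (hypothetical) limit."""
    loads = worker_loads(keys, num_workers)
    return sorted(worker for worker, rows in loads.items() if rows > row_limit)


# Skew: 900 of 1000 rows share the null key, so one worker takes almost everything.
skewed_keys = [None] * 900 + list(range(100))
print(overloaded_workers(skewed_keys, num_workers=4, row_limit=500))

# Mismatch: evenly spread keys fit on 4 workers but not on 1.
uniform_keys = list(range(1000))
print(overloaded_workers(uniform_keys, num_workers=4, row_limit=500))
print(overloaded_workers(uniform_keys, num_workers=1, row_limit=500))
```

In the skewed case, adding workers would not help, because every row for the hot key still goes to a single worker. In the mismatch case, the same data succeeds once it is spread over enough workers.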

    We are working on a number of improvements to help us cope with both scenarios so that you don't need to worry about these issues. For now, though, you can work around the problem with one of the following approaches:

    1. Filter out skewed keys. If your data is skewed because half of your join key values are actually null, you could filter those out by adding WHERE key IS NOT NULL prior to the join.

    2. Reduce the amount of data processed. Filter each side of the join with WHERE ABS(HASH(key)) % 5 == 0 to apply the join to only 1/5 of the data (or whatever fraction you want), and then do the same for == 1, == 2, == 3, == 4 in separate queries. You're manually sharding the data into smaller chunks to make the query go through, but note that you pay 5x as much because you queried the same data 5 times.

    3. Revisit your query. Maybe you can build your query in a completely different way, or compute some intermediate results, to get the answer you want.
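The first two workarounds can be combined by generating one query per hash bucket. The following is a minimal sketch that only builds legacy SQL strings; it does not run them. The table names `mydataset.visits` and `mydataset.users`, the column `user_id`, and the helper names are hypothetical, and the join shape is just one plausible example.

```python
def shard_filter(key, num_shards, shard):
    """Drop null keys and keep only the rows whose key hashes into this shard."""
    return f"{key} IS NOT NULL AND ABS(HASH({key})) % {num_shards} == {shard}"


def shard_queries(left_table, right_table, key, num_shards=5):
    """Build one legacy SQL join per shard of the join key."""
    template = (
        "SELECT a.{key} AS {key}\n"
        "FROM (SELECT {key} FROM [{left}] WHERE {cond}) AS a\n"
        "JOIN (SELECT {key} FROM [{right}] WHERE {cond}) AS b\n"
        "ON a.{key} = b.{key}"
    )
    queries = []
    for shard in range(num_shards):
        # The same filter is applied to both sides so matching keys share a shard.
        cond = shard_filter(key, num_shards, shard)
        queries.append(
            template.format(key=key, left=left_table, right=right_table, cond=cond)
        )
    return queries


queries = shard_queries("mydataset.visits", "mydataset.users", "user_id")
print(queries[0])
```

Each generated query scans both tables in full, which is why running all five costs roughly five times as much as the single original query.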
