Fixing "Resources exceeded" in BigQuery and making it run faster


Problem Description


GDELT's Kalev Leetaru ran into this problem: the following query runs in BigQuery when analyzing a single month, but it won't run over a whole year.

SELECT Source, Target, count, RATIO_TO_REPORT(count) OVER() Weight
FROM (
  SELECT a.name Source, b.name Target, COUNT(*) AS COUNT
  FROM (FLATTEN(
      SELECT
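        -- for each location in V2Locations, build a "lat#lon" key with both coordinates rounded to 3 decimals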
        GKGRECORDID, CONCAT( STRING(ROUND(FLOAT(IFNULL(REGEXP_EXTRACT(SPLIT(V2Locations,';'),r'^[2-5]#.*?#.*?#.*?#.*?#(.*?)#'), '0')), 3)), '#', STRING(ROUND(FLOAT(IFNULL(REGEXP_EXTRACT(SPLIT(V2Locations,';'),r'^[2-5]#.*?#.*?#.*?#.*?#.*?#(.*?)#'), '0')), 3)) ) AS name
      FROM [gdelt-bq:gdeltv2.gkg]
      WHERE DATE>20150100000000 and DATE<20151299999999, name)) a
  JOIN EACH (
    SELECT 
      GKGRECORDID, CONCAT( STRING(ROUND(FLOAT(IFNULL(REGEXP_EXTRACT(SPLIT(V2Locations,';'),r'^[2-5]#.*?#.*?#.*?#.*?#(.*?)#'), '0')), 3)), '#', STRING(ROUND(FLOAT(IFNULL(REGEXP_EXTRACT(SPLIT(V2Locations,';'),r'^[2-5]#.*?#.*?#.*?#.*?#.*?#(.*?)#'), '0')), 3)) ) AS name
    FROM [gdelt-bq:gdeltv2.gkg]
    WHERE DATE>20150100000000 and DATE<20151299999999 ) b
  ON a.GKGRECORDID=b.GKGRECORDID
  WHERE a.name<b.name
    AND a.name != '0.000000#0.000000'
    AND b.name != '0.000000#0.000000'
  GROUP EACH BY 1, 2
  ORDER BY 3 DESC )
WHERE count > 50
LIMIT 500000

"Resources exceeded during query execution."

How can we fix this?

Solution

First, a note on cost optimization: BigQuery charges per column scanned, and this query will scan over 72 GB. The GDELT gkg table stores its whole history in a single table - we can optimize costs by creating yearly tables instead of one monolithic table.
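One way to materialize such a yearly table is a query with a destination table. A minimal sketch using the bq command-line tool, where mydataset.gkg_2015 is a hypothetical destination dataset and table of your own (note this one-time SELECT * still scans the full table, but later queries over it only scan one year):

bq query --use_legacy_sql=true --allow_large_results --noflatten_results \
  --destination_table=mydataset.gkg_2015 \
  "SELECT * FROM [gdelt-bq:gdeltv2.gkg] WHERE DATE>=20150101000000 AND DATE<20160101000000"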

Now, how can we fix this query so it runs over a full year? "Resources exceeded during query execution" usually comes from non-scalable functions. For example:

  • RATIO_TO_REPORT(COUNT) OVER() won't scale: OVER() functions run over the whole result set, allowing us to compute totals and how much of the total each row contributes - but for this to run, we need the whole result set to fit in one VM. The good news is that OVER() does scale when the data is partitioned, for example with an OVER(PARTITION BY month) - then each partition, rather than the whole result set, needs to fit in a VM (see the sketch after this list). For this query we will simply remove this result column, for simplicity.

  • ORDER BY won't scale: to sort results, we also need all of them to fit on one VM. That's why '--allow-large-results' won't allow running an ORDER BY step: each VM processes and outputs results in parallel.
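To illustrate the partitioned OVER() mentioned above, here is a minimal sketch of my own (not part of the fix below) that computes each day's share of its month's GKG records - each month-sized partition, not the full result set, has to fit in a VM:

SELECT month, day, c,
       RATIO_TO_REPORT(c) OVER(PARTITION BY month) AS Weight  -- share within the month
FROM (
  SELECT INTEGER(DATE/100000000) AS month,  -- e.g. 201503, from the 14-digit DATE
         INTEGER(DATE/1000000) AS day,      -- e.g. 20150302
         COUNT(*) AS c
  FROM [gdelt-bq:gdeltv2.gkg]
  WHERE DATE>20150100000000 and DATE<20151299999999
  GROUP EACH BY 1, 2)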

In this query we have an easy way to deal with the ORDER BY scalability: we move the later filter "WHERE count > 50" earlier into the process. Instead of sorting all the results and then keeping the ones with a count > 50, we change the filter to a HAVING, so it runs before the ORDER BY:

SELECT Source, Target, count
FROM (
  SELECT a.name Source, b.name Target, COUNT(*) AS COUNT
  FROM (FLATTEN(
      SELECT
        GKGRECORDID, CONCAT( STRING(ROUND(FLOAT(IFNULL(REGEXP_EXTRACT(SPLIT(V2Locations,';'),r'^[2-5]#.*?#.*?#.*?#.*?#(.*?)#'), '0')), 3)), '#', STRING(ROUND(FLOAT(IFNULL(REGEXP_EXTRACT(SPLIT(V2Locations,';'),r'^[2-5]#.*?#.*?#.*?#.*?#.*?#(.*?)#'), '0')), 3)) ) AS name
      FROM [gdelt-bq:gdeltv2.gkg]
      WHERE DATE>20150100000000 and DATE<20151299999999,name)) a
  JOIN EACH (
    SELECT 
      GKGRECORDID, CONCAT( STRING(ROUND(FLOAT(IFNULL(REGEXP_EXTRACT(SPLIT(V2Locations,';'),r'^[2-5]#.*?#.*?#.*?#.*?#(.*?)#'), '0')), 3)), '#', STRING(ROUND(FLOAT(IFNULL(REGEXP_EXTRACT(SPLIT(V2Locations,';'),r'^[2-5]#.*?#.*?#.*?#.*?#.*?#(.*?)#'), '0')), 3)) ) AS name
    FROM [gdelt-bq:gdeltv2.gkg]
    WHERE DATE>20150100000000 and DATE<20151299999999 ) b
  ON a.GKGRECORDID=b.GKGRECORDID
  WHERE a.name<b.name
    AND a.name != '0.000000#0.000000'
    AND b.name != '0.000000#0.000000'
  GROUP EACH BY 1, 2
  HAVING count>50
  ORDER BY 3 DESC )
LIMIT 500000

And now the query runs over a full year of data!

Let's look at the explanation stats:

We can see that the 188 million row table was read twice: the first subquery produced 1.5 billion rows (because of the FLATTEN), and the second one filtered out the rows not in 2015 (note that this table only started storing data in early 2015).
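That row multiplication comes from FLATTEN: SPLIT(V2Locations, ';') produces a repeated field, and FLATTEN emits one row per repeated value, so a record mentioning N locations becomes N rows. A minimal sketch of the mechanism (my own illustration, simplified from the query above):

SELECT GKGRECORDID, name
FROM (FLATTEN(
    SELECT GKGRECORDID, SPLIT(V2Locations, ';') AS name
    FROM [gdelt-bq:gdeltv2.gkg]
    WHERE DATE>20150100000000 and DATE<20150200000000, name))
LIMIT 10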

Stage 3 is interesting: joining both subqueries produced 3 billion rows! Those got reduced to 500 million with the FILTER and AGGREGATE steps.

Can we do better?

Yes! Let's move the two WHERE a.name != '....' filters to an earlier "HAVING":

SELECT Source, Target, count
FROM (
  SELECT a.name Source, b.name Target, COUNT(*) AS COUNT
  FROM (FLATTEN(
      SELECT
        GKGRECORDID, CONCAT( STRING(ROUND(FLOAT(IFNULL(REGEXP_EXTRACT(SPLIT(V2Locations,';'),r'^[2-5]#.*?#.*?#.*?#.*?#(.*?)#'), '0')), 3)), '#', STRING(ROUND(FLOAT(IFNULL(REGEXP_EXTRACT(SPLIT(V2Locations,';'),r'^[2-5]#.*?#.*?#.*?#.*?#.*?#(.*?)#'), '0')), 3)) ) AS name
      FROM [gdelt-bq:gdeltv2.gkg]
      WHERE DATE>20150100000000 and DATE<20151299999999
      HAVING name != '0.000000#0.000000',name)) a
  JOIN EACH (
    SELECT 
      GKGRECORDID, CONCAT( STRING(ROUND(FLOAT(IFNULL(REGEXP_EXTRACT(SPLIT(V2Locations,';'),r'^[2-5]#.*?#.*?#.*?#.*?#(.*?)#'), '0')), 3)), '#', STRING(ROUND(FLOAT(IFNULL(REGEXP_EXTRACT(SPLIT(V2Locations,';'),r'^[2-5]#.*?#.*?#.*?#.*?#.*?#(.*?)#'), '0')), 3)) ) AS name
    FROM [gdelt-bq:gdeltv2.gkg]
    WHERE DATE>20150100000000 and DATE<20151299999999 
    HAVING name != '0.000000#0.000000') b
  ON a.GKGRECORDID=b.GKGRECORDID
  WHERE a.name<b.name
  GROUP EACH BY 1, 2
  HAVING count>50
  ORDER BY 3 DESC )
LIMIT 500000

This runs even faster!

Let's look at the explanation stats:

See? By moving the filtering to a step before the join, stage 3 only has to go through 1 billion rows instead of 3 billion. Much faster (even though BigQuery, as you can check for yourself, is capable of going over 3 billion JOIN-generated rows in a short amount of time).

And what was this query for?

Look at the beautiful results here: http://blog.gdeltproject.org/a-city-level-network-diagram-of-2015-in-one-line-of-sql/

