Google BigQuery optimization with subquery in WHERE clause

Question

I am attempting to set up a query that selects a subset of data from a range of daily partitions of Google Analytics session data and writes it to a Google BigQuery staging table. The challenge is to reduce the processing cost when using a subquery in the WHERE clause.

Google Analytics data from the query are to be appended to a staging table before being processed and loaded into the target data table (my-data-table). The main query is given in two forms below. The first is hard-coded. The second reflects the preferred form. The upper bound on _TABLE_SUFFIX is hard-coded in both to simplify the query. The objective is to use MAX(date) from my-data-table, where date has the form YYYYMMDD, as the lower bound on the ga_sessions_* daily partitions. The query has been simplified for presentation here but is believed to contain all necessary elements.

The aggregate query (SELECT MAX(date) FROM my-project-12345.dataset.my-data-table) returns the value '20201015' and processes 202 KB. The amount of data processed by the main query differs significantly depending on whether its WHERE clause uses the returned value explicitly (as '20201015') or uses the SELECT MAX() subquery: 2.3 GB for the explicit value vs 138.1 GB for the SELECT MAX() expression.

Is there an optimization, plan, or directive that can be applied to the preferred form of the main query to reduce the data processing cost? Thank you for any assistance.

Main query (hard-coded version, processes 2.3 GB)

SELECT
  GA.date, 
  GA.field1, 
  hits.field2, 
  hits.field3
FROM 
  `my-project-12345.dataset.ga_sessions_*` AS GA, UNNEST(GA.hits) AS hits
WHERE 
  hits.type IN ('PAGE', 'EVENT')
  AND hits.field0 = 'some value'
  AND _TABLE_SUFFIX > '20201015'
  AND _TABLE_SUFFIX < '20201025' 

Main query (preferred form, processes 138.1 GB without optimization)

SELECT
  GA.date, 
  GA.field1, 
  hits.field2, 
  hits.field3
FROM 
  `my-project-12345.dataset.ga_sessions_*` AS GA, UNNEST(GA.hits) AS hits
WHERE 
  hits.type IN ('PAGE', 'EVENT')
  AND hits.field0 = 'some value'
  AND _TABLE_SUFFIX > (SELECT MAX(date) FROM `my-project-12345.dataset.my-data-table`)
  AND _TABLE_SUFFIX < '20201025' 

Recommended answer

You can use scripting for this.

The "trick" is to pre-calculate the value:

DECLARE start_date STRING;
SET start_date = (SELECT MAX(date) FROM `my-project-12345.dataset.my-data-table`);    

Once the value is assigned to a variable, use that variable in the WHERE clause of the main query. In this case the query runs as the cost-effective version:

AND _TABLE_SUFFIX > start_date
AND _TABLE_SUFFIX < '20201025' 
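
For reference, the two fragments above can be combined into a single BigQuery script, sketched below. This is only an assembly of the question's simplified query and the answer's variable, not a tested implementation. It assumes that date in my-data-table is a STRING in YYYYMMDD form, as the question describes, and the table and field names are the question's placeholders.

```sql
-- Step 1: pre-compute the lower bound once.
-- Assumption: `date` is a STRING formatted as YYYYMMDD, as stated in the question.
DECLARE start_date STRING;
SET start_date = (
  SELECT MAX(date)
  FROM `my-project-12345.dataset.my-data-table`
);

-- Step 2: run the main query with the variable in place of the subquery,
-- so the suffix filter is compared against an already-computed value.
SELECT
  GA.date,
  GA.field1,
  hits.field2,
  hits.field3
FROM
  `my-project-12345.dataset.ga_sessions_*` AS GA, UNNEST(GA.hits) AS hits
WHERE
  hits.type IN ('PAGE', 'EVENT')
  AND hits.field0 = 'some value'
  AND _TABLE_SUFFIX > start_date
  AND _TABLE_SUFFIX < '20201025';
```

Both statements need to be submitted together as one script, since the variable only exists for the duration of that script. The bytes processed for each statement can be checked in the job details to confirm that the main query scans only the intended daily tables.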
