How to define a BQ view that references wildcard tables within subqueries?

Question

I have a query that I wish to turn into a view. The query depends on several subqueries and looks something like this:

WITH subquery1 AS (
  SELECT
    date,
    key1,
    key2,
    other_fields,
    ....
  FROM ...
), subquery2 AS (
  SELECT
    date,
    key1,
    key2,
    other_fields,
    ....
  FROM ...
)

SELECT
  date,
  key1,
  key2,
  other_fields...
FROM table
JOIN subquery1 USING(key1, key2)
JOIN subquery2 USING(key1, key2)

Each of the subqueries references the same Google Analytics session data, which is split into 'date partitioned' tables: tables that share a table prefix and end in a 'yyyymmdd' suffix, which is also referenced in a date field.

I want the view to select only the relevant date table partitions, both within all of the subqueries and when querying the table directly, as in the query shown.

I have no working code - I am starting to think that this isn't possible, possibly because it involves correlated sub-queries.

Is this not possible? Or if it is possible, what sort of structure/syntax would achieve it?

Update

The motivation is to limit the amount of data that the query selects. By picking just a few recent dates, I can reduce the amount of data selected from 500 GB to tens of GB.

To reiterate: these aren't formally date-partitioned tables. They are a series of individual tables with a common prefix, all ending in yyyymmdd format. I have no problem selecting a range of these, but I don't think that helps me define a view.
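
For readers unfamiliar with the pattern, selecting a range of such tables directly might look like the sketch below. It is only an illustration: the project, dataset, and date range are placeholders, and the row count stands in for whatever fields are actually needed.

```sql
-- Sketch only: project, dataset and the date range are placeholders.
SELECT
  _TABLE_SUFFIX AS date,
  COUNT(*) AS row_count
FROM `project.dataset.ga_sessions_*`
-- Constant bounds on _TABLE_SUFFIX restrict which tables are scanned.
WHERE _TABLE_SUFFIX BETWEEN '20180625' AND '20180701'
GROUP BY date
ORDER BY date
```

Because the bounds are literal strings, only the seven daily tables in that range should be read when the wildcard table is queried directly like this.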

Update 2

Here's what I've tried, but querying with varying dates doesn't affect the amount of data selected:

WITH revenue AS(
  SELECT DISTINCT 
    key1,
    key2,
    date,
    transactions,
    revenue
  FROM `project.dataset.revenue` AS main
)

SELECT
  _TABLE_SUFFIX AS date,
  main.key1 AS m_fvi,
  main.key2 AS m_vi,
  revenue.transactions,
  revenue.revenue
FROM `project.dataset.ga_sessions_*` AS main
LEFT JOIN revenue USING(key1, key2)
WHERE REGEXP_CONTAINS(_TABLE_SUFFIX, r'[0-9]{8}')

Consuming query:

SELECT  
  *
FROM `project.dataset.view`
WHERE date = '20180701'

This selects the same quantity of data as:

SELECT  
  *
FROM `project.dataset.view`

It should select roughly one thousandth of the data, in line with the number of 'partitions' within ga_sessions_*.

Recommended answer

In a comment, user @ElliottBrossard proposed using _TABLE_SUFFIX. This is the correct approach for what you are trying to do. The confusion arises from the billing you expect to see when querying the view.

The view is created correctly using _TABLE_SUFFIX. It gets the correct data from the correct tables and exposes it through a new view. Now, the query with the WHERE clause is run:

SELECT *
FROM `project.dataset.view`
WHERE date = '20180701'

This query will read the whole view consisting of all of the tables that matched the regex REGEXP_CONTAINS(_TABLE_SUFFIX, r'[0-9]{8}') and then filter out the results that didn't match the WHERE clause. The second query does essentially the same minus the filtering. This means that the amount of data that BigQuery is processing is the same in both queries, resulting in the same amount being billed for both of them.

As I understand it, a solution would be to create one view per table, each named with a suffix like 'yyyymmdd'. That way, when you query the view for the date you need, BigQuery processes less data.
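
The per-table view idea could be sketched as follows for a single day. This is only an illustration under assumptions: the view name view_20180701 is hypothetical, and the body simply reuses the query from the question with the wildcard swapped for one concrete daily table.

```sql
-- Sketch only: view_20180701 is a hypothetical name; one such view
-- would be needed for each daily table.
CREATE OR REPLACE VIEW `project.dataset.view_20180701` AS
WITH revenue AS (
  SELECT DISTINCT
    key1,
    key2,
    date,
    transactions,
    revenue
  FROM `project.dataset.revenue`
)
SELECT
  -- The date is fixed per view, so it is emitted as a literal.
  '20180701' AS date,
  main.key1 AS m_fvi,
  main.key2 AS m_vi,
  revenue.transactions,
  revenue.revenue
-- A single concrete table instead of the ga_sessions_* wildcard.
FROM `project.dataset.ga_sessions_20180701` AS main
LEFT JOIN revenue USING (key1, key2)
```

Querying `project.dataset.view_20180701` then touches only that day's sessions table. The trade-off is maintenance: a new view has to be created for every new daily table, and a query spanning several days has to combine several views.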
