BigQuery Integer Partitions - can I use the results of another query to get a list of the partitions to access?

Question

I have a large table using integer partitions (~1 TB). I need to regularly make several small subsets of this table. This was costing a lot, but using integer partitions I can decrease the cost by something like 95%. It looks something like this.

tbl_a:partition_index IN (1, 2, 5, 6, 7, 10, 11, 15, 104, 106, 111)

tbl_b:partition_index IN (3, 4, 5, 20, 21, 25, 16, 84, 201, 301, 302, 303)

and so on and so forth, with different subtables using different subsets of the index. It's ugly as all hell, but it works. I'm concerned this will be difficult to maintain: if I need to make a new subtable, or the potential permutations change, I have to edit all the .sql files with new sets of index values. I have a small table that has all the different permutations of the criteria I want, along with the associated index value. A 5 KB query on this index lookup table with the actual subtable selection criteria yields a list of index values that, if copied and pasted right into the .sql files, keeps everything working properly.

However, for architectural reasons, I cannot extract the index values from a subquery and insert them as a string into the .sql files prior to execution. I mean, I could, and it would work. But it's hacky and not a reasonable solution. However, I can't find a way to get the results of the small query on the lookup table to be used properly. It always results in a full table scan. Any ideas here?

I guess an equivalent problem would be if I had a big data table partitioned on customerID, but I only had the customer name. BQ seems to want me to query the name lookup table to get the ID, then submit a second query with the customerID as a literal. I'd like to be able to do this in a single query. But I'm stumped.

Recommended answer

Let me reproduce your problem.

SELECT MAX(views) max_views
FROM `fh-bigquery.wikipedia_v3.pageviews_2019` 
WHERE DATE(datehour) IN ('2019-03-27', '2019-04-10', '2019-05-10', '2019-10-10')
AND wiki='en'
AND title = 'Barbapapa'

1.4 GB processed.

But now you have a table with those dates:

CREATE TABLE temp.some_dates AS (
  SELECT * 
  FROM UNNEST([DATE('2019-03-27'), '2019-04-10', '2019-05-10', '2019-10-10']) date
);

And now we will run a query that takes the values out of that table:

SELECT MAX(views) max_views
FROM `fh-bigquery.wikipedia_v3.pageviews_2019` 
WHERE DATE(datehour) IN (SELECT * FROM temp.some_dates)
AND wiki='en'
AND title = 'Barbapapa'

1.4 GB processed.

No problem here: the same amount of data was processed! Why? This table is clustered. Cluster your tables.

But let's see v2 of that table, where things are not clustered:

SELECT MAX(views) max_views
FROM `fh-bigquery.wikipedia_v2.pageviews_2019` 
WHERE DATE(datehour) IN ('2019-03-27', '2019-04-10', '2019-05-10', '2019-10-10')
AND wiki='en'
AND title = 'Barbapapa'

26.5 GB processed. That's a lot more than 1.4 GB. If only I had clustered this table.

And if we get the dates out of a different table?

SELECT MAX(views) max_views
FROM `fh-bigquery.wikipedia_v2.pageviews_2019` 
WHERE DATE(datehour) IN (SELECT * FROM `temp.some_dates`)
AND wiki='en'
AND title = 'Barbapapa'

2.3 TB.

Wow, that was a really big table scan. I should have clustered my tables.

But can I fix this somehow?

Yes:

DECLARE some_dates ARRAY<DATE> DEFAULT (SELECT ARRAY_AGG(date) FROM `temp.some_dates`);


SELECT MAX(views) max_views
FROM `fh-bigquery.wikipedia_v2.pageviews_2019` 
WHERE DATE(datehour) IN UNNEST(some_dates)
AND wiki='en'
AND title = 'Barbapapa'

26.46 GB processed.

Not as good as a clustered table, but at least we used the partitioning, thanks to a script run inside BigQuery: first declare a variable, then use its contents.
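The same declare-then-use pattern could be applied to the integer-partitioned table from the question. What follows is only a sketch under assumptions: the table names `mydataset.index_lookup` and `mydataset.big_table` and the criteria columns `region` and `product` are hypothetical stand-ins, since the question does not show the real schema. Only `partition_index` and `tbl_a` come from the question.

```sql
-- Hypothetical names: mydataset.index_lookup, mydataset.big_table,
-- and the criteria columns region / product are placeholders.

-- Step 1: run the small lookup query once and keep its result in a variable.
DECLARE wanted_partitions ARRAY<INT64> DEFAULT (
  SELECT ARRAY_AGG(DISTINCT partition_index)
  FROM `mydataset.index_lookup`
  WHERE region = 'EU'
    AND product = 'widgets'
);

-- Step 2: filter the big table on the variable instead of on a subquery.
CREATE OR REPLACE TABLE `mydataset.tbl_a` AS
SELECT *
FROM `mydataset.big_table`
WHERE partition_index IN UNNEST(wanted_partitions);
```

Both statements are submitted together as one script, so no index values have to be pasted into the .sql file. Whether the partitions are actually pruned is worth confirming by comparing the bytes processed against the version with hard-coded literals.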

Still, my best advice is: Cluster your tables.
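For the integer-partitioned table in the question, a clustered copy might be created roughly as follows. This is a sketch under assumptions: `mydataset.big_table`, `mydataset.big_table_clustered`, and the cluster columns `customer_id` and `event_type` are hypothetical, and the range 0 to 400 is only a guess based on the index values shown in the question.

```sql
-- Hypothetical names: mydataset.big_table, mydataset.big_table_clustered,
-- customer_id, event_type. Adjust the range to cover the real index values.
CREATE TABLE `mydataset.big_table_clustered`
PARTITION BY RANGE_BUCKET(partition_index, GENERATE_ARRAY(0, 400, 1))
CLUSTER BY customer_id, event_type
AS
SELECT *
FROM `mydataset.big_table`;  -- one-time full scan of the source table
```

The cluster columns should be the ones the queries filter on most often, in the way the pageviews queries above filter on wiki and title.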
