如何提高分区+集群表查询的数据量? [英] How can I improve the amount of data queried with a partitioned+clustered table?

查看:14
本文介绍了如何提高分区+集群表查询的数据量?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个 BigQuery 表 - 日分区和集群.但是,当我对其运行查询时,它仍然会使用大量数据.这怎么可能?

I have a BigQuery table - day partitioned, and clustered. However, it still uses a lot of data when I run queries over it. How is this possible?

推荐答案

有时没有分区,或者每周/每月/每年分区会比每天分区表 + 集群更有效.

Sometimes no partitions, or weekly/monthly/yearly partitions will work way better than having a daily partitioned table + clustering.

这是因为 BigQuery 中的每个数据集群都有一个最小大小.如果每日分区表中每天的数据量少于该数据量,您将根本看不到对表进行聚类的任何好处.

This because each cluster of data in BigQuery has a minimum size. If each day of data in a daily partitioned table has less than that amount of data, you won't see any benefits at all from clustering your table.

例如,让我们创建一个包含 30 多年天气的表格.我将按月对这张表进行分区(将多个年份放入一张表中):

For example, let's create a table with 30+ years of weather. I will partition this table by month (to fit multiple years into one table):

CREATE TABLE `temp.gsod_partitioned`
PARTITION BY date_month
CLUSTER BY name
AS 
SELECT *, DATE_TRUNC(date, MONTH) date_month
FROM `fh-bigquery.weather_gsod.all` 

现在,让我们对其运行查询 - 使用聚类字段 name:

Now, let's run a query over it - using the clustering field name:

SELECT name, state, ARRAY_AGG(STRUCT(date,temp) ORDER BY temp DESC LIMIT 5) top_hot, MAX(date) active_until
FROM `temp.gsod_partitioned`
WHERE name LIKE 'SAN FRANC%'
AND date > '1980-01-01'
GROUP BY 1,2
ORDER BY active_until DESC 
# (2.3 sec elapsed, 3.1 GB processed)

现在,让我们在一个相同的表上执行此操作 - 按假日期分区(因此实际上没有分区),并按同一列聚集:

Now, let's do this over an identical table - partitioned by a fake date (so no partitioning really), and clustered by the same column:

SELECT name, state, ARRAY_AGG(STRUCT(date,temp) ORDER BY temp DESC LIMIT 5) top_hot, MAX(date) active_until
FROM `fh-bigquery.weather_gsod.all` 
WHERE name LIKE 'SAN FRANC%'
AND date > '1980-01-01'
GROUP BY 1,2
ORDER BY active_until DESC
# (1.5 sec elapsed, 62.8 MB processed)

仅处理了 62.8 MB 的数据(对比 3.1GB)!

Only 62.8 MB of data (vs 3.1GB) were processed!

这是因为没有分区的集群在每天没有大量 GB 的表上效率更高.

This because clustering without partitions is much more efficient on tables that don't have a lot of GB per day.

奖励:按地理分组:

SELECT name, state, ARRAY_AGG(STRUCT(date,temp) ORDER BY temp DESC LIMIT 5) top_hot, MAX(date) active_until  
FROM `fh-bigquery.weather_gsod.all_geoclustered`  
WHERE date > '1980-01-01'
AND ST_DISTANCE(point_gis, ST_GEOGPOINT(-122.465, 37.807)) < 40000
GROUP BY 1,2
ORDER BY ST_DISTANCE(ANY_VALUE(point_gis), ST_GEOGPOINT(-122.465, 37.807))
# (2.1 sec elapsed, 100.7 MB processed)

这篇关于如何提高分区+集群表查询的数据量?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆