How can I improve the amount of data queried with a partitioned+clustered table?


Question

I have a BigQuery table - day partitioned, and clustered. However, it still processes a lot of data when I run queries over it. How is this possible?

Answer

Sometimes no partitioning at all, or weekly/monthly/yearly partitions, will work far better than a daily partitioned table plus clustering.

This is because each cluster of data in BigQuery has a minimum size. If each day in a daily partitioned table holds less than that amount of data, you won't see any benefit at all from clustering your table.
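One way to see whether your daily partitions are too small is to inspect their sizes directly. A minimal sketch, assuming a dataset named `temp` containing the partitioned table (BigQuery exposes per-partition sizes through the `INFORMATION_SCHEMA.PARTITIONS` view):

```sql
-- Sketch: list the size of each partition. If partitions hold only a few MB
-- each, the clusters inside them are too small for pruning to help.
SELECT
  partition_id,
  total_rows,
  ROUND(total_logical_bytes / POW(1024, 3), 3) AS logical_gb
FROM `temp.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'gsod_partitioned'
ORDER BY partition_id
```

The table name `gsod_partitioned` here refers to the example table created below; substitute your own table to run the same check.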

For example, let's create a table with 30+ years of weather data. I will partition this table by month (to fit multiple years into one table):

CREATE TABLE `temp.gsod_partitioned`
PARTITION BY date_month
CLUSTER BY name
AS 
SELECT *, DATE_TRUNC(date, MONTH) date_month
FROM `fh-bigquery.weather_gsod.all` 

Now, let's run a query over it - using the clustering field name:

SELECT name, state, ARRAY_AGG(STRUCT(date,temp) ORDER BY temp DESC LIMIT 5) top_hot, MAX(date) active_until
FROM `temp.gsod_partitioned`
WHERE name LIKE 'SAN FRANC%'
AND date > '1980-01-01'
GROUP BY 1,2
ORDER BY active_until DESC 
# (2.3 sec elapsed, 3.1 GB processed)

Now, let's run the same query over an identical table - partitioned by a fake date (so effectively not partitioned at all), and clustered by the same column:

SELECT name, state, ARRAY_AGG(STRUCT(date,temp) ORDER BY temp DESC LIMIT 5) top_hot, MAX(date) active_until
FROM `fh-bigquery.weather_gsod.all` 
WHERE name LIKE 'SAN FRANC%'
AND date > '1980-01-01'
GROUP BY 1,2
ORDER BY active_until DESC
# (1.5 sec elapsed, 62.8 MB processed)

Only 62.8 MB of data (vs 3.1 GB) was processed!

This is because, on tables that don't have many GB of data per day, clustering without partitioning is much more efficient.
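Such a clustered-only table can be created the same way as the partitioned one above, just without the `PARTITION BY` clause. A minimal sketch (the destination table name `temp.gsod_clustered_only` is hypothetical):

```sql
-- Sketch: same source data, no PARTITION BY at all,
-- clustered on the column the queries filter by.
CREATE TABLE `temp.gsod_clustered_only`
CLUSTER BY name
AS
SELECT *
FROM `fh-bigquery.weather_gsod.all`
```

With no partition boundaries, BigQuery is free to build full-sized clusters across the whole table, so a filter like `name LIKE 'SAN FRANC%'` can prune far more data.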

Bonus: clustered by geo:

SELECT name, state, ARRAY_AGG(STRUCT(date,temp) ORDER BY temp DESC LIMIT 5) top_hot, MAX(date) active_until  
FROM `fh-bigquery.weather_gsod.all_geoclustered`  
WHERE date > '1980-01-01'
AND ST_DISTANCE(point_gis, ST_GEOGPOINT(-122.465, 37.807)) < 40000
GROUP BY 1,2
ORDER BY ST_DISTANCE(ANY_VALUE(point_gis), ST_GEOGPOINT(-122.465, 37.807))
# (2.1 sec elapsed, 100.7 MB processed)
