How clustering works in BigQuery
Question
I have a table, UNITARCHIVE, partitioned by date and clustered by UNIT, DUID.
The total size of the table is 892 MB.
When I try this query:
SELECT * FROM `test-187010.ReportingDataset.UNITARCHIVE` WHERE duid="RRSF1" and unit="DUNIT"
BigQuery tells me it will process 892 MB. I thought clustering was supposed to reduce the amount of data scanned. I understand that when I filter by date the scanned size drops dramatically, but I need the whole date range. Is this by design, or am I doing something wrong?
Recommended answer
To get the most benefit out of clustering, each partition needs to hold a certain amount of data.
For example, if the minimum size of a cluster is 100 MB (a threshold BigQuery decides internally) and you have only 100 MB of data per day, then querying 100 days will scan 100 × 100 MB, regardless of the clustering strategy.
As an alternative with this amount of data, partition by year instead of by day. Each partition then holds enough data for clustering to prune effectively, even when the amount of data per day is low.
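The arithmetic in this answer can be sketched with a toy model. It is not how BigQuery actually computes bytes processed: the function name `estimated_scan_mb`, the 100 MB block size (the hypothetical figure used above), and the assumption that 5% of the data matches the clustering filter are all illustrative.

```python
import math


def estimated_scan_mb(partition_sizes_mb, match_percent, min_block_mb=100):
    """Toy estimate of MB scanned by a query filtering only on clustering columns.

    Assumption (not BigQuery's real algorithm): within each partition, data is
    pruned in whole blocks of min_block_mb, so at least one block per partition
    is always read.
    """
    total = 0
    for size_mb in partition_sizes_mb:
        # Data in this partition that actually matches the filter.
        needed_mb = size_mb * match_percent / 100
        # Round up to whole blocks, with a minimum of one block.
        blocks = max(1, math.ceil(needed_mb / min_block_mb))
        # A partition can never contribute more than its own size.
        total += min(size_mb, blocks * min_block_mb)
    return total


# 100 daily partitions of 100 MB each: every partition is read in full.
daily = estimated_scan_mb([100] * 100, match_percent=5)

# The same 10,000 MB in a single yearly partition: only matching blocks are read.
yearly = estimated_scan_mb([100 * 100], match_percent=5)

print(daily, yearly)
```

Under these assumptions the daily layout reads all 10,000 MB while the yearly layout reads 500 MB, which is the effect the answer describes.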
See "Partition by week/year/month to get over the partition limit?" for a reference table that demonstrates this.