How clustering works in BigQuery


Question

I have a table UNITARCHIVE partitioned by date and clustered by UNIT, DUID.

The total size of the table is 892 MB.

When I try this query:

SELECT * FROM `test-187010.ReportingDataset.UNITARCHIVE` WHERE duid="RRSF1" and unit="DUNIT"

BigQuery tells me it will process 892 MB. I thought clustering was supposed to reduce the amount of data scanned. I understand that when I filter by date the size is reduced dramatically, but I need the whole date range. Is this by design, or am I doing something wrong?

Recommended answer

To get the most benefit out of clustering, each partition needs to hold a certain amount of data.

For example, if the minimum size of a cluster is 100 MB (a value BigQuery decides internally) and you have only 100 MB of data per day, then querying 100 days will scan 100 × 100 MB, regardless of the clustering strategy.

As an alternative for this amount of data, partition by year instead of by day. You will then get the most benefit out of clustering even with a low amount of data per day.
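A minimal sketch of what yearly partitioning could look like for this table. The question does not name the partitioning column, so `DATE_COLUMN` below is a hypothetical placeholder, as is the new table name `UNITARCHIVE_BY_YEAR`. The sketch also assumes the column is of type DATE; for a TIMESTAMP column, `TIMESTAMP_TRUNC` would be used instead.

```sql
-- Sketch only. DATE_COLUMN and UNITARCHIVE_BY_YEAR are hypothetical names,
-- not taken from the original question.
CREATE TABLE `test-187010.ReportingDataset.UNITARCHIVE_BY_YEAR`
PARTITION BY DATE_TRUNC(DATE_COLUMN, YEAR)  -- one partition per year instead of per day
CLUSTER BY UNIT, DUID                       -- same clustering columns as the original table
AS
SELECT *
FROM `test-187010.ReportingDataset.UNITARCHIVE`;
```

The query from the question can then be pointed at the new table with the same filter on unit and duid. With far fewer and larger partitions, clustering has more data per partition to work with.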

See "Partition by week/year/month to get over the partition limit?" for a reference table that shows this off.
