Clustering in BigQuery using CREATE TABLE


Question

I'm not sure whether I am clustering correctly. Basically, I am looking at the GCP billing info of about 50 clients. Each client has a billing ID (billing_account_id), and I cluster on that billing ID. I use the clustered table for a Data Studio dashboard.

See the SQL query below for what I do right now:

CREATE OR REPLACE TABLE `dashboardgcp`
  PARTITION BY DATE(usage_start_time)
  CLUSTER BY billing_account_id
  AS
SELECT
  *
FROM
  `datagcp`
WHERE
  usage_start_time BETWEEN TIMESTAMP('2019-01-01')
  AND TIMESTAMP(CURRENT_DATE)

The table is successfully clustered like this, but I am not seeing a noticeable increase in query performance!

Recommended answer

"So I thought by clustering it with billing_ID I should see an increase in dashboard performance"

Please consider the following points:

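Cluster structure
A cluster field is composed of an array of fields, nested like boxes from the outermost to the innermost. As the BigQuery documentation states: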

"When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data."

This means that, as @Gordon wrote, the WHERE clause of your query needs to filter on the cluster fields starting from the outermost field and moving to the inner ones in order to make the most of your cluster fields. In your case, if userId is part of the WHERE clause, you need to change your cluster fields to match.
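To illustrate the point about filtering on the cluster field, here is a minimal sketch of a dashboard-style query against the dashboardgcp table from the question. It filters on the partitioning column first and then on the clustering column. The cost column and the billing account value are assumptions made for the example; they do not appear in the question.

```sql
-- Sketch only: `cost` and the account ID literal are assumed for illustration.
SELECT
  billing_account_id,
  SUM(cost) AS total_cost
FROM
  `dashboardgcp`
WHERE
  -- Filter on the partitioning column so only the needed partitions are read.
  usage_start_time >= TIMESTAMP('2019-06-01')
  AND usage_start_time < TIMESTAMP('2019-07-01')
  -- Filter on the clustering column so blocks for other accounts can be skipped.
  AND billing_account_id = '000000-AAAAAA-000000'
GROUP BY
  billing_account_id
```

If the dashboard never filters on billing_account_id (for example, it always aggregates over all clients), clustering on that column is unlikely to reduce the data scanned.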

Cluster limitation
Clustering typically works better for queries that scan more than 1 GB of data, so if you are not scanning that much data you won't see the improvement you are looking for.
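One way to check whether your dashboard queries reach that size is to look at the bytes processed by recent query jobs. The sketch below reads the INFORMATION_SCHEMA.JOBS_BY_PROJECT view; the region qualifier `region-us` and the seven-day window are assumptions, so adjust them to where your data lives.

```sql
-- Sketch only: the region and the time window are assumed for illustration.
SELECT
  job_id,
  creation_time,
  total_bytes_processed / POW(1024, 3) AS gib_processed
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
ORDER BY
  creation_time DESC
LIMIT 20
```

If most of the dashboard's queries process well under 1 GiB, the lack of a visible speed-up is consistent with the limitation described above.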

Clustering with ingestion tables
Assuming your data is not static and you keep adding data to your table, datagcp, you need to be aware that cluster indexing is a process that BigQuery performs offline, separately from the insert operation and separately from partitioning.
The side effect is that your clustering "weakens" over time. To solve this, you will need to use a MERGE command to rebuild your clusters in order to get the most out of clustering.

From the documentation:

"Over time, as more and more operations modify a table, the degree to which the data is sorted begins to weaken, and the table becomes partially sorted".
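The answer does not show the MERGE command itself. A minimal sketch of one possible form is a self-merge with ON FALSE, which deletes every existing row and re-inserts it, so the table's data is rewritten and, as I understand it, sorted again on write. This is an assumption about what the answer intends, not a statement taken from it, and it is applied here to the clustered table dashboardgcp from the question.

```sql
-- Sketch only: a self-merge that rewrites every row of the clustered table.
-- ON FALSE means no target row ever matches a source row.
MERGE `dashboardgcp` t
USING `dashboardgcp` s
ON FALSE
-- Every existing target row is unmatched by the source: delete it.
WHEN NOT MATCHED BY SOURCE THEN
  DELETE
-- Every source row is unmatched by the target: insert it unchanged.
WHEN NOT MATCHED BY TARGET THEN
  INSERT ROW
```

Because this statement rewrites the whole table, it processes the table's full size each time it runs. Re-running the CREATE OR REPLACE TABLE statement from the question is a simpler alternative with a similar effect.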
