Why is the cost of a query on today's cluster/partition much higher than on previous dates?

Problem description

I have a partitioned and clustered table, as follows:

When I run this query:

SELECT
  projectId
FROM
  `projectId.dataset.tables`
WHERE _PARTITIONTIME >= "2019-03-16 00:00:00" AND _PARTITIONTIME <= "2019-03-17 00:00:00" 
  AND projectId='myproject' 
GROUP BY
  projectId
limit 1

I see an actual scan of 597 MB.

However, when I run the same query on the previous day, as follows:

SELECT
  projectId
FROM
  `projectId.dataset.tables`
WHERE _PARTITIONTIME >= "2019-03-15 00:00:00" AND _PARTITIONTIME <= "2019-03-16 00:00:00" 
  AND projectId='myproject' 
GROUP BY
  projectId
limit 1

I see an actual scan of 122 MB.

Note: the results are even worse if I add more columns.

To make sure the partitions are the same size, I counted the number of projectId values in each partition:

SELECT _partitionTime as date, count(projectId) as count
FROM
  `projectId.dataset.tables`
WHERE _PARTITIONTIME >= "2019-03-15 00:00:00" AND _PARTITIONTIME <= "2019-03-17 00:00:00" 
GROUP BY
  date

As you can see, today's partition has even fewer rows than the previous two days.

In addition, I tried to query the streaming buffer with this query, which returned no results:

SELECT projectId FROM `projectId.dataset.tables`
WHERE _PARTITIONTIME IS NULL

My conclusion is that the streaming buffer is affecting the cost of queries on a clustered table, but I'm not sure how or why.

Any ideas on what is going on here, and why I see a higher cost when querying today's partition?

Recommended answer

When you cluster a table, you're basically choosing how it is physically sorted while stored.

When you stream into a table, new rows are stored roughly in the order received, which breaks the "physically sorted" promise of clustering.

BigQuery should be smart enough to silently re-order your clustered tables once in a while, but if that process has not run yet, you won't see the benefits of clustering.
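To make the effect concrete, here is a minimal toy sketch in Python. It is not BigQuery's actual storage engine: the block size, the project names, and the min/max pruning rule are all assumptions chosen for illustration. It only shows why rows stored in arrival order force a filter on the clustering column to touch far more blocks than rows stored in sorted order.

```python
import random

BLOCK_SIZE = 1000  # rows per storage block (hypothetical value)


def split_into_blocks(rows, block_size=BLOCK_SIZE):
    """Cut a list of rows into consecutive fixed-size blocks."""
    return [rows[i:i + block_size] for i in range(0, len(rows), block_size)]


def blocks_scanned(blocks, project_id):
    """Count blocks whose [min, max] value range could contain project_id.

    A block can be skipped only when project_id falls outside its range.
    """
    return sum(1 for block in blocks if min(block) <= project_id <= max(block))


random.seed(0)
projects = [f"project-{i:03d}" for i in range(100)]
# Rows arrive in random order, as they would when streamed in.
rows = [random.choice(projects) for _ in range(100_000)]

clustered = split_into_blocks(sorted(rows))  # physically sorted storage
streamed = split_into_blocks(rows)           # stored in arrival order

target = "project-042"
clustered_scans = blocks_scanned(clustered, target)
streamed_scans = blocks_scanned(streamed, target)

print("blocks scanned, sorted storage:  ", clustered_scans)
print("blocks scanned, arrival order:   ", streamed_scans)
```

In this model the sorted layout keeps all rows for one projectId in a handful of adjacent blocks, while the arrival-order layout spreads them across nearly every block, so the same filter reads much more data even though the row counts are identical.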

According to the currently published documentation, you can force a re-clustering of unsorted data with MERGE:

Over time, as more and more operations modify a table, the degree to which the data is sorted begins to weaken, and the table becomes partially sorted. In a partially sorted table, queries that use the clustering columns may need to scan more blocks compared to a table that is fully sorted. You can re-cluster the data in the entire table by running a SELECT * query that selects from and overwrites the table (or any specific partition in it). In addition, any arbitrary portion of the table can be re-clustered using a DML MERGE statement.

  • https://cloud.google.com/bigquery/docs/clustered-tables
