Why is the cost of a query on today's cluster/partition much higher than on previous dates?
Problem description
I have a partitioned and clustered table as follows:
When I run this query:
SELECT
  projectId
FROM
  `projectId.dataset.tables`
WHERE
  _PARTITIONTIME >= "2019-03-16 00:00:00"
  AND _PARTITIONTIME <= "2019-03-17 00:00:00"
  AND projectId = 'myproject'
GROUP BY
  projectId
LIMIT 1
I see 597 MB.
However, when I run the same query on the previous day, as follows:
SELECT
  projectId
FROM
  `projectId.dataset.tables`
WHERE
  _PARTITIONTIME >= "2019-03-15 00:00:00"
  AND _PARTITIONTIME <= "2019-03-16 00:00:00"
  AND projectId = 'myproject'
GROUP BY
  projectId
LIMIT 1
I see 122 MB.
Note: The results are even worse if I add more columns.
To make sure my partitions are the same size, I counted the number of projectId values in each partition:
SELECT
  _PARTITIONTIME AS date,
  COUNT(projectId) AS count
FROM
  `projectId.dataset.tables`
WHERE
  _PARTITIONTIME >= "2019-03-15 00:00:00"
  AND _PARTITIONTIME <= "2019-03-17 00:00:00"
GROUP BY
  date
As you can see, today's partition has even fewer rows than the previous 2 days.
In addition, I tried to query the streaming buffer using this query, which returned no results:
SELECT projectId FROM `projectId.dataset.tables`
WHERE _PARTITIONTIME IS NULL
My conclusion is that the streaming buffer is impacting the cost of queries on a clustered table, but I'm not sure how that can be or why.
Any ideas on what is going on here, and why I see a higher cost when querying today's partition?
Recommended answer
When you cluster a table, you're basically choosing how it is physically sorted in storage.
When you stream into a table, new rows are stored roughly in the order received, which breaks the "physically sorted" promise of clustering.
BigQuery should be smart enough to silently re-order your clustered tables once in a while, but if that process has not yet run on the freshly streamed data, you won't see the benefits of clustering.
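To make the effect concrete, here is a toy model in Python. It is only a sketch and not how BigQuery actually stores data: it assumes rows are packed into fixed-size blocks, that each block records the minimum and maximum of the clustering column, and that a filter on that column only scans blocks whose range could contain the value. The names `make_blocks` and `blocks_scanned`, the block size, and the project ids are all hypothetical.

```python
import random


def make_blocks(rows, block_size):
    """Pack rows into fixed-size blocks, preserving the order given."""
    return [rows[i:i + block_size] for i in range(0, len(rows), block_size)]


def blocks_scanned(blocks, value):
    """Count blocks whose [min, max] range could contain `value`.

    Assumption: the engine keeps per-block min/max metadata and skips
    any block whose range excludes the filter value.
    """
    return sum(1 for block in blocks if min(block) <= value <= max(block))


random.seed(0)  # keep the illustration reproducible

# 10 hypothetical project ids with 1000 rows each.
rows = [f"project-{i:02d}" for i in range(10) for _ in range(1000)]

# Streaming: rows land roughly in arrival order, modeled here as a shuffle.
random.shuffle(rows)
streamed = make_blocks(rows, block_size=500)

# Re-clustering: the same rows, rewritten in sorted order.
reclustered = make_blocks(sorted(rows), block_size=500)

target = "project-03"
print("blocks scanned, streamed order:", blocks_scanned(streamed, target))
print("blocks scanned, re-clustered:  ", blocks_scanned(reclustered, target))
```

In this model both layouts hold exactly the same rows, yet the filter touches far more blocks in the arrival-order layout than in the sorted one. That matches what the question observes: a partition with fewer rows can still cost more to query while its data is unsorted.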
According to the currently published documentation, you can force a re-clustering of unsorted data with MERGE:
Over time, as more and more operations modify a table, the degree to which the data is sorted begins to weaken, and the table becomes partially sorted. In a partially sorted table, queries that use the clustering columns may need to scan more blocks compared to a table that is fully sorted. You can re-cluster the data in the entire table by running a SELECT * query that selects from and overwrites the table (or any specific partition in it). In addition, any arbitrary portion of the table can be re-clustered using a DML MERGE statement.
- https://cloud.google.com/bigquery/docs/clustered-tables