BigQuery - Clustered tables not reducing query size with multiple keys

Question

I am trying to optimize my querying in BigQuery for cost, and I have been trying out Clustered tables. For reference: BigQuery - querying only a subset of keys in a table with key value schema

Clustering the table by a single column is successfully reducing my query size. However, using multiple columns (example shown in: https://cloud.google.com/bigquery/docs/querying-clustered-tables#sample_table_used_in_the_examples) is not leading to any reduction in query size.

To use the example given in the documentation,

SELECT
  SUM(totalSale)
FROM
  mydataset.ClusteredSalesData
WHERE
  customer_id = 10000
  AND product_id LIKE 'gcp_analytics%'

Without clustering on the table, this queries the entire data set (say, 100GB). When the table is clustered only by customer_id, that drops to about 10GB (seen after the actual run, not in the validator). When it is clustered by both customer_id and product_id, however, the size does not change at all, even after actually running the query.

I have tried changing the order of the clustering columns, the order of the WHERE clauses, etc. Nothing seems to make a difference.

Is this expected behavior? A bug in BigQuery? Or am I doing something wrong?

UPDATE: Thanks to @Pentium10 for pointing me to: https://medium.com/@hoffa/bigquery-optimized-cluster-your-tables-65e2f684594b

To use the examples from the blogpost, among the following two queries,

Q1:

SELECT wiki, SUM(views) views
FROM `fh-bigquery.wikipedia_v3.pageviews_2017`
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30'
AND wiki = 'en'
--AND title = 'Barcelona'
GROUP BY wiki ORDER BY wiki

Q2:

SELECT wiki, SUM(views) views
FROM `fh-bigquery.wikipedia_v3.pageviews_2017`
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30'
AND wiki = 'en'
AND title = 'Barcelona'
GROUP BY wiki ORDER BY wiki

I would have expected Q2 to be cheaper since clustering is by (wiki, title), but that does not seem to be the case.

Recommended answer

Based on Pentium10's pointer, I ran the following queries:

SELECT wiki, SUM(views) views 
FROM `fh-bigquery.wikipedia_v3.pageviews_2017` 
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30' 
AND wiki = 'en' 
AND title = 'Barcelona' 
GROUP BY wiki ORDER BY wiki 

180.19GB processed (according to the validator). 10.3GB processed running the query.

SELECT wiki, SUM(views) views 
FROM `fh-bigquery.wikipedia_v3.pageviews_2017` 
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30' 
AND wiki = 'en' 
--AND title = 'Barcelona' 
GROUP BY wiki ORDER BY wiki 

86.1GB processed (according to the validator). 18.4GB processed running the query.

SELECT wiki, SUM(views) views 
FROM `fh-bigquery.wikipedia_v3.pageviews_2017` 
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30' 
-- AND wiki = 'en' 
AND title = 'Barcelona' 
GROUP BY wiki ORDER BY wiki 

180.19GB processed (according to the validator). 113.85GB processed running the query.

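Everything looks coherent since, as Mr. Hoffa said, "order matters" for clustered tables: filtering on 'wiki' alone brings the scan down to 18.4GB, while filtering on 'title' alone still processes 113.85GB, so 'wiki' saves more than 'title'. Filtering on both gives the smallest scan, 10.3GB.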
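
For context, here is a minimal sketch of how a table clustered in that order could be created. The destination name `my_project.my_dataset.pageviews_clustered` is hypothetical, and the sketch assumes `datehour` is a TIMESTAMP column, as the `DATE(datehour)` filters in the queries above suggest. The first column listed in CLUSTER BY is the primary sort key, which is why a filter on 'wiki' prunes more data than a filter on 'title' alone.

```sql
-- Sketch only: the destination table name is a placeholder, not from the original post.
CREATE TABLE `my_project.my_dataset.pageviews_clustered`
PARTITION BY DATE(datehour)   -- daily partitions, so the date filter prunes first
CLUSTER BY wiki, title        -- data is sorted by wiki first, then by title within each wiki
AS
SELECT *
FROM `fh-bigquery.wikipedia_v3.pageviews_2017`
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30';
```

Listing the columns as (title, wiki) instead would favor queries that filter on title, so the clustering order is best chosen to match the most common filter.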

It is true that the validator still does not report the right size for these queries, but clustered tables are still in beta, so we can expect an improvement in the future.
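
Because the validator's figure can overstate what a query on a clustered table really scans, one way to compare is to look up the bytes recorded for the job after it has run. The following is a sketch under assumptions: it presumes the jobs ran in the US multi-region (hence `region-us`) and that the `INFORMATION_SCHEMA.JOBS_BY_USER` view is available to your account; adjust the region qualifier for your own project.

```sql
-- Sketch only: `region-us` is an assumption about where the jobs ran.
SELECT
  job_id,
  creation_time,
  total_bytes_processed,  -- bytes actually scanned by the job
  total_bytes_billed      -- bytes the job was charged for
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
ORDER BY creation_time DESC
LIMIT 10;
```

The same numbers are shown in the query results panel of the web UI once a job completes, which is where the "processed running the query" figures above come from.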
