Difference in BigQuery performance between partition and full table


Problem description

I have about 1 billion rows stored two ways: as a union of 25 partitions (about 40M rows each) and as one full table. I run a query that calculates distinct counts; it usually finds its data in 1-4 partitions. (The query is dynamic, based on a WHERE clause.) The same query runs in 30 seconds on the union of all tables versus 50 seconds on the full table, with the same GB processed.

First of all, great performance :-)

The questions are:
1. In terms of performance only, what are the principles for choosing a union versus one big table? Is a partitioned table always faster?
2. If the query uses only a few partitions, why am I charged for the same GB? This means I will have to construct the query dynamically to choose the right partitions, which is a burden. (I understand you don't have a SQL-like optimizer, but if I need to manage partitions, shouldn't I benefit from it?)

Thanks a lot

Solution

For both of the queries you've described, BigQuery still processes all of your data. For the unioned query, the layout of the data may be somewhat advantageous, but it doesn't mean that BigQuery is doing any less work -- hence the fact that you are charged the same. If you can, as you suggested, construct a query that only uses the required partitions, this will be less data to process and therefore less expensive.

It is difficult to predict whether having all of your data in a single table or spreading it across multiple tables and doing union queries is going to improve performance. For this particular query, it sounds like the union is faster; for other queries, such as ones doing more work that is spread across the partitions, it might be slower.

I'd say a rule of thumb is that if you can pre-filter the data by figuring out which partitions are going to be needed, you're going to be better off, if only because you can then run less expensive queries. Your queries are unlikely to be slower over smaller data, and they may often be faster.
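To make the pre-filtering idea concrete, here is a minimal sketch of building the query dynamically so that it touches only the tables a date filter can hit. Everything specific in it is an assumption for illustration: the original post does not say how the tables are split, so the monthly `events_YYYYMM` names, the `mydataset` dataset, and the `user_id` and `event_date` columns are all hypothetical. The code only builds a SQL string with `UNION ALL`; it does not call BigQuery.

```python
from datetime import date

# Hypothetical layout: one table per month, named events_YYYYMM.
# The original post does not say how the 25 tables are split.
PARTITIONS = {
    "events_201201": (date(2012, 1, 1), date(2012, 1, 31)),
    "events_201202": (date(2012, 2, 1), date(2012, 2, 29)),
    "events_201203": (date(2012, 3, 1), date(2012, 3, 31)),
    "events_201204": (date(2012, 4, 1), date(2012, 4, 30)),
}


def tables_for_range(start, end, partitions=PARTITIONS):
    """Return the names of partitions whose date span overlaps [start, end]."""
    return sorted(
        name
        for name, (first, last) in partitions.items()
        if first <= end and last >= start
    )


def build_query(start, end, dataset="mydataset", column="user_id"):
    """Build a distinct-count query over only the partitions that are needed."""
    tables = tables_for_range(start, end)
    if not tables:
        raise ValueError("no partition covers the requested date range")
    # start and end are date objects, so they render as ISO dates (YYYY-MM-DD).
    selects = [
        f"SELECT {column} FROM `{dataset}.{name}` "
        f"WHERE event_date BETWEEN '{start}' AND '{end}'"
        for name in tables
    ]
    union = "\n  UNION ALL\n  ".join(selects)
    return f"SELECT COUNT(DISTINCT {column}) AS n\nFROM (\n  {union}\n)"


if __name__ == "__main__":
    # A filter spanning mid-February to early March needs only two tables.
    print(build_query(date(2012, 2, 10), date(2012, 3, 5)))
```

With the figures from the question (25 tables of about 40M rows each), a filter that lands on 2 tables would read roughly 2/25, or 8%, of the rows, assuming the tables are of similar size.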

I should also note that improving the syntax for selecting multiple tables in a query (e.g., letting people specify date ranges or wildcards in their queries) is one of our most frequently requested features, and there is a good chance we'll get to that fairly soon. How are your tables partitioned? What would make it simpler to specify the right tables for your queries?
