Why does Redshift need to do a full table scan to find the max value of the DIST/SORT key?


Problem description


I'm doing simple tests on Redshift to try and speed up the insertion of data into a Redshift table. One thing I noticed today is that doing something like this

CREATE TABLE a (x int) DISTSTYLE key DISTKEY (x) SORTKEY (x);
INSERT INTO a (x) VALUES (1), (2), (3), (4);
VACUUM a; ANALYZE a;

EXPLAIN SELECT MAX(x) FROM a;

yields

QUERY PLAN
XN Aggregate  (cost=0.05..0.05 rows=1 width=4)
  ->  XN Seq Scan on a  (cost=0.00..0.04 rows=4 width=4)

I know this is only 4 rows, but it still shouldn't be doing a full table scan to find the max value of a pre-sorted column. Isn't that metadata included in the work done by ANALYZE?

And just as a sanity check, the EXPLAIN for SELECT x FROM a WHERE x > 3 only scans 2 rows instead of the whole table.

Edit: I inserted 1,000,000 more rows into the table with random values from 1 to 10,000. Did a vacuum and analyze. The query plan still says it has to scan all 1,000,004 rows.

Solution

Analyzing query plans in a tiny data set does not yield any practical insight on how the database would perform a query.

The optimizer has thresholds, and when the cost difference between different plans is small enough it stops considering alternative plans. The idea is that for simple queries, the time spent searching for the "perfect" execution plan could exceed the total execution time of a less optimal plan.

Redshift was developed on the code base of ParAccel DB. ParAccel has literally hundreds of parameters that can be changed/adjusted to optimize the database for different workloads/situations.

Since Redshift is a "managed" offering, it has these settings preset at levels deemed optimal by Amazon engineers given an "expected" workload.

In general, Redshift and ParAccel are not that great for single slice queries. These queries tend to be run in all slices anyway, even if they are only going to find data in a single slice.

Once a query is executing in a slice, the minimum amount of data read is one block. Depending on the block size, this can mean hundreds of thousands of rows.

Remember, Redshift does not have indexes. So you are not going to have a simple record lookup that reads a few entries off an index and then zeroes in on a single page on disk. It will always read at least one entire block for that table, and it will do that in every slice.


How to have a meaningful data set to be able to evaluate a query plan?

The short answer is that your table should have a "large number" of data blocks per slice.

How many blocks per slice is my table going to require? The answer depends on several factors:

  1. Number of nodes in your cluster
  2. Type of node in the cluster - Number of slices per node
  3. Data Type - How many bytes each value requires.
  4. The type of compression encoding for the column involved in the query. The optimal encoding depends on data demographics

So let's start at the top.

Redshift is an MPP database, where processing is spread across multiple nodes. See Redshift's architecture here.

Each node is further subdivided into slices, which are dedicated data partitions with corresponding hardware resources to process queries on that partition of the data.

When a table is created in Redshift, and data is inserted, Redshift will allocate a minimum of one block per slice.


Here is a simple example:

If you created a cluster with two ds1.8xlarge nodes, you would have 16 slices per node times two nodes for a total of 32 slices.

Let's say we are querying, and the column in the WHERE clause is something like "ITEM_COUNT", an integer. An integer consumes 4 bytes.

Redshift uses a block size of 1MB.

So in this scenario, your ITEM_COUNT column would be allocated a minimum of 32 blocks times the 1MB block size, which equates to 32MB of storage.

If you have 32MB of storage and each entry only consumes 4 bytes, you can have more than 8 million entries, and they would all fit with just a single block per slice.
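The arithmetic above can be sketched in a few lines of Python. This is only a rough illustration; the 16-slices-per-node and 1MB-block figures are taken from the example itself, not queried from a live cluster:

```python
# Minimum-storage math for the two-node ds1.8xlarge example above.
# Assumed figures: 16 slices per node, 1 MB blocks, 4-byte integers.
BLOCK_SIZE_BYTES = 1024 * 1024          # Redshift block size: 1 MB
INT_BYTES = 4                           # raw size of one integer value

nodes = 2
slices_per_node = 16
slices = nodes * slices_per_node        # 32 slices in the cluster

min_blocks = slices                     # at least one block per slice
min_storage_mb = min_blocks * BLOCK_SIZE_BYTES // (1024 * 1024)

values_per_block = BLOCK_SIZE_BYTES // INT_BYTES   # 262,144 values
total_values = min_blocks * values_per_block

print(min_storage_mb, total_values)     # prints: 32 8388608
```

So the column's minimum footprint is 32MB, and over 8 million integers fit before any slice needs a second block.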

In this example in the Amazon Redshift documentation they load close to 40 million rows to evaluate and compare different encoding techniques. Read it here.


But wait.....

There is also compression: if you have a 75% compression rate, that would mean that even 32 million records would still fit into that single block per slice.
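Continuing the same back-of-envelope sketch, an assumed 75% compression rate shrinks each 4-byte integer to roughly 1 byte, quadrupling how many values fit before any slice needs a second block:

```python
# Effect of an assumed 75% compression rate on the example above.
BLOCK_SIZE_BYTES = 1024 * 1024          # 1 MB block
INT_BYTES = 4
COMPRESSION_SAVED = 0.75                # assumed: 75% of space saved

slices = 32                             # from the two-node example
effective_bytes = INT_BYTES * (1 - COMPRESSION_SAVED)   # ~1 byte/value
values_per_block = int(BLOCK_SIZE_BYTES / effective_bytes)
total_values = slices * values_per_block

print(f"{total_values:,}")              # prints: 33,554,432
```

That is where the "even 32 million records" figure comes from: roughly 33.5 million compressed integers still occupy only one block per slice.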

What is the bottom line?

In order to analyze your query plan meaningfully, you need tables and columns that span several blocks. In our example above, 32 million rows would still fit in a single block per slice.

This means that, in the configuration above and with all of these assumptions, a table with a single record would most likely have the same query plan as a table with 32 million records, because in both cases the database only needs to read a single block per slice.


If you want to understand how your data is distributed across slices and how many blocks are being used you can use the queries below:

How many rows per slice:

select trim(name) as table_name, id, slice, sorted_rows, rows
from stv_tbl_perm
where name like '<<your-tablename>>'
order by slice;

How many blocks are used:

select trim(name) as table_name, col, b.slice, b.num_values, count(b.slice)
from stv_tbl_perm a, stv_blocklist b
where a.id = b.tbl
  and a.slice = b.slice
  and name like '<<your-tablename>>'
group by 1,2,3,4
order by col, slice;
