距离和排序键Redshift [英] Dist and Sort Keys Redshift

查看:149
本文介绍了距离和排序键Redshift的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我试图在distshift中将dist和sort键添加到某些表中.

I'm trying to add dist and sort keys to some of the tables in redshift.

我注意到,在添加表格之前,表格的大小为0.50,在添加表格之后,表格的大小增加为0.51或0.52.这可能吗 ?拥有dist和sort键的整个目的是减小表的大小并帮助提高读取/写入性能.

I notice that before adding the size of the table is 0.50 and after adding it gets increased to 0.51 or 0.52. Is this possible ? The whole purpose of having dist and sort keys is to decrease the size of the table and help in increasing the read/write performance.

推荐答案

这不是拥有DISTKEYSORTKEY的目的.

要减小表的存储大小,请使用压缩.

To decrease the storage size of a table, use compression.

DISTKEY 用于在切片之间分配数据.通过在同一片上共同定位信息,查询可以运行得更快.例如,如果您有以下表格:

The DISTKEY is used to distribute data amongst slices. By co-locating information on the same slice, queries can run faster. For example, if you had these tables:

  • customer表,DISTKEY = customer_id
  • invoices表,DISTKEY = customer_id
  • customer table, DISTKEY = customer_id
  • invoices table, DISTKEY = customer_id

...然后这些表将以相同的方式分发.给定customer_id的两个表中的所有记录将位于同一片上,从而避免了在片之间传输数据的需要. DISTKEY应该是最常用于 JOINS 的列.

...then these tables would be distributed in the same manner. All records in both tables for a given customer_id would be located on the same slice, thereby avoiding the need to transfer data between slices. The DISTKEY should be the column that is mostly used for JOINS.

SORTKEY 用于对磁盘上的数据进行排序,这是 Zone Maps 的优势.磁盘上的每个存储块的大小均为1MB,并且在一个表中仅包含一列的数据.对该列的数据进行排序,然后存储在多个块中.与每个区块关联的区域地图会标识存储在该区块中的最小最大值.然后,当使用 WHERE 语句运行查询时,Amazon Redshift仅需要读取包含所需数据范围的数据块.通过跳过WHERE子句中不包含数据的块,Redshift可以更快地运行查询.

The SORTKEY is used to sort data on disk, for the benefit of Zone Maps. Each storage block on disk is 1MB in size and contains data for only one column in one table. The data for this column is sorted, then stored in multiple blocks. The Zone Map associated with each block identifies the minimum and maximum values stored within that block. Then, when a query is run with a WHERE statement, Amazon Redshift only needs to read the blocks that contain the desired range of data. By skipping over blocks that do not contain data within the WHERE clause, Redshift can run queries much faster.

以上内容可以一起工作.例如,压缩数据需要较少的块,这也使Redshift可以基于区域地图"跳过更多数据.为了从查询中获得最佳性能,请同时使用DISTKEY,SORTKEY和压缩.

The above can all work together. For example, compressed data requires fewer blocks, which also allows Redshift to skip over more data based on the Zone Maps. To get the best possible performance out of queries, use DISTKEY, SORTKEY and compression together.

(通常建议不要压缩SORTKEY列,因为它会导致从单个块加载太多行.)

(It is often recommended not to compress the SORTKEY column because it causes too many rows to be loaded from a single block.)

另请参见: Amazon Redshift的十大性能调优技术

这篇关于距离和排序键Redshift的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆