拥有多个sortkey列是什么意思? [英] What does it mean to have multiple sortkey columns?

查看:157
本文介绍了拥有多个sortkey列是什么意思?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

Redshift允许将多个列指定为SORTKEY列,但是大多数最佳实践文档的编写都好像只有一个SORTKEY.

Redshift allows designating multiple columns as SORTKEY columns, but most of the best-practices documentation is written as if there were only a single SORTKEY.

如果我用SORTKEY (COL1, COL2)创建一个表,是否意味着所有列都按COL1,然后按COL2排序存储?又或者,由于它是一列存储,所以每一列都以不同的顺序存储? IE.

If I create a table with SORTKEY (COL1, COL2), does that mean that all columns are stored sorted by COL1, then COL2? Or maybe, since it is a columnar store, each column gets stored in a different order? I.e. COL1 in COL1 order, COL2 in COL2 order, and the other columns unordered?

我的情况是,我有一个带有type_id和timestamp列的表.数据大致按时间戳顺序到达.大多数查询都是针对type_id和timestamp进行限制的.通常,type_id子句更具体,这意味着通过查看type_id子句可以比通过查看timestamp子句排除更多百分比的行.因此,type_id是DISTKEY.我试图了解SORTKEY (type_id)SORTKEY (stamp)SORTKEY (type_id,stamp)SORTKEY (stamp,type_id)的优缺点.

My situation is that I have a table with (among others) a type_id and a timestamp column. Data arrives roughly in timestamp order. Most queries are joined against / restricted by both type_id and timestamp. Usually the type_id clauses are more specific, meaning a much larger percentage of rows can be excluded by looking at the type_id clause than by looking at the timestamp clause. type_id is the DISTKEY for this reason. I'm trying to understand the pros and cons of SORTKEY (type_id), SORTKEY (stamp), SORTKEY (type_id,stamp), SORTKEY (stamp,type_id).

谢谢.

推荐答案

如果声明SORTKEY(COL1, COL2),则所有列将按COL1排序,然后按COL2进行排序,就像完成ORDER BY (COL1, COL2)一样.

If you declare SORTKEY(COL1, COL2), all columns will be sorted by COL1, then COL2 as if ORDER BY (COL1, COL2) was done.

如果您使用SORTKEY来加快JOIN和AFAIU的速度,那么只要您在将要连接的表上使用相同的SORTKEY即可,因为这是合并联接.

If you are using SORTKEY to speed up a JOIN, AFAIU it doesn't matter so long as you use the same SORTKEY on the tables that will be joined because what happens is a merge join.

如果COL1type_id一样具有高度选择性,则意味着只有很少的行具有相同的type_id.因此,尽管可以将另一列添加到SORTKEY中,但是由于大多数行消除已经发生,因此它的实用程序受到限制.

If COL1 is highly selective like your type_id, it means there are only small numbers of rows which has the same type_id. Therefore although you can add another column to SORTKEY, its utility is limited because most of the row elimination has already happened.

如果COL1不像您的stamp那样具有高度选择性(顺便说一句;我希望它比type_id更具选择性?无论如何..),这意味着按不会消除那么多行.因此,声明第二个排序键更有意义.但是,这比其他方法效率低,因为尽早消除行会更便宜.如果您有时通过stamp进行过滤,但没有通过type_id进行过滤,那么这样做可能是有道理的.

If COL1 is not highly selective like your stamp (which is a bit weird btw; I would have expected it to be more selective than type_id? Anyways..), it means that filtering by stamp won't eliminate that much rows. So it makes more sense to declare a second sort key. However, this is less efficient than the other way around as eliminating rows earlier would be cheaper. If you sometimes filter by stamp but not by type_id, it may make sense to do this though.

这篇关于拥有多个sortkey列是什么意思?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆