有多个 sortkey 列是什么意思? [英] What does it mean to have multiple sortkey columns?
问题描述
Redshift 允许将多个列指定为 SORTKEY
列,但大多数最佳实践文档的编写方式就好像只有一个 SORTKEY.
Redshift allows designating multiple columns as SORTKEY
columns, but most of the best-practices documentation is written as if there were only a single SORTKEY.
如果我用 SORTKEY (COL1, COL2)
创建一个表,这是否意味着所有列都按 COL1 排序,然后按 COL2 排序?或者,因为它是一个列式存储,所以每一列都以不同的顺序存储?IE.COL1按COL1顺序,COL2按COL2顺序,其他列无序?
If I create a table with SORTKEY (COL1, COL2)
, does that mean that all columns are stored sorted by COL1, then COL2? Or maybe, since it is a columnar store, each column gets stored in a different order? I.e. COL1 in COL1 order, COL2 in COL2 order, and the other columns unordered?
我的情况是我有一个表(其中包括)一个 type_id 和一个时间戳列.数据大致按时间戳顺序到达.大多数查询都受到 type_id 和时间戳的连接/限制.通常 type_id 子句更具体,这意味着通过查看 type_id 子句比查看时间戳子句可以排除更大比例的行.由于这个原因,type_id 是 DISTKEY.我试图了解 SORTKEY (type_id)
、SORTKEY (stamp)
、SORTKEY (type_id,stamp)
、<代码>SORTKEY (stamp,type_id).
My situation is that I have a table with (among others) a type_id and a timestamp column. Data arrives roughly in timestamp order. Most queries are joined against / restricted by both type_id and timestamp. Usually the type_id clauses are more specific, meaning a much larger percentage of rows can be excluded by looking at the type_id clause than by looking at the timestamp clause. type_id is the DISTKEY for this reason. I'm trying to understand the pros and cons of SORTKEY (type_id)
, SORTKEY (stamp)
, SORTKEY (type_id,stamp)
, SORTKEY (stamp,type_id)
.
谢谢.
推荐答案
如果声明了SORTKEY(COL1, COL2)
,则所有列都将按COL1
排序,然后COL2
就好像 ORDER BY (COL1, COL2)
已完成.
If you declare SORTKEY(COL1, COL2)
, all columns will be sorted by COL1
, then COL2
as if ORDER BY (COL1, COL2)
was done.
如果您使用 SORTKEY
来加速 JOIN,AFAIU 只要您在将要连接的表上使用相同的 SORTKEY
就无所谓了,因为发生的是合并连接.
If you are using SORTKEY
to speed up a JOIN, AFAIU it doesn't matter so long as you use the same SORTKEY
on the tables that will be joined because what happens is a merge join.
如果 COL1
像您的 type_id
一样具有高度选择性,则意味着只有少量行具有相同的 type_id
.因此,尽管您可以向 SORTKEY 中添加另一列,但它的效用是有限的,因为大部分行消除已经发生.
If COL1
is highly selective like your type_id
, it means there are only small numbers of rows which has the same type_id
. Therefore although you can add another column to SORTKEY, its utility is limited because most of the row elimination has already happened.
如果 COL1
不像您的 stamp
那样具有高度选择性(顺便说一句,这有点奇怪;我原以为它比 type_id代码>?无论如何......),这意味着通过
stamp
过滤不会消除那么多行.所以声明第二个排序键更有意义.然而,这比其他方式效率低,因为提前消除行会更便宜.如果您有时按 stamp
而不是按 type_id
过滤,那么这样做可能是有意义的.
If COL1
is not highly selective like your stamp
(which is a bit weird btw; I would have expected it to be more selective than type_id
? Anyways..), it means that filtering by stamp
won't eliminate that much rows. So it makes more sense to declare a second sort key. However, this is less efficient than the other way around as eliminating rows earlier would be cheaper. If you sometimes filter by stamp
but not by type_id
, it may make sense to do this though.
这篇关于有多个 sortkey 列是什么意思?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!