有多个 sortkey 列是什么意思? [英] What does it mean to have multiple sortkey columns?

查看:31
本文介绍了有多个 sortkey 列是什么意思?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

Redshift 允许将多个列指定为 SORTKEY 列,但大多数最佳实践文档的编写方式就好像只有一个 SORTKEY.

Redshift allows designating multiple columns as SORTKEY columns, but most of the best-practices documentation is written as if there were only a single SORTKEY.

如果我用 SORTKEY (COL1, COL2) 创建一个表,这是否意味着所有列都按 COL1 排序,然后按 COL2 排序?或者,因为它是一个列式存储,所以每一列都以不同的顺序存储?IE.COL1按COL1顺序,COL2按COL2顺序,其他列无序?

If I create a table with SORTKEY (COL1, COL2), does that mean that all columns are stored sorted by COL1, then COL2? Or maybe, since it is a columnar store, each column gets stored in a different order? I.e. COL1 in COL1 order, COL2 in COL2 order, and the other columns unordered?

我的情况是我有一个表(其中包括)一个 type_id 和一个时间戳列.数据大致按时间戳顺序到达.大多数查询都受到 type_id 和时间戳的连接/限制.通常 type_id 子句更具体,这意味着通过查看 type_id 子句比查看时间戳子句可以排除更大比例的行.由于这个原因,type_id 是 DISTKEY.我试图了解 SORTKEY (type_id)SORTKEY (stamp)SORTKEY (type_id,stamp)、<代码>SORTKEY (stamp,type_id).

My situation is that I have a table with (among others) a type_id and a timestamp column. Data arrives roughly in timestamp order. Most queries are joined against / restricted by both type_id and timestamp. Usually the type_id clauses are more specific, meaning a much larger percentage of rows can be excluded by looking at the type_id clause than by looking at the timestamp clause. type_id is the DISTKEY for this reason. I'm trying to understand the pros and cons of SORTKEY (type_id), SORTKEY (stamp), SORTKEY (type_id,stamp), SORTKEY (stamp,type_id).

谢谢.

推荐答案

如果声明了SORTKEY(COL1, COL2),则所有列都将按COL1排序,然后COL2 就好像 ORDER BY (COL1, COL2) 已完成.

If you declare SORTKEY(COL1, COL2), all columns will be sorted by COL1, then COL2 as if ORDER BY (COL1, COL2) was done.

如果您使用 SORTKEY 来加速 JOIN,AFAIU 只要您在将要连接的表上使用相同的 SORTKEY 就无所谓了,因为发生的是合并连接.

If you are using SORTKEY to speed up a JOIN, AFAIU it doesn't matter so long as you use the same SORTKEY on the tables that will be joined because what happens is a merge join.

如果 COL1 像您的 type_id 一样具有高度选择性,则意味着只有少量行具有相同的 type_id.因此,尽管您可以向 SORTKEY 中添加另一列,但它的效用是有限的,因为大部分行消除已经发生.

If COL1 is highly selective like your type_id, it means there are only small numbers of rows which has the same type_id. Therefore although you can add another column to SORTKEY, its utility is limited because most of the row elimination has already happened.

如果 COL1 不像您的 stamp 那样具有高度选择性(顺便说一句,这有点奇怪;我原以为它比 type_id?无论如何......),这意味着通过 stamp 过滤不会消除那么多行.所以声明第二个排序键更有意义.然而,这比其他方式效率低,因为提前消除行会更便宜.如果您有时按 stamp 而不是按 type_id 过滤,那么这样做可能是有意义的.

If COL1 is not highly selective like your stamp (which is a bit weird btw; I would have expected it to be more selective than type_id? Anyways..), it means that filtering by stamp won't eliminate that much rows. So it makes more sense to declare a second sort key. However, this is less efficient than the other way around as eliminating rows earlier would be cheaper. If you sometimes filter by stamp but not by type_id, it may make sense to do this though.

这篇关于有多个 sortkey 列是什么意思?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆