将排序键的顺序更改为降序 [英] Change order of sortkey to descending

查看:87
本文介绍了将排序键的顺序更改为降序的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我们有一个2节点Redshift集群,其中的表包含大约1亿条记录.我们将时间戳列标记为sortkey-因为查询始终受时间限制.但是,我们的用例要求结果必须按降序排序(在sortkey上).

We have a 2-node Redshift cluster with a table with around 100M records. We marked a timestamp column as the sortkey - because the queries are always time restricted. However, our use-case requires the results has to be sorted in the descending order (on the sortkey).

经过一些基准测试后,我们注意到平均时间大约为10秒.但是,取消反向顺序后,平均时间降至1秒以下.

After some benchmarking, we noticed that the average time taken around 10s. However, when the reverse ordering was removed, the average time came down to under 1s.

是否可以将sortkey的顺序颠倒为降序?官方文档似乎并未表明这是可能的.但是我尝试在创建新表时放这个:

Is it possible to reverse the order of the sortkey to be of descending order? The official documentation doesn't seem to indicate that is possible. However I tried putting this while creating a new table:

sortkey(start_time DESC)

没有错误,但似乎没有任何作用.

There were no errors but it doesn't seem to have any effect.

在查询中添加了EXPLAIN语句的结果.

Added the result of EXPLAIN statement on the queries.

  • 使用order_by ASC查询

  • The query with order_by ASC

  explain select * from kcdr_sr_desc where user_id=396747 and start_time > '2016-01-01' and start_time < '2016-07-01' order by start_time limit 20;

结果:

    XN Limit  (cost=0.00..10.86 rows=20 width=300)
    ->  XN Merge  (cost=0.00..709235.56 rows=1306585 width=300)
       Merge Key: start_time
        ->  XN Network  (cost=0.00..709235.56 rows=1306585 width=300)
       Send to leader
       ->  XN Seq Scan on kcdr_sr_desc  (cost=0.00..709235.56 rows=1306585 width=300)
             Filter: ((user_id = 396747) AND (start_time > '2016-01-01 00:00:00'::timestamp without time zone) AND (start_time < '2016-07-01 00:00:00'::timestamp without time zone))

  • 带有order_by DESC的查询

  • The query with order_by DESC

     explain select * from kcdr_sr_desc where user_id=396747 and start_time > '2016-01-01' and start_time < '2016-07-01' order by start_time desc limit 20
    

    结果:

      XN Limit  (cost=1000000841967.42..1000000841967.47 rows=20 width=300)
      ->  XN Merge  (cost=1000000841967.42..1000000845233.88 rows=1306585 width=300)
      Merge Key: start_time
         ->  XN Network  (cost=1000000841967.42..1000000845233.88 rows=1306585 width=300)
           Send to leader
           ->  XN Sort  (cost=1000000841967.42..1000000845233.88 rows=1306585 width=300)
                 Sort Key: start_time
                 ->  XN Seq Scan on kcdr_sr_desc  (cost=0.00..709235.56 rows=1306585 width=300)
                       Filter: ((user_id = 396747) AND (start_time > '2016-01-01 00:00:00'::timestamp without time zone) AND (start_time < '2016-07-01 00:00:00'::timestamp without time zone))
    

  • 推荐答案

    Amazon Redshift表上的SORTKEY用于通过使用区域地图来提高查询效率.它并非旨在对数据进行排序以匹配查询.

    The SORTKEY on an Amazon Redshift table is used to improve efficiency of queries via the use of Zone Maps. It is not intended to sort data to match queries.

    Amazon Redshift以1MB的块将数据存储在磁盘上.每个块包含与一个表的一列有关的数据,并且来自该列的数据可以占用多个块.块可以压缩,因此它们通常包含超过1MB的数据.

    Amazon Redshift stores data on disk in 1MB blocks. Each block contains data relating to one column of one table, and data from that column can occupy multiple blocks. Blocks can be compressed, so they will typically contain more than 1MB of data.

    磁盘上的每个块都有一个关联的 Zone Map (区域映射),该映射标识该块中要存储的列的最小值和最大值.这使Redshift能够跳过不包含相关数据的块.例如,如果SORTKEY是时间戳,而查询中有一个WHERE子句将数据限制为特定日期,则Redshift可以跳过所需日期不在该日期之内的任何日期.

    Each block on disk has an associated Zone Map that identifies the minimum and maximum value in that block for the column being stored. This enables Redshift to skip over blocks that do not contain relevant data. For example, if the SORTKEY is a timestamp and a query has a WHERE clause that limits data to a specific day, then Redshift can skip over any blocks where the desired date is not within that block.

    一旦Redshift找到具有所需数据的块,它将读取这些块以执行查询.

    Once Redshift locates the blocks with desired data, it will read those blocks to execute the query.

    在查看您的EXPLAIN计划时,第二个示例显然具有一个额外的SORT步骤.看起来查询优化器知道您正在对与SORTKEY匹配的列进行排序,因此在第一个示例中跳过了排序.在表中将数据附加附加到表中,这很常见,从而导致更新的数据出现在列的末尾.

    In looking at your EXPLAIN plans, the second example clearly has an extra SORT step. It would appear that the query optimizer knows that you are sorting on a column that matches the SORTKEY, so that the sort is skipped in the first example. This is common where data is appended to tables, resulting in more recent data appearing at the end of columns.

    一些适合您的选项:

    • 如果您一直在进行新的数据加载,则可以添加一个与日期相反的新列(例如,从3000年开始的间隔减去存储的日期) .将其用作SORTKEY,数据将反向存储.实际日期列的区域地图"也将方便地反向排序.

    • If you are always doing a fresh load of data, you could add a new column that is the reverse of the date (eg an interval from the year 3000 minus the stored date). Use this as the SORTKEY and the data will be stored in reverse. The Zone Maps for the real date column will be conveniently reverse-sorted too.

    如果您继续加载新数据,则最新数据将附加到列的末尾,因此无论如何您都无法轻松保持数据的反向排序.

    If you are continually loading new data, the latest data will be appended to the end of the columns, so you can't easily keep data reverse sorted anyway.

    使用较小的日期范围.上面的查询将结果限制在六个月的范围内,返回1,306,585行.然后,查询将对它进行排序,从而将结果限制为最新的20.如果减小日期范围(例如,仅减少到一天),则将检索到较少的数据,排序将运行得更快,并且查询将更快.鉴于有很多行,平均每天有7000多个记录,因此对于LIMIT 20来说应该绰绰有余.

    Use a smaller date range. The above query is restricting results to a range of six months, returning 1,306,585 rows. The query then sorts it an limits the result to the most recent 20. If you reduce the date range (eg to just one day), the less data will be retrieved, the sort will run faster and the query will be quicker. Given that many rows, there is an average of 7000+ records per day, so that should be more than sufficient for LIMIT 20.

    请勿使用SELECT * -它将导致从磁盘读取更多的块(因为每一列都存储在单独的块中).通过仅查询实际需要的列,磁盘访问量将减少,并且查询将运行得更快.

    Do not use SELECT * -- it causes more blocks to be read from disk (since each column is stored in a separate block). By querying only the columns actually required, there will be less disk access and the query will run faster.

    这篇关于将排序键的顺序更改为降序的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

    查看全文
    登录 关闭
    扫码关注1秒登录
    发送“验证码”获取 | 15天全站免登陆