Redshift Query taking too much time


Problem description

In Redshift, queries are taking too long to execute. Some queries keep running indefinitely or are aborted after some time.

I have very limited knowledge of Redshift, and I am finding it difficult to understand the query plan well enough to optimise the query.

Below is one of the queries we run, along with its query plan. This query takes 20 seconds to execute.

Query

SELECT
    date_trunc('day', ti) AS date,
    count(DISTINCT deviceID) AS COUNT
FROM
    live_events
WHERE
    brandID = 3927
    AND ti >= '2017-08-02T00:00:00+00:00'
    AND ti <= '2017-09-02T00:00:00+00:00'
GROUP BY
    1  

Primary key
brandID

Interleaved Sort Keys
We have set the following columns as interleaved sort keys:
brandID, ti, event_name

Query Plan

Recommended answer

You have 126 million rows in that table. It's going to take more than a second on a single dc1.large node.

Here are some ways you could improve the performance:

More nodes

Spreading data across more nodes allows more parallelization, because each node adds processing power and storage. Even if your data volume only justifies one node, add more nodes if you want more performance.

SORTKEY

For the right type of query, the SORTKEY can be the best way to improve query speed. Sorting data on disk allows Redshift to skip over blocks that it knows do not contain relevant data.

For example, your query has WHERE brandID = 3927, so having brandID as the SORTKEY would make this extremely efficient, because very few disk blocks would contain data for one brand.

Interleaved sorting is rarely the best sorting method to use, because it is less efficient than a single or compound sort key and takes a long time to VACUUM. If the query you have shown is typical of the queries you are running, then use a compound sort key of brandId, ti or ti, brandId. It will be much more efficient.

The SORTKEY is typically a date column, since date columns often appear in a WHERE clause and the table stays sorted automatically if data is always appended in time order.

The interleaved sort is probably causing Redshift to read many more disk blocks to find your data, thereby significantly increasing query time.
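
One way the compound sort key suggested above could be applied is a deep copy of the table. The following is a minimal sketch, not a definitive migration: the table names live_events_compound and live_events_old are hypothetical, and it assumes you can pause loads into live_events while the copy runs.

```sql
-- Sketch only: live_events_compound and live_events_old are hypothetical names.
-- Build a copy of the table whose rows are stored in (brandID, ti) order.
CREATE TABLE live_events_compound
COMPOUND SORTKEY (brandID, ti)
AS
SELECT * FROM live_events;

-- Swap the tables once the copy has been checked.
ALTER TABLE live_events RENAME TO live_events_old;
ALTER TABLE live_events_compound RENAME TO live_events;
```

A table created with CREATE TABLE AS does not inherit constraints such as the primary key, so those would need to be re-declared if you rely on them.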

DISTKEY

The DISTKEY should typically be set to the field that is most used in JOIN statements on the table. This is because data relating to the same DISTKEY value is stored on the same slice. This won't have such a large impact on a single-node cluster, but it is still worth getting right.

Again, you have only shown one type of query, so it is hard to recommend a DISTKEY. Based on this query alone, I would recommend DISTSTYLE EVEN so that all slices participate in the query. (It was also the default distribution style when no DISTKEY was specified; newer Redshift releases default to DISTSTYLE AUTO.) Alternatively, set the DISTKEY to a field not shown, but certainly don't use brandId as the DISTKEY, otherwise only one slice will participate in the query shown.
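
To see how the table is currently distributed before changing anything, something like the following could be used. It is a sketch that assumes the SVV_TABLE_INFO system view is visible to your user and that your Redshift release supports ALTER DISTSTYLE; on older releases the distribution style can only be changed by recreating the table.

```sql
-- Show the current distribution style and how unevenly rows are spread
-- across slices (skew_rows close to 1 means an even spread).
SELECT "table", diststyle, skew_rows
FROM svv_table_info
WHERE "table" = 'live_events';

-- Switch to even distribution so that every slice holds part of the table.
-- Assumption: the cluster runs a release that supports ALTER DISTSTYLE.
ALTER TABLE live_events ALTER DISTSTYLE EVEN;
```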

VACUUM

VACUUM your tables regularly so that the data is stored in SORTKEY order and deleted data is removed from storage.
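
A routine maintenance pass on the table from the question might look like the sketch below. How often to run it is an assumption that depends on how much data you load and delete; the unsorted and stats_off columns of SVV_TABLE_INFO give a rough indication of whether it is needed.

```sql
-- Check what percentage of rows is unsorted and how stale the statistics are.
SELECT "table", unsorted, stats_off, tbl_rows
FROM svv_table_info
WHERE "table" = 'live_events';

-- Re-sort the rows and reclaim the space left by deleted rows.
-- VACUUM cannot run inside a transaction block.
VACUUM FULL live_events;

-- While the table still uses an interleaved sort key, a reindex may be
-- needed instead; it is considerably slower than a plain vacuum.
-- VACUUM REINDEX live_events;

-- Refresh the statistics used by the query planner.
ANALYZE live_events;
```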

Experiment!

Optimal settings depend upon your data and the queries you typically run. Perform some tests to compare SORTKEY and DISTKEY values and choose the settings that perform best. Then test again in 3 months to see whether your queries or data have changed enough to make other settings more efficient.
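
One possible way to compare settings is to run the same query against each variant of the table and read the elapsed times back from the STL_QUERY system table. This is a sketch under assumptions: it reuses the query from the question, and it turns off the session result cache so that repeated runs are actually executed.

```sql
-- Avoid measuring cached results instead of real executions.
SET enable_result_cache_for_session TO off;

-- Run the query under test (taken from the question).
SELECT
    date_trunc('day', ti) AS date,
    count(DISTINCT deviceID) AS count
FROM
    live_events
WHERE
    brandID = 3927
    AND ti >= '2017-08-02T00:00:00+00:00'
    AND ti <= '2017-09-02T00:00:00+00:00'
GROUP BY
    1;

-- List the most recent queries that touched the table, with elapsed time.
SELECT
    query,
    TRIM(querytxt) AS sql_text,
    DATEDIFF(ms, starttime, endtime) AS elapsed_ms
FROM
    stl_query
WHERE
    querytxt ILIKE '%live_events%'
ORDER BY
    starttime DESC
LIMIT 10;
```

The first run of a query on Redshift includes compilation time, so it is worth running each variant more than once and comparing the later runs.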
