为什么 Redshift 不需要物化视图或索引? [英] Why does Redshift not need materialized views or indexes?

查看:55
本文介绍了为什么 Redshift 不需要物化视图或索引?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

Redshift FAQ 下的

问:与大多数传统的数据仓库和分析数据库相比,Amazon Redshift 的性能如何?

Q: How does the performance of Amazon Redshift compare to most traditional databases for data warehousing and analytics?

内容如下:

高级压缩:与基于行的数据存储相比,列式数据存储的压缩率要高得多,因为类似的数据按顺序存储在磁盘上.Amazon Redshift 采用多种压缩技术,并且通常可以实现相对于传统关系数据存储的显着压缩.此外,Amazon Redshift 不需要索引或物化视图,因此比传统关系数据库系统使用更少的空间.将数据加载到空表中时,Amazon Redshift 会自动对您的数据进行采样并选择最合适的压缩方案.

Advanced Compression: Columnar data stores can be compressed much more than row-based data stores because similar data is stored sequentially on disk. Amazon Redshift employs multiple compression techniques and can often achieve significant compression relative to traditional relational data stores. In addition, Amazon Redshift doesn't require indexes or materialized views and so uses less space than traditional relational database systems. When loading data into an empty table, Amazon Redshift automatically samples your data and selects the most appropriate compression scheme.

为什么会这样?

推荐答案

说实话(在我看来)有点虚伪.尽管 RedShift 没有这些,但我不确定这是否等于说它不会从它们中受益.

It's a bit disingenuous to be honest (in my opinion). Although RedShift has neither of these, I'm not sure that's the same as saying it wouldn't benefit from them.

物化视图

我不知道他们为什么提出这个要求.可能是因为他们认为引擎的性能如此之好,以至于拥有它们的收益微乎其微.

I have no real idea why they make this claim. Possibly because they consider the engine so performant that the gains from having them are minimal.

我会对此提出异议,我从事的产品维护自己的物化视图,并且可以通过这样做显示出显着的性能提升.也许 AWS 认为我一定是做错了什么?

I would dispute this and the product I work on maintains its own materialised views and can show significant performance gains from doing so. Perhaps AWS believe I must be doing something wrong in the first place?

索引

RedShift 没有索引.

RedShift does not have indexes.

它确实有 SORT ORDER,这与聚集索引异常相似.它只是一个字段列表,数据按其排序(如复合聚集索引).

It does have SORT ORDER which is exceptionally similar to a clustered index. It is simply a list of fields by which the data is ordered (like a composite clustered index).

它最近甚至引入了INTERLEAVED SORT KEYS.这是对多个独立排序顺序的直接尝试.不是按 a THEN b THEN c 排序,而是有效按每个同时排序.

It even has recently introduced INTERLEAVED SORT KEYS. This is a direct attempt to have multiple independent sort orders. Instead of ordering by a THEN b THEN c it effectively orders by each of them at the same time.

由于 RedShift 实现其列存储的方式,这成为可能.
- 每列都与其他列分开存储
- 每列存储在 1MB 块中
- 每个 1MB 区块都有汇总统计

That becomes kind of possible because of how RedShift implements its column store.
- Each column is stored separately from each other column
- Each column is stored in 1MB blocks
- Each 1MB block has summary statistics

除了作为存储模式之外,这还有效地变成了一组伪索引.
- 如果数据按 a then b then x
排序- 但你想要 z = 1234
- RedShift 首先查看块统计信息(对于列 z)
- 这些统计数据将说明该块存储的最小值和最大值
- 这允许 Redshift 在某些情况下跳过其中的许多块
- 这个实习生允许 RedShift 识别从其他列读取哪些块

As well as being the storage pattern this effectively becomes a set of pseudo indexes.
- If the data is sorted by a then b then x
- But you want z = 1234
- RedShift looks at the block statistics (for column z) first
- Those stats will say the minimum and maximum values stored by that block
- This allows Redshift to skip many of those blocks in certain conditions
- This intern allows RedShift to identify which blocks to read from the other columns

这篇关于为什么 Redshift 不需要物化视图或索引?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆