平面表的 Redshift 性能与维度和事实 [英] Redshift Performance of Flat Tables Vs Dimension and Facts

查看:26
本文介绍了平面表的 Redshift 性能与维度和事实的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试在平面 OLTP 表(不在 3NF 中)上创建维度模型.

I am trying to create dimensional model on a flat OLTP tables (not in 3NF).

有些人认为不需要维度模型表,因为报告的大部分数据都呈现单表.但是该表包含的内容超过了我们需要的 300 列.我是否仍应将平面表分为维度和事实,还是直接在报告中使用平面表.

There are people who are thinking dimensional model table is not required because most of the data for the report present single table. But that table contains more than what we need like 300 columns. Should I still separate flat table into dimensions and facts or just use the flat tables directly in the reports.

推荐答案

当创建纯粹用于报告目的的表时(这是典型的数据仓库),通常创建宽,带有非规范化数据的平面表,因为:

When creating tables purely for reporting purposes (as is typical in a Data Warehouse), it is customary to create wide, flat tables with non-normalized data because:

  • 查询更方便
  • 它避免了对因果用户来说可能会造成混淆和错误的 JOIN
  • 查询运行速度更快(尤其是对于使用列式数据存储的数据仓库系统)
  • It is easier to query
  • It avoids JOINs that can be confusing and error-prone for causal users
  • Queries run faster (especially for Data Warehouse systems that use columnar data storage)

这种数据格式非常适合报告,但不适合应用程序的普通数据存储——用于 OLTP 的数据库应该使用规范化表.

This data format is great for reporting, but is not suitable for normal data storage for applications — a database being used for OLTP should use normalized tables.

不要担心有大量的列——这对于数据仓库来说是很正常的.然而,300 列确实听起来相当大,并表明它们不一定被明智地使用.因此,您可能需要检查它们是否是必需的.

Do not be worried about having a large number of columns — this is quite normal for a Data Warehouse. However, 300 columns does sound rather large and suggests that they aren't necessarily being used wisely. So, you might want to check whether they are required.

许多列的一个很好的例子是具有易于编写 WHERE 子句的标志,例如 WHERE customer_is_active 而不必加入另一个表并确定他们是否使用过该服务在过去 30 天内.这些列需要每天重新计算,但是查询数据非常方便.

A great example of many columns is to have flags that make it easy to write WHERE clauses, such as WHERE customer_is_active rather than having to join to another table and figuring out whether they have used the service in the past 30 days. These columns would need to be recalculated daily, but are very convenient for querying data.

底线:在使用数据仓库时,您应该将易用性置于性能之上.然后,找出如何使用数据仓库系统(例如 Amazon Redshift)来优化访问,该系统旨在非常有效地处理此类数据.

Bottom line: You should put ease of use above performance when using Data Warehousing. Then, figure out how to optimize access by using a Data Warehousing system such as Amazon Redshift that is designed to handle this type of data very efficiently.

这篇关于平面表的 Redshift 性能与维度和事实的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆