Data Warehousing - Star Schema vs Flat Table

Question

I'm trying to design a Data Warehouse for a single store of commonly required data ranging from finance systems, project scheduling systems and a myriad of scientific systems. I.e. many different data marts.

I have been reading up on Data Warehousing and popular methods such as star schemas and the Kimball method, but one question I cannot find an answer to is:

Why is it better to design your DW Data Mart as a star schema rather than a single flat table?

Surely having no joins between facts and attributes/dimensions is faster and simpler than having lots of small joins to all the dimension tables? Disk space is not a problem, we'll just throw more disks at the database if necessary. Is the star schema slightly outdated these days or is it still data architect dogma?

Answer

Your question is very good: the Kimball mantra for dimensional modelling is to improve performance and to improve usability.

But I don't think it is outdated, or dogma; it is a reasonable, practical approach for many situations and platforms.

The way relational DBs store data means there's a balancing act to be struck between the numbers and types of tables, the routes in to the data for typical queries, easy maintainability and description of relationships between data, the numbers of joins, the way the joins are constructed, the indexability of columns, etc.

3NF (or further) is one end of the spectrum, suiting OLTP systems, and a single table is the other end of the spectrum. Dimensional models are in the middle and appropriate for reporting, at least when using certain technologies.

Performance isn't all about 'number of joins', although a star schema performs better for reporting workloads than a fully normalised database, in part because of a reduced number of joins. Dimensions are typically very wide. If you include all those dimension fields in every row of every fact, you have very large rows indeed, and finding your way into those rows will perform very badly for typical queries.
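
To put rough numbers on the row-width point, here is a back-of-envelope sketch. Every figure in it (row counts, attribute widths, number of dimensions) is an assumption picked for illustration, not a measurement from any real system. The point is not disk cost but how many bytes a scan of the fact data has to read.

```python
# Back-of-envelope comparison of row widths. All constants are assumed
# values for illustration only.
N_FACT_ROWS = 100_000_000   # assumed number of fact rows
N_DIMENSIONS = 5            # assumed dimensions per fact
DIM_ATTR_BYTES = 400        # assumed average width of one dimension's attributes
DIM_ROWS = 10_000           # assumed rows per dimension table
KEY_BYTES = 4               # integer surrogate key
MEASURE_BYTES = 3 * 8       # three 8-byte numeric measures

# Flat table: every fact row carries every dimension attribute.
flat_row_bytes = MEASURE_BYTES + N_DIMENSIONS * DIM_ATTR_BYTES

# Star schema: the fact row carries only keys and measures.
star_fact_row_bytes = MEASURE_BYTES + N_DIMENSIONS * KEY_BYTES

flat_total_bytes = N_FACT_ROWS * flat_row_bytes
star_total_bytes = (
    N_FACT_ROWS * star_fact_row_bytes
    + N_DIMENSIONS * DIM_ROWS * (KEY_BYTES + DIM_ATTR_BYTES)
)

print(f"flat row: {flat_row_bytes} bytes, star fact row: {star_fact_row_bytes} bytes")
print(f"flat total: {flat_total_bytes / 1e9:.1f} GB, star total: {star_total_bytes / 1e9:.1f} GB")
```

Under these assumptions the flat row is 2024 bytes against 44 bytes for the star fact row, so a full scan of the flat table reads roughly 46 times as much data. Columnar storage and compression change this picture considerably, which is part of why the answer depends on the technology in use.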

Facts are numerous, so if you can make those tables compact, with the 'wordier' dimensions filterable, you hit a sweet spot of performance that a single table isn't going to match, unless heavily indexed.
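
As a minimal sketch of what "compact facts, filterable dimensions" looks like in practice, the following uses Python's built-in sqlite3 module with an in-memory database. The table and column names (dim_project, fact_cost, flat_cost, department and so on) are hypothetical and chosen to resemble the finance and project data in the question. It shows the same report answered from a star schema and from a flat table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Wide, wordy attributes live once per project in the dimension.
CREATE TABLE dim_project (
    project_key  INTEGER PRIMARY KEY,
    project_name TEXT,
    department   TEXT,
    funding_body TEXT
);
-- The fact table stays narrow: keys and measures only.
CREATE TABLE fact_cost (
    project_key INTEGER REFERENCES dim_project(project_key),
    date_key    INTEGER,
    amount      REAL
);
-- The flat alternative repeats every attribute on every row.
CREATE TABLE flat_cost (
    project_name TEXT,
    department   TEXT,
    funding_body TEXT,
    date_key     INTEGER,
    amount       REAL
);
""")

conn.executemany(
    "INSERT INTO dim_project VALUES (?, ?, ?, ?)",
    [(1, "Alpha", "Physics", "Council A"), (2, "Beta", "Biology", "Council B")],
)
conn.executemany(
    "INSERT INTO fact_cost VALUES (?, ?, ?)",
    [(1, 20240101, 100.0), (1, 20240102, 50.0), (2, 20240101, 75.0)],
)
# Build the flat table by denormalising the star.
conn.execute("""
    INSERT INTO flat_cost
    SELECT d.project_name, d.department, d.funding_body, f.date_key, f.amount
    FROM fact_cost f
    JOIN dim_project d ON d.project_key = f.project_key
""")

# Star: filter the small dimension, then join to the narrow fact.
star_sum = conn.execute("""
    SELECT SUM(f.amount)
    FROM fact_cost f
    JOIN dim_project d ON d.project_key = f.project_key
    WHERE d.department = 'Physics'
""").fetchone()[0]

# Flat: filter directly on the wide table.
flat_sum = conn.execute(
    "SELECT SUM(amount) FROM flat_cost WHERE department = 'Physics'"
).fetchone()[0]

print(star_sum, flat_sum)
```

Both queries return the same total. The difference is in what the engine has to read: the star query filters a small dimension and touches narrow fact rows, while the flat query must either scan wide rows or rely on an index on department (and on every other attribute people filter by).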

And yes, a single table for a fact is simpler in terms of numbers of tables, but is it really easier to navigate? Dimensions and facts are easy concepts to understand, and what if you want to run your queries across facts? You've got many different data marts, but one of the benefits of having a data warehouse in the first place is that these aren't distinct: they're related and can be reported across. Conformed dimensions enable this.
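
Here is a small sketch of reporting across facts through a conformed dimension, again using Python's sqlite3 module. The two fact tables (fact_cost from a finance mart, fact_hours from a scheduling mart) and the shared dim_project table are hypothetical names for illustration. Each fact is aggregated to the dimension's grain first and the results are then joined, which avoids double counting when the facts have different numbers of rows per project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One project dimension shared (conformed) across both marts.
CREATE TABLE dim_project (project_key INTEGER PRIMARY KEY, project_name TEXT);
-- Finance mart fact.
CREATE TABLE fact_cost  (project_key INTEGER, amount REAL);
-- Scheduling mart fact.
CREATE TABLE fact_hours (project_key INTEGER, hours REAL);

INSERT INTO dim_project VALUES (1, 'Alpha'), (2, 'Beta');
INSERT INTO fact_cost   VALUES (1, 100.0), (1, 50.0), (2, 75.0);
INSERT INTO fact_hours  VALUES (1, 8.0), (2, 4.0), (2, 6.0);
""")

# Aggregate each fact separately, then join on the shared key.
rows = conn.execute("""
    SELECT d.project_name, c.total_cost, h.total_hours
    FROM dim_project d
    JOIN (SELECT project_key, SUM(amount) AS total_cost
          FROM fact_cost GROUP BY project_key) c
      ON c.project_key = d.project_key
    JOIN (SELECT project_key, SUM(hours) AS total_hours
          FROM fact_hours GROUP BY project_key) h
      ON h.project_key = d.project_key
    ORDER BY d.project_name
""").fetchall()

for name, cost, hours in rows:
    print(name, cost, hours)
```

With separate flat tables per mart, each would carry its own copy of the project attributes, and nothing guarantees that those copies agree. A single shared dimension gives both facts the same keys and the same descriptions to join on.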
