Should OLAP databases be denormalized for read performance?


Question


I always thought that databases should be denormalized for read performance, as is done for OLAP database design, and that 3NF should not be taken much further for OLTP design.

PerformanceDBA, in various posts, e.g. in Performance of different approaches to time-based data, defends the paradigm that a database should always be well-designed by normalization to 5NF and 6NF (Normal Form).

Have I understood it correctly (and what exactly have I understood correctly)?

What's wrong with the traditional denormalization approach/paradigm for OLAP database design (below 3NF), and with the advice that 3NF is enough for most practical cases of OLTP databases?

For example:

I should confess that I could never grasp the theory that denormalization facilitates read performance. Can anybody give me references with good logical explanations of this and of the contrary beliefs?

What are sources to which I can refer when trying to convince my stakeholders that OLAP/Data Warehousing databases should be normalized?

To improve visibility, I have copied this here from the comments:

"It would be nice if participants would add (disclose) how many real-life (no science projects included) data-warehouse implementations in 6NF they have seen or participated in. Kind of a quick-pool. Me = 0." – Damir Sudarevic

Wikipedia's Data Warehouse article says:

"The normalized approach [vs. dimensional one by Ralph Kimball], also called the 3NF model (Third Normal Form) whose supporters are referred to as "Inmonites", believe in Bill Inmon's approach in which it is stated that the data warehouse should be modeled using an E-R model/normalized model."

It looks like the normalized data warehousing approach (by Bill Inmon) is perceived as not exceeding 3NF (?)

I just want to understand what the origin is of the myth (or ubiquitous axiomatic belief) that data warehousing/OLAP is synonymous with denormalization.

Damir Sudarevic answered that it is the well-paved approach. Let me return to the question: why is denormalization believed to facilitate reading?

Solution

Mythology

I always thought that databases should be denormalized for reading, as is done for OLAP database design, and that 3NF should not be taken much further for OLTP design.

There's a myth to that effect. In the Relational Database context, I have re-implemented six very large so-called "de-normalised" "databases"; and executed over eighty assignments correcting problems on others, simply by Normalising them, applying Standards and engineering principles. I have never seen any evidence for the myth. Only people repeating the mantra as if it were some sort of magical prayer.

Normalisation vs Un-normalised

("De-normalisation" is a fraudulent term I refuse to use it.)

This is a scientific industry (at least the bit that delivers software that does not break; that put people on the Moon; that runs banking systems; etc). It is governed by the laws of physics, not magic. Computers and software are all finite, tangible, physical objects that are subject to the laws of physics. According to the secondary and tertiary education I received:

  • it is not possible for a bigger, fatter, less organised object to perform better than a smaller, thinner, more organised object.

  • Normalisation yields more tables, yes, but each table is much smaller. And even though there are more tables, there are in fact (a) fewer joins and (b) the joins are faster because the sets are smaller. Fewer Indices are required overall, because each smaller table needs fewer indices. Normalised tables also yield much shorter row sizes.

  • for any given set of resources, Normalised tables:

    • fit more rows into the same page size
    • therefore fit more rows into the same cache space, therefore overall throughput is increased
    • therefore fit more rows into the same disk space, therefore the number of I/Os is reduced; and when I/O is called for, each I/O is more efficient.
  • it is not possible for an object that is heavily duplicated to perform better than an object that is stored as a single version of the truth. Eg. when I removed the 5 x duplication at the table and column level, all the transactions were reduced in size; locking was reduced; the Update Anomalies disappeared. That substantially reduced contention and therefore increased concurrent use.

The overall result was therefore much, much higher performance.
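To make the row-width point above concrete, here is a minimal sketch, using invented table and column names (they are not from any of the assignments described), in generic SQL. The un-normalised table repeats the customer detail on every order row; the Normalised pair stores each customer fact once, so the order rows are far narrower, more rows fit per page and per cache slot, and there is nothing duplicated to keep in sync.

    -- Hypothetical illustration only (invented names, generic SQL).
    -- Un-normalised: every order row carries the full customer detail.
    CREATE TABLE order_flat (
        order_id       INT           NOT NULL PRIMARY KEY,
        order_date     DATE          NOT NULL,
        amount         DECIMAL(12,2) NOT NULL,
        customer_name  VARCHAR(100)  NOT NULL,  -- repeated on every order row
        customer_addr  VARCHAR(200)  NOT NULL,  -- repeated on every order row
        customer_phone VARCHAR(30)   NOT NULL   -- repeated on every order row
    );

    -- Normalised: each customer fact is stored exactly once;
    -- the order rows are short, so far more of them fit in each page.
    CREATE TABLE customer (
        customer_id    INT           NOT NULL PRIMARY KEY,
        customer_name  VARCHAR(100)  NOT NULL,
        customer_addr  VARCHAR(200)  NOT NULL,
        customer_phone VARCHAR(30)   NOT NULL
    );

    CREATE TABLE customer_order (
        order_id    INT           NOT NULL PRIMARY KEY,
        customer_id INT           NOT NULL REFERENCES customer (customer_id),
        order_date  DATE          NOT NULL,
        amount      DECIMAL(12,2) NOT NULL
    );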

In my experience, which is delivering both OLTP and OLAP from the same database, there has never been a need to "de-normalise" my Normalised structures, to obtain higher speed for read-only (OLAP) queries. That is a myth as well.

  • No, the "de-normalisation" requested by others reduced speed, and it was eliminated. No surprise to me, but again, the requesters were surprised.

Many books have been written by people selling the myth. It needs to be recognised that these are non-technical people; since they are selling magic, the magic they sell has no scientific basis, and they conveniently avoid the laws of physics in their sales pitch.

(For anyone who wishes to dispute the above physical science: merely repeating the mantra will not have any effect; please supply specific evidence supporting the mantra.)

Why is the Myth Prevalent ?

Well, first, it is not prevalent among scientific types, who do not seek ways of overcoming the laws of physics.

From my experience, I have identified three major reasons for the prevalence:

  1. For those people who cannot Normalise their data, it is a convenient justification for not doing so. They can refer to the magic book, and without any evidence for the magic, they can reverently say "see, a famous writer validates what I have done". Not Done, most accurately.

  2. Many SQL coders can write only simple, single-level SQL. Normalised structures require a bit of SQL capability. If they do not have that; if they cannot produce SELECTs without using temporary tables; if they cannot write Sub-queries, they will be psychologically glued to the hip to flat files (which is what "de-normalised" structures are), which they can process. (A small example of such a query follows this list.)

  3. People love to read books, and to discuss theories. Without experience. Especially re magic. It is a tonic, a substitute for actual experience. Anyone who has actually Normalised a database correctly has never stated that "de-normalised is faster than normalised". To anyone stating the mantra, I simply say "show me the evidence", and they have never produced any. So the reality is, people repeat the mythology for these reasons, without any experience of Normalisation. We are herd animals, and the unknown is one of our biggest fears.

    That is why I always include "advanced" SQL and mentoring on any project.
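To make point 2 concrete: a hypothetical single SELECT (against the invented customer and customer_order tables sketched earlier) of the kind a coder comfortable with Sub-queries writes directly, where a flat-file coder would first materialise a temporary table.

    -- Hypothetical example: one query, no temporary table, using the
    -- invented customer / customer_order tables from the earlier sketch.
    SELECT c.customer_name,
           o.order_count,
           o.total_amount
      FROM customer AS c
      JOIN (SELECT customer_id,
                   COUNT(*)    AS order_count,
                   SUM(amount) AS total_amount
              FROM customer_order
             GROUP BY customer_id) AS o
        ON o.customer_id = c.customer_id
     ORDER BY o.total_amount DESC;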

My Answer

This Answer is going to be ridiculously long if I answer every part of your question or if I respond to the incorrect elements in some of the other answers. Eg. the above has answered just one item. Therefore I will answer your question in total without addressing the specific components, and take a different approach. I will deal only in the science related to your question, that I am qualified in, and very experienced with.

Let me present the science to you in manageable segments.
The typical model of the six large-scale, full implementation assignments was as follows:

  • These were the closed "databases" commonly found in small firms, and the organisations were large banks
  • very nice for a first generation, get-the-app-running mindset, but a complete failure in terms of performance, integrity and quality
  • they were designed for each app, separately
  • reporting was not possible, they could only report via each app
  • since "de-normalised" is a myth, the accurate technical definition is, they were un-normalised
    • In order to "de-normalise", one must Normalise first, then reverse the process a little. In every instance where people showed me their "de-normalised" data models, the simple fact was, they had not Normalised at all; so "de-normalisation" was not possible; they were simply un-normalised
  • since they did not have much Relational technology, or the structures and control of Databases, but they were passed off as "databases", I have placed those words in quotation marks
  • as is scientifically guaranteed for un-normalised structures, they suffered multiple versions of the truth (data duplication) and therefore high contention and low concurrency, within each of them
  • they had an additional problem of data duplication across the "databases"
  • the organisation was trying to keep all those duplicates synchronised, so they implemented replication; which of course meant an additional server; ETL and synching scripts to be developed; and maintained; etc
  • needless to say, the synching was never quite enough and they were forever changing it
  • with all that contention and low throughput, it was no problem at all justifying a separate server for each "database". It did not help much.

So we contemplated the laws of physics, and we applied a little science.
We implemented the Standard concept that the data belongs to the corporation (not the departments) and the corporation wanted one version of the truth. The Database was pure Relational, Normalised to 5NF. Pure Open Architecture, so that any app or report tool could access it. All transactions in stored procs (as opposed to uncontrolled strings of SQL all over the network). The same developers for each app coded the new apps, after our "advanced" education.
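As a rough sketch of "all transactions in stored procs" (Transact-SQL-style syntax assumed here; the proc and table names are the invented ones from the earlier sketch, not from the actual project):

    -- Hypothetical sketch: the client calls one proc; no ad-hoc strings of SQL
    -- cross the network, and the unit of work is controlled server-side.
    CREATE PROCEDURE customer_order_add
        @order_id    INT,
        @customer_id INT,
        @order_date  DATE,
        @amount      DECIMAL(12,2)
    AS
    BEGIN
        BEGIN TRANSACTION;

        INSERT INTO customer_order (order_id, customer_id, order_date, amount)
        VALUES (@order_id, @customer_id, @order_date, @amount);

        -- a real proc would carry the full business transaction,
        -- plus validation and error handling
        COMMIT TRANSACTION;
    END;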

Evidently the science worked. Well, it wasn't my private science or magic, it was ordinary engineering and the laws of physics. All of it ran on one database server platform; two pairs (production & DR) of servers were decommissioned and given to another department. The 5 "databases" totalling 720GB were Normalised into one Database totalling 450GB. About 700 tables (many duplicates and duplicated columns) were normalised into 500 unduplicated tables. It performed much faster, as in 10 times faster overall, and more than 100 times faster in some functions. That did not surprise me, because that was my intention, and the science predicted it, but it surprised the people with the mantra.

More Normalisation

Well, having had success with Normalisation in every project, and confidence with the science involved, it has been a natural progression to Normalise more, not less. In the old days 3NF was good enough, and later NFs were not yet identified. In the last 20 years, I have only delivered databases that had zero update anomalies, so it turns out that, by today's definitions of NFs, I have always delivered 5NF.

Likewise, 5NF is great but it has its limitations. Eg. Pivoting large tables (not small result sets as per the MS PIVOT Extension) was slow. So I (and others) developed a way of providing Normalised tables such that Pivoting was (a) easy and (b) very fast. It turns out, now that 6NF has been defined, that those tables are 6NF.

Since I provide OLAP and OLTP from the same database, I have found that, consistent with the science, the more Normalised the structures are:

  • the faster they perform

  • and they can be used in more ways (eg Pivots)

So yes, I have consistent and unvarying experience, that not only is Normalised much, much faster than un-normalised or "de-normalised"; more Normalised is even faster than less normalised.

One sign of success is growth in functionality (the sign of failure is growth in size without growth in functionality). Which meant they immediately asked us for more reporting functionality, which meant we Normalised even more, and provided more of those specialised tables (which turned out years later, to be 6NF).

Progressing on that theme. I was always a Database specialist, not a data warehouse specialist, so my first few projects with warehouses were not full-blown implementations, but rather, they were substantial performance tuning assignments. They were in my ambit, on products that I specialised in.
Let's not worry about the exact level of normalisation, etc, because we are looking at the typical case. We can take it as given that the OLTP database was reasonably normalised, but not capable of OLAP, and the organisation had purchased a completely separate OLAP platform, hardware; invested in developing and maintaining masses of ETL code; etc. And following implementation, they then spent half their lives managing the duplicates they had created. Here the book writers and vendors need to be blamed, for the massive waste of hardware and separate platform software licences they cause organisations to purchase.

  • If you have not observed it yet, I would ask you to notice the similarities between the Typical First Generation "database" and the Typical Data Warehouse

Meanwhile, back at the farm (the 5NF Databases above) we just kept adding more and more OLAP functionality. Sure the app functionality grew, but that was little, the business had not changed. They would ask for more 6NF and it was easy to provide (5NF to 6NF is a small step; 0NF to anything, let alone 5NF, is a big step; an organised architecture is easy to extend).

One major difference between OLTP and OLAP, the basic justification of separate OLAP platform software, is that the OLTP is row-oriented, it needs transactionally secure rows, and fast; and the OLAP doesn't care about the transactional issues, it needs columns, and fast. That is the reason all the high end BI or OLAP platforms are column-oriented, and that is why the OLAP models (Star Schema, Dimension-Fact) are column-oriented.

But with the 6NF tables:

  • there are no rows, only columns; we serve up rows and columns at the same blinding speed

  • the tables (ie. the 5NF view of the 6NF structures) are already organised into Dimension-Facts. In fact they are organised into more Dimensions than any OLAP model would ever identify, because they are all Dimensions.

  • Pivoting entire tables with aggregation on the fly (as opposed to the PIVOT of a small number of derived columns) is (a) effortless, simple code and (b) very fast (a small sketch follows this list)
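Here is the promised sketch, using an invented, much-simplified table (not the Data Model published below) and plain generic SQL with no vendor PIVOT extension: a 6NF-style table of Primary Key plus one non-key column, and a query that pivots Metrics into columns with aggregation on the fly.

    -- Hypothetical 6NF-style structure: the Primary Key plus one non-key column.
    -- (A fuller model would also have Server and Metric reference tables; omitted here.
    --  Generic SQL; exact date/time types and functions vary by platform.)
    CREATE TABLE server_metric (
        server_name  VARCHAR(30)   NOT NULL,
        metric_name  VARCHAR(30)   NOT NULL,
        sample_time  TIMESTAMP     NOT NULL,
        metric_value DECIMAL(18,4) NOT NULL,
        PRIMARY KEY (server_name, metric_name, sample_time)
    );

    -- Pivot on the fly: Metrics become columns, aggregated per server per day.
    SELECT server_name,
           CAST(sample_time AS DATE)                                       AS sample_day,
           AVG(CASE WHEN metric_name = 'cpu_busy'   THEN metric_value END) AS cpu_busy,
           AVG(CASE WHEN metric_name = 'disk_io'    THEN metric_value END) AS disk_io,
           AVG(CASE WHEN metric_name = 'lock_waits' THEN metric_value END) AS lock_waits
      FROM server_metric
     GROUP BY server_name, CAST(sample_time AS DATE)
     ORDER BY server_name, sample_day;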

What we have been supplying for many years, by definition, is Relational Databases with at least 5NF for OLTP use, and 6NF for OLAP requirements.

  • Notice that it is the very same science that we have used from the outset, to move from the Typical un-normalised "databases" to the 5NF Corporate Database. We are simply applying more of the proven science, and obtaining higher orders of functionality and performance.

  • Notice the similarity between 5NF Corporate Database and 6NF Corporate Database

  • The entire cost of separate OLAP hardware, platform software, ETL, administration, maintenance, are all eliminated.

  • There is only one version of the data, no update anomalies or maintenance thereof; the same data served up for OLTP as rows, and for OLAP as columns

The only thing we have not done, is to start off on a new project, and declare pure 6NF from the start. That is what I have lined up next.

What is Sixth Normal Form ?

Assuming you have a handle on Normalisation (I am not going to define it here), the non-academic definitions relevant to this thread are as follows. Note that it applies at the table level, hence you can have a mix of 5NF and 6NF tables in the same database. (A minimal, invented schema sketch follows the definitions.)

  • Fifth Normal Form: all Functional Dependencies resolved across the database
    • in addition to 4NF/BCNF
    • every non-PK column is 1::1 with its PK
    • and to no other PK
    • No Update Anomalies
  • Sixth Normal Form: the irreducible NF, the point at which the data cannot be further reduced or Normalised (there will not be a 7NF)
    • in addition to 5NF
    • the row consists of a Primary Key, and at most, one non-key column
    • eliminates The Null Problem
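And here is the promised sketch, with invented names, purely for illustration: the same three attributes held as one 5NF table, and as three 6NF tables of Primary Key plus one non-key column each. In 6NF an unknown attribute is an absent row, not a NULL, and a 5NF view over the 6NF tables gives the familiar row picture back.

    -- 5NF: one row per part, several non-key columns; attributes that are not
    -- yet known must be held as NULLs.
    CREATE TABLE part_5nf (
        part_id     INT          NOT NULL PRIMARY KEY,
        part_name   VARCHAR(60)  NOT NULL,
        part_weight DECIMAL(8,2) NULL,
        part_colour VARCHAR(20)  NULL
    );

    -- 6NF: one table per attribute; each row is the Primary Key plus at most
    -- one non-key column, and an unknown attribute is simply an absent row.
    CREATE TABLE part_name (
        part_id   INT         NOT NULL PRIMARY KEY,
        part_name VARCHAR(60) NOT NULL
    );

    CREATE TABLE part_weight (
        part_id     INT          NOT NULL PRIMARY KEY,
        part_weight DECIMAL(8,2) NOT NULL
    );

    CREATE TABLE part_colour (
        part_id     INT         NOT NULL PRIMARY KEY,
        part_colour VARCHAR(20) NOT NULL
    );

    -- A 5NF view over the 6NF tables serves up the familiar row-oriented picture.
    CREATE VIEW part_5nf_view AS
    SELECT n.part_id,
           n.part_name,
           w.part_weight,
           c.part_colour
      FROM part_name AS n
      LEFT JOIN part_weight AS w ON w.part_id = n.part_id
      LEFT JOIN part_colour AS c ON c.part_id = n.part_id;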

What Does 6NF Look Like ?

The Data Models belong to the customers, and our Intellectual Property is not available for free publication. But I do attend this web-site, and provide specific answers to questions. You do need a real world example, so I will publish the Data Model for one of our internal utilities.

This one is for the collection of server monitoring data (enterprise-class database server and OS) for any number of customers, for any period. We use this to analyse performance issues remotely, and to verify any performance tuning that we do. The structure has not changed in over ten years (added to, with no change to the existing structures); it is typical of the specialised 5NF that many years later was identified as 6NF. It allows full pivoting; any chart or graph to be drawn, on any Dimension (22 Pivots are provided, but that is not a limit); slice and dice; mix and match. Notice they are all Dimensions.

The monitoring data or Metrics or vectors can change (server version changes; we want to pick up something more) without affecting the model (you may recall in another post I stated EAV is the bastard son of 6NF; well this is full 6NF, the undiluted father, and therefore provides all features of EAV, without sacrificing any Standards, integrity or Relational power); you merely add rows.
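As a hypothetical illustration of "you merely add rows" (my own much-simplified sketch, not the Data Model linked below): with a narrow key-plus-value structure such as the server_metric table sketched earlier, a new Metric or vector is just new data; no ALTER TABLE, no change to the existing structures.

    -- Assumed structure (hypothetical, from the earlier sketch; generic SQL):
    --   server_metric (server_name, metric_name, sample_time, metric_value)
    -- A new Metric collected after a server upgrade is just new rows; no ALTER TABLE.
    INSERT INTO server_metric (server_name, metric_name, sample_time, metric_value)
    VALUES ('CUST1_SRV01', 'spinlock_contention', TIMESTAMP '2011-03-01 10:00:00', 12.5);

    -- An OS-level vector sits alongside the server vectors in exactly the same way.
    INSERT INTO server_metric (server_name, metric_name, sample_time, metric_value)
    VALUES ('CUST1_SRV01', 'os_run_queue', TIMESTAMP '2011-03-01 10:00:00', 3.0);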

▶Monitor Statistics Data Model◀. (too large for inline; some browsers cannot load inline; click the link)

It allows me to produce these ▶Charts Like This◀, six keystrokes after receiving a raw monitoring stats file from the customer. Notice the mix-and-match; OS and server on the same chart; a variety of Pivots. (Used with permission.)

Readers who are unfamiliar with the Standard for Modelling Relational Databases may find the ▶IDEF1X Notation◀ helpful.

6NF Data Warehouse

This has been recently validated by Anchor Modeling, in that they are now presenting 6NF as the "next generation" OLAP model for data warehouses. (They do not provide the OLTP and OLAP from a single version of the data; that is ours alone.)

Data Warehouse (Only) Experience

My experience with Data Warehouses only (not the above 6NF OLTP-OLAP Databases) has been several major assignments, as opposed to full implementation projects. The results were, no surprise:

  • consistent with the science, Normalised structures perform much faster; are easier to maintain; and require less data synching. Inmon, not Kimball.

  • consistent with the magic, after I Normalise a bunch of tables, and deliver substantially improved performance via application of the laws of physics, the only people surprised are the magicians with their mantras.

Scientifically minded people do not do that; they do not believe in, or rely upon, silver bullets and magic; they use science and hard work to resolve their problems.

Valid Data Warehouse Justification

That is why I have stated in other posts, the only valid justification for a separate Data Warehouse platform, hardware, ETL, maintenance, etc, is where there are many Databases or "databases", all being merged into a central warehouse, for reporting and OLAP.

Kimball

A word on Kimball is necessary, as he is the main proponent of "de-normalised for performance" in data warehouses. As per my definitions above, he is one of those people who have evidently never Normalised in their lives; his starting point was un-normalised (camouflaged as "de-normalised") and he simply implemented that in a Dimension-Fact model.

  • Of course, to obtain any performance, he had to "de-normalise" even more, and create further duplicates, and justify all that.

    • So therefore it is true, in a schizophrenic sort of way, that "de-normalising" un-normalised structures, by making more specialised copies, "improves read performance". It is not true when the whole is taken into account; it is true only inside that little asylum, not outside.

    • Likewise it is true, in that crazy way, that where all the "tables" are monsters, "joins are expensive" and something to be avoided. They have never had the experience of joining smaller tables and sets, so they cannot believe the scientific fact that more, smaller tables are faster.

    • they have experience that creating duplicate "tables" is faster, so they cannot believe that eliminating duplicates is even faster than that.

  • his Dimensions are added to the un-normalised data. Well the data is not Normalised, so no Dimensions are exposed. Whereas in a Normalised model, the Dimensions are already exposed, as an integral part of the data, no addition is required.

  • that well-paved path of Kimball's leads to the cliff, where more lemmings fall to their deaths, faster. Lemmings are herd animals, as long as they are walking the path together, and dying together, they die happy. Lemmings do not look for other paths.

All just stories, parts of the one mythology that hang out together and support each other.

Your Mission

Should you choose to accept it. I am asking you to think for yourself, and to stop entertaining any thoughts that contradict science and the laws of physics. No matter how common or mystical or mythological they are. Seek evidence for anything before trusting it. Be scientific, verify new beliefs for yourself. Repeating the mantra "de-normalised for performance" won't make your database faster; it will just make you feel better about it. Like the fat kid sitting on the sidelines, telling himself that he can run faster than all the kids in the race.

  • on that basis, even the concept of "normalise for OLTP" but do the opposite, "de-normalise for OLAP", is a contradiction. How can the laws of physics work as stated on one computer, but work in reverse on another computer ? The mind boggles. It is simply not possible; the laws work the same way on every computer.

Questions ?
