How do I not normalize continuous data (INTS, FLOATS, DATETIME, ....)?


Question

According to my understanding (and correct me if I'm wrong), "normalization" is the process of removing redundant data from the database design.

However, while trying to learn about database optimization and performance tuning, I found that Mr. Rick James recommends against normalizing continuous values such as INTs, FLOATs, and DATETIMEs:


"Normalize, but don't over-normalize." In particular, do not normalize datetimes or floats or other "continuous" values.


Sure, purists say normalize time. That is a big mistake. Generally, "continuous" values should not be normalized because you generally want to do range queries on them. If it is normalized, performance will be orders of magnitude worse.

Normalization has several purposes; they don't really apply here:


  • Save space -- a timestamp is 4 bytes; a MEDIUMINT for normalizing is 3; not much savings.

  • Allow for changing the common value (e.g. changing "International Business Machines" to "IBM" in one place) -- not relevant here; each time was independently assigned, and you are not a Time Lord.

  • In the case of datetime, the normalization table could have extra columns like "day of week", "hour of day". Yeah, but performance still sucks.


Do not normalize "continuous" values -- dates, floats, etc -- especially if you will do range queries.

I tried to understand this point but couldn't. Can someone please explain it to me and give an example of the worst case, one where applying this rule improves performance?

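Note: I could have asked him in a comment, but I wanted to document this separately and highlight it, because I think it is a very important point that affects nearly all of my database's performance.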

Recommended answer

The Comments (so far) are discussing the misuse of the term "normalization". I accept that criticism. Is there a term for what is being discussed?

Let me elaborate on my 'claim' with this example. Some DBAs replace a DATE with a surrogate ID; this is likely to cause significant performance issues when a date range is used. Contrast these:

-- single table
SELECT ...
    FROM t
    WHERE x = ...
      AND date BETWEEN ... AND ...;   -- `date` is of datatype DATE/DATETIME/etc

-- extra table
SELECT ...
    FROM t
    JOIN Dates AS d  ON t.date_id = d.date_id
    WHERE t.x = ...
      AND d.date BETWEEN ... AND ...;  -- Range test is now in the other table

Moving the range test to a JOINed table causes the slowdown.

The first query can be served by this index:

INDEX(x, date)
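
As a rough illustration of the single-table case, here is a minimal sketch using Python's built-in `sqlite3` module. SQLite is only a stand-in for MySQL here, and the table contents, the index name `idx_x_date`, and the choice to store dates as ISO-8601 text are all assumptions made for the example. It asks the engine for its query plan and checks that the composite index is used for both the equality test on `x` and the range test on `date`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (
        id   INTEGER PRIMARY KEY,
        x    INTEGER NOT NULL,
        date TEXT    NOT NULL   -- ISO-8601 'YYYY-MM-DD' text sorts chronologically
    );
    -- Composite index: equality column first, range column second.
    CREATE INDEX idx_x_date ON t (x, date);
""")

# Hypothetical sample data: 1000 rows spread over 5 x-values and 28 days.
rows = [(i % 5, f"2024-01-{(i % 28) + 1:02d}") for i in range(1000)]
conn.executemany("INSERT INTO t (x, date) VALUES (?, ?)", rows)

query = "SELECT id FROM t WHERE x = ? AND date BETWEEN ? AND ?"
params = (3, "2024-01-10", "2024-01-15")

# The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, params).fetchall()
plan_text = " | ".join(row[-1] for row in plan)
matches = conn.execute(query, params).fetchall()

print(plan_text)
print(len(matches), "matching rows")
```

The plan text should mention `idx_x_date`, meaning one index lookup narrows the rows by `x` and then walks a contiguous range of dates. The exact wording of the plan varies between SQLite versions, and MySQL's `EXPLAIN` output looks different again.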

In the second query, the Optimizer will (for MySQL at least) pick one of the two tables to start with, then do a somewhat tedious back-and-forth to the other table to handle the rest of the WHERE. (Other engines use other techniques, but there is still a significant cost.)
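
The two-table variant can be sketched the same way. Again this uses Python's `sqlite3` as a stand-in for MySQL, and the sample data and the index name `idx_x_date_id` are assumptions for illustration. The point of the sketch is only that the plan now has to visit both tables: the range test lives in `Dates`, while the index on `t` can match `date_id` by equality only. It does not measure timing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Dates (
        date_id INTEGER PRIMARY KEY,
        date    TEXT NOT NULL UNIQUE
    );
    CREATE TABLE t (
        id      INTEGER PRIMARY KEY,
        x       INTEGER NOT NULL,
        date_id INTEGER NOT NULL REFERENCES Dates(date_id)
    );
    -- Best available index on t: the date is now only reachable via date_id.
    CREATE INDEX idx_x_date_id ON t (x, date_id);
""")

# Hypothetical sample data: 28 days, 1000 rows spread over 5 x-values.
days = [f"2024-01-{d:02d}" for d in range(1, 29)]
conn.executemany("INSERT INTO Dates (date_id, date) VALUES (?, ?)",
                 list(enumerate(days, start=1)))
t_rows = [(i % 5, (i % 28) + 1) for i in range(1000)]
conn.executemany("INSERT INTO t (x, date_id) VALUES (?, ?)", t_rows)

query = """
    SELECT t.id
    FROM t
    JOIN Dates AS d ON t.date_id = d.date_id
    WHERE t.x = ?
      AND d.date BETWEEN ? AND ?   -- range test is now in the other table
"""
params = (3, "2024-01-10", "2024-01-15")

# One plan step per table access; the join forces at least two.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, params)]
joined_count = len(conn.execute(query, params).fetchall())

for step in plan:
    print(step)
print(joined_count, "matching rows")
```

Whichever table the planner starts from, every candidate row needs a lookup in the other table before the full WHERE clause can be decided, which is the back-and-forth described above.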

DATE is one of several datatypes where you are likely to have a "range" test. Hence my proclamations about it applying to any "continuous" datatypes (ints, dates, floats).

Even if you don't have a range test, there may be no performance benefit from the secondary table. I often see a 3-byte DATE being replaced by a 4-byte INT, thereby making the main table larger! A "composite" index almost always will lead to a more efficient query for the single-table approach.
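
The size argument is simple arithmetic, sketched below in Python. The byte sizes are MySQL's documented storage sizes for DATE (3 bytes) and INT (4 bytes); the row count and number of distinct dates are hypothetical, and the calculation ignores row overhead and index storage.

```python
# Per-value storage sizes in bytes for MySQL column types.
DATE_BYTES = 3
INT_BYTES = 4


def date_bytes_single_table(row_count: int) -> int:
    """Bytes spent on the date column when DATE is stored inline."""
    return row_count * DATE_BYTES


def date_bytes_with_lookup(row_count: int, distinct_dates: int) -> int:
    """Bytes spent when the main table holds an INT date_id instead,
    plus a lookup table holding (INT date_id, DATE date) per distinct date."""
    main_table = row_count * INT_BYTES
    lookup_table = distinct_dates * (INT_BYTES + DATE_BYTES)
    return main_table + lookup_table


row_count = 10_000_000      # hypothetical main-table size
distinct_dates = 3_650      # hypothetical: roughly ten years of days

inline_total = date_bytes_single_table(row_count)
lookup_total = date_bytes_with_lookup(row_count, distinct_dates)

print(f"inline DATE:        {inline_total:,} bytes")
print(f"INT id plus lookup: {lookup_total:,} bytes")
```

Under these assumptions the surrogate-ID design spends more bytes on dates than the inline DATE column does, before counting the extra index needed on the lookup table.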
