基于时间的数据的不同方法的性能 [英] Performance of different approaches to time-based data

查看:90
本文介绍了基于时间的数据的不同方法的性能的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我是根据PerformanceDBA在此中的要求提出的该查询的另一个答案:

I'm asking this in the context of PerformanceDBA's claim in this answer to another question that this query:

SELECT  ProductId,
        Description
    FROM  Product       p,
          ProductStatus ps
    WHERE p.ProductId = ps.ProductId  -- Join
    AND   StatusCode  = 2             -- Request
    AND   DateTime    = (             -- Current Status on the left ...
        SELECT MAX(DateTime)          -- Current Status row for outer Product
            FROM  ProductStatus ps_inner
            WHERE p.ProductId = ps_inner.ProductId
            )

使用仅保存有效(开始)日期(随时间变化的状态)的ProductStatus表将胜过此查询:

using a ProductStatus table that holds only an effective (start) date for a status that changes over time, will outperform this query:

SELECT  ProductId,
        Description
    FROM  Product       p,
          ProductStatus ps
    WHERE p.ProductId = ps.ProductId  -- Join
    AND   StatusCode  = 2             -- Request
    AND   getdate() BETWEEN DateFrom AND Dateto

使用同时包含状态开始日期和结束日期的ProductStatus表。

using a ProductStatus table that holds both a start and an end date for the status.

虽然我接受第一种方法优于第二种方法的其他说法,但是我希望第二种方法更快(根据我在Oracle上的经验仅),因为它仅过滤数据而不是执行其他子查询并与之比较。

While I accept the other claims made for the first approach being better than the second, I would however expect the second approach to be faster (based on my experience with Oracle only) because it merely filters the data rather than performing an additional subquery and comparing with it.

我想知道Sybase或SQL Server如何处理这些查询,以及一些简单测试中的相对性能。

I'd like to know how Sybase or SQL Server would process these queries, and what the relative performance is in some simple tests.

推荐答案

一方面,打开一个新问题很好。但是,另一方面,通过提取一个查询并询问其执行速度是否更快,失去了上一个问题的上下文,因此新问题过于孤立。如您所知,管理数据库,管理资源(内存/缓存,磁盘,CPU周期),管理使用这些资源的代码(好坏)都是全部内容。性能是一个交易游戏,没有免费的东西。

On the one hand, it is good that you have opened a new question. But on the hand, by extracting one query and asking if it performs faster, loses the context of the previous question, the new question is too isolated. As I am sure you know, administering a database, managing resources (memory/cache, disk, CPU cycles), managing code (good or poor) that uses those resources, are all part of the whole picture. Performance is a trading game, nothing is free.


  1. 我遇到的最重要的问题是EndDate列的重复,很容易得出。重复的列等于更新异常。斯米尔金曼(Smirkingman)提供了经典的例子:有些查询将得到一个结果,而其他查询将得到另一个结果。大型组织根本无法接受。或在银行中(至少在发达国家中)对数据进行审核和保护。您已经违反了基本的规范化规则,并且需要支付罚款。

  1. The foremost issue I had, was the duplication of the EndDate column, which is easily derived. Duplicated columns equals Update Anomalies. Smirkingman has provided the classic example: some queries will get one result and other queries will get the other. That is simply not acceptable is large organisations; or in banks (at least in developed countries) where the data is audited and protected. You've broken a basic Normalisation rule, and there are penalties to be paid.


  • 更新Anomail;两个版本(已经详细介绍)。审核员可能不会通过系统。

  • Update Anomailes; two versions (already detailed). Auditors may not pass the system.

表大小

在任何大表中,这都是一个问题,尤其是在时间序列或时间上数据,列数少,行数大。因此,有人会说磁盘空间便宜。是的,性病也是如此。重要的是它的用途是什么,以及它的使用情况如何。

Table Size
In any large table it is a problem, and especially in time series or temporal data, where the number of columns are small, and the number of rows is huge. So what, some will say, disk space is cheap. Yeah, so are STDs. What matters is what it is used for, and how well one takes care of it.


  • 磁盘空间

    在PC上可能很便宜,但在生产服务器中则不是。基本上,您已经将行大小增加了62%(13加8等于21),因此表大小也增加了。在当前分配给我的银行中,拥有数据的每个部门都按以下方式收费,基于SAN的存储就足够了。数字是每月每GB(这不是高端澳大利亚银行):

  • Disk space
    May be cheap on a PC, but in a production server it is not. Basically you have added 62% to the row size (13 plus 8 equals 21) and therefore the table size. At the bank I am currently assigned, each department that owns the data is charged as follows, SAN-based storage is all there is. Figures are for per GB per Month (this is not a high end Aussie bank):

$ 1.05的RAID5 Unmirrored

(我们知道它很慢,但价格便宜,只是不要在上面放上重要信息,因为如果它损坏了,在新磁盘热插入或冷交换后,它需要几天的时间才能重新同步自身。)

$1.05 for RAID5 Unmirrored
(we know it is slow, but it is cheap, just do not put important info on it, cause if it breaks, after the new disk is hot or cold-swapped in, it takes days for it to re-synch itself.)

$ 2.10

在SAN中,即。

$2.10 for RAID5 Mirrored
In the SAN, that is.

$ 4.40 for RAID1 + 0

生产数据,备份的事务日志和夜间数据库转储的最小值。

$4.40 for RAID1+0
Minimum for Production data, transaction logs backed up, and nightly database dumps.

$ 9.80 for RAID1 + 0 Replicated

到另一个防弹站点上的相同SAN布局。数分钟即可完成生产切换;交易损失几乎为零。

$9.80 for RAID1+0 Replicated
To an identical SAN Layout at another, bomb proof, site. Production cut-over in minutes; almost zero transaction loss.

内存/高速缓存

好​​的,Oracle没有它,但是严重的银行业数据库确实具有高速缓存,并且可以对其进行管理。给定任何特定的缓存大小,只有62%的行可以容纳相同的缓存大小。

Memory/Cache
Ok, Oracle does not have it but the serious banking dbs do have caches, and they are managed. Given any specific cache size, only 62% of the rows will fit into the same cache size.

Logical&物理I / O

意味着读取表的I / O增加了50%;流式传输到缓存和磁盘读取。

Logical & Physical I/O
Which means 50% more I/O to read the table; both streaming into cache and disk reads.

因此,查询执行的是更好还是孤立性更差是一个学术问题。在上述情况下,比较慢,每次访问的性能始终差62%。它正在影响服务器上的所有其他用户。大多数DBA都不会在乎(我当然不会在意)子查询表单的执行速度是否是速度的一半,因为它们的好处与审核接受有关,而不仅仅是代码性能。

Therefore, whether the query performs better or worse in isolation, is an academic issue. In the context of the above, the table is slow, and performing 62% worse, all the time, on every access. And it is affecting every other user on the server. Most DBAs will not care (I certainly wouldn't) if the subquery form performs at half the speed, because their bonus is tied to audit acceptance, not just code performance.


  • 此外,它的另一个好处是,由于更新异常,不必重新访问代码并修复交易。

  • Besides, there is the added benefit of never having to revisit code, and fix up transactions due to Update Anomalies.

交易的更新点较少,因此交易点较小;

And the transactions have less points to update, so they are smaller; less blocking locks, etc.

同意,注释中的讨论很困难。在我的答复中,我详细说明了两个子查询。有一个误解:您在谈论这个子查询(在WHERE子句中,是 table子查询),而我在谈论另一个子查询(在列列表中是 scalar子查询

Agreed, that discussion in the Comments are difficult. In my Answer, I have detailed and explained two subqueries. There was a misunderstanding: you were talking about this subquery (in the WHERE clause, a table subquery) and I was talking about the other subquery (in the column list, a scalar subquery) when I said it performs as fast or faster. Now that that has been cleared up, I cannot say that the first query above (subquery in the WHERE clause, a table) will perform as fast as the second query (with the duplicated column); the first has to perform 3 scans, where the second only performs 2 scans. (I dare say the second will table scan though.)

重点是,除了隔离问题之外,这还不是一个公平的比较,我对此发表了评论。标量子查询。我不建议3扫描查询比2扫描查询快或快。

The point is, in addition to the isolation issue, it is not a fair comparison, I made the comment about scalar subqueries. I would not suggest than a 3-scan query is as fast or faster than a 2-scan query.

我对3扫描表子查询(其中我在这里引用)需要在整个上下文中使用(无论是在toto中还是在上面)。我并没有退缩。

The statement I made about the 3-scan table subquery (which I quote here) needs to be taken in the full context (either that post in toto, or the above). I am not backing away from it.


这是普通的SQL,是使用SQL引擎(关系型)的功能的子查询。设置处理。这是一种正确的方法,没有什么比它快的了,而其他任何方法都会更慢。任何报告工具都只需单击几下,无需键入即可生成此代码。

我花了半生时间删除了非法的替代品例如重复的列(以性能问题为准),而创建者高喊口头禅,则表的速度很慢,因此他们降低了性能的规格化。结果在我开始之前是可以预见的,它的大小是表的一半,而总体的运行速度是它的两倍。 Times Series是这里最常见的问题(链接链接到另一个问题;链接到另一个问题),但是想象一下银行数据库中的问题:每日 OpeningExposure ClosingExposure 每个安全性每个持有每个 UnitTrust 每个投资组合

I spend half my life removing Illegal alternatives such as duplicated columns, which are predicated on the issue of performance, with the creators chanting the mantra the the table is slow, so they have "denormalised for performance". The result, predictable before I start, is a table half the size, which performs twice as fast overall. The Times Series is the most common question here (the link links to another question; which links to another), but imagine the problem in a banking database: daily OpeningExposure and ClosingExposureper Securityper HoldingperUnitTrustperPortfolio.

但是让我回答一个尚未解决的问题被问到。这种互动是正常的,与内部开发团队合作时并不罕见。它每月至少出现一次。一个崩溃的热门开发人员已经使用带有重复列的表编写并测试了他的代码,它运行了,现在停滞了,因为我不会将其放入数据库中。

But let me answer a question that has not been asked. This sort of interaction is normal, not uncommon when working with in-house development teams; it comes up at least once a month. A crash hot developer has already written and tested his code, using a table with a duplicated column, it flies, and now it is stalled because I won't put it in the db.

不,我将在整个系统的上下文中对其进行测试,并且:

No, I will test it within the context of the whole system and:


  • 一半的时间,该表不包含EndDate列,因为现在在1秒内执行大约半秒的查询没有什么大不了的。

  • half the time, the table goes in without the EndDate column because there is no big deal about a half second query now performing in one second.

另一半时间,[table subquery]性能不可接受,因此我实现了一个布尔(位)指示符来标识 IsCurrent 。这比重复的列要好得多,并且可以提供2扫描的速度。

The other half of the time, the [table subquery] performance is not acceptable, so I implement a boolean (bit) indicator to identify IsCurrent. That is much better than a duplicated column, and provides 2-scan speeds.

在一百万年之内,您不会让我复制列了。将表大小增加62%;在完整的多用户环境中将表降低 的速度达62%;并可能导致审核失败。而且我不是员工,我没有奖金。

Not in a million years will you get me duplicating a column; adding 62% to the table size; slowing the table down in the full multi-user context by 62%; and risk failing an Audit. And I am not an employee, I do not get a bonus.

现在值得进行测试:使用重复列查询与使用<$ c $查询c> IsCurrent 指标,可以在整体资源使用的整个背景下使用。

Now that would be worth testing: query with a duplicated column vs query with a IsCurrent indicator, in the full context of overall resource use.

Smirkingman提出了一个很好的观点。我会清楚地重申一下,这样它就不会碎片化,然后一个或另一个碎片受到攻击。请不要分手:

Smirkingman has brought up a good point. And I will restate it clearly, so that it does not get fragmented and then one or the other fragment gets attacked. Please do not break this up:

关系数据库,

由经验丰富的关系建模者标准化,将
设为true普通格式


(没有更新异常;没有重复的列),

具有完全的关系合规性

(IDEF1X,尤其与最小化 Id 主键有关,因此不会削弱关系引擎的功能)

将产生更多,更小的表,一个较小的数据库,

的索引更少,

的连接数更少


(是的,更多的表,但更少的连接),

,它将胜过任何破坏任何规则的事情

在相同的硬件上,
企业数据库平台


(不包括免费软件,MS,Oracle;但是请不要让这种情况阻止您)

在生产OLTP的整个使用环境中使用

至少一个顺序幅度,

,并且使用起来更容易

和进行更改


(无需重构)。

A Relational Database,
Normalised by an experienced Relational modeller, to true Fifth Normal Form

(no Update Anomalies; no duplicated columns),
with full Relational Compliance
(IDEF1X, particularly relating to minimisation of Id Primary Keys; and thus not crippling the power of the Relational engine)
will result in more, smaller tables, a smaller database,
with fewer Indices,
requiring fewer joins

(that's right, more tables but fewer joins),
and it will out-perform anything that breaks any of those rules
on the same hardware, and enterprise db platform

(excludes freeware, MS, Oracle; but don't let that stop you),
in the full context of Production OLTP use
by at least one order of magnitude,
and it will be much easier to use
and to change

(never need "refactoring").

我已经完成了至少80次。如果我自己做,两个数量级并不少见,而不是为其他人提供框架。

I have done this at least 80 times. Two orders of magnitude is not uncommon, if I do it myself, rather than providing the framework for someone else to do it.

我(不是与我一起工作的人或付钱给我的人)都不会在乎一个查询将如何做。

Neither I, not the people I work with or who pay me, care what one query will do in isolation.

这篇关于基于时间的数据的不同方法的性能的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆