Why are joins bad when considering scalability?


Question

Why are joins bad or 'slow'? I know I have heard this more than once. I found this quote:

The problem is joins are relatively slow, especially over very large data sets, and if they are slow your website is slow. It takes a long time to get all those separate bits of information off disk and put them all together again.

Source

I always thought they were fast, especially when looking up a PK. Why are they 'slow'?

Recommended answer

Scalability is all about pre-computing, spreading out, or paring down the repeated work to the bare essentials, in order to minimize resource use per work unit. To scale well, you avoid doing anything in volume that you don't need to, and you make sure the things you actually do are done as efficiently as possible.

In that context, of course joining two separate data sources is relatively slow, at least compared to not joining them, because it's work you need to do live at the point where the user requests it.

But remember the alternative is no longer having two separate pieces of data at all; you have to put the two disparate data points in the same record. You can't combine two different pieces of data without a consequence somewhere, so make sure you understand the trade-off.
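One way to see that consequence is a minimal sketch using Python's built-in sqlite3 module. The `customers`, `orders`, and `orders_flat` tables are hypothetical names made up for this illustration, not something from the question. Copying the customer name into every order row removes the join on reads, but a single rename now has to rewrite every copy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: the customer name lives in exactly one row.
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER NOT NULL,
                         total REAL NOT NULL);

    -- Denormalized (hypothetical): the name is copied into every order row,
    -- so reading an order with its customer name needs no join.
    CREATE TABLE orders_flat (id INTEGER PRIMARY KEY,
                              customer_id INTEGER NOT NULL,
                              customer_name TEXT NOT NULL,
                              total REAL NOT NULL);
""")
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
for order_id in range(1, 1001):
    conn.execute("INSERT INTO orders VALUES (?, 1, 9.99)", (order_id,))
    conn.execute("INSERT INTO orders_flat VALUES (?, 1, 'Ada', 9.99)", (order_id,))

# Renaming the customer touches one row in the normalized schema...
normalized_rows = conn.execute(
    "UPDATE customers SET name = 'Ada L.' WHERE id = 1"
).rowcount
# ...but every copied value in the denormalized one.
flat_rows = conn.execute(
    "UPDATE orders_flat SET customer_name = 'Ada L.' WHERE customer_id = 1"
).rowcount

print(normalized_rows, flat_rows)
```

The read side got cheaper only because the write side now does more work, and the copies can drift apart if any update path forgets one of them.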

The good news is modern relational databases are good at joins. You shouldn't really think of joins as slow with a good database used well. The database provides a number of scalability-friendly ways to take raw joins and make them much faster:

  • Join on a surrogate key (autonumber/identity column) rather than a natural key. This means smaller (and therefore faster) comparisons during the join operation.
  • Indexes
  • Materialized/indexed views (think of this as a pre-computed join or managed de-normalization)
  • Computed columns. You can use this to hash or otherwise pre-compute the key columns of a join, such that what would be a complicated comparison for a join is now much smaller and potentially pre-indexed.
  • Table partitions (helps with large data sets by spreading the load out to multiple disks, or limiting what might have been a table scan down to a partition scan)
  • OLAP (pre-computes results of certain kinds of queries/joins. It's not quite true, but you can think of this as generic denormalization)
  • Replication, Availability Groups, Log shipping, or other mechanisms to let multiple servers answer read queries for the same database, and thus scale your workload out among several servers.
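As a rough illustration of the first two items, the sketch below uses Python's built-in sqlite3 module as a stand-in for a full RDBMS. The table and index names are assumptions made up for the example. It joins on an integer surrogate key and prints SQLite's query plan before and after an index is added on the foreign-key column; the exact plan text varies by SQLite version, so none is shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Integer surrogate keys on both tables (hypothetical schema).
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER NOT NULL,
                         total REAL NOT NULL);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, f"customer-{i}") for i in range(1, 501)])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, (i % 500) + 1, float(i)) for i in range(1, 5001)])

QUERY = """
    SELECT c.name, o.total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    WHERE c.id = ?
"""

def query_plan(connection, sql, params):
    """Return the planner's description of each step as a list of strings."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # column 3 holds the human-readable detail

plan_before = query_plan(conn, QUERY, (42,))
rows_before = sorted(conn.execute(QUERY, (42,)).fetchall())

# Index the join column on the "many" side of the relationship.
conn.execute("CREATE INDEX idx_orders_customer_id ON orders (customer_id)")

plan_after = query_plan(conn, QUERY, (42,))
rows_after = sorted(conn.execute(QUERY, (42,)).fetchall())

print("before:", plan_before)
print("after: ", plan_after)
```

The query returns the same rows both times; only the access path changes, which is why looking at the plan is usually the first step before concluding that a join is inherently slow.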

I would go as far as saying the main reason relational databases exist at all is to allow you to do joins efficiently*. It's certainly not just to store structured data (you could do that with flat file constructs like csv or xml). A few of the options I listed will even let you completely build your join in advance, so the results are already done before you issue the query, just as if you had denormalized the data (admittedly at the cost of slower write operations).
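To make "build your join in advance" concrete, here is a small sketch, again with Python's sqlite3 module. SQLite has no materialized or indexed views, so this emulates the idea with a trigger-maintained table; that substitution, along with the `order_summary` table and trigger name, is an assumption of the example and not how SQL Server or Oracle implement the feature.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER NOT NULL REFERENCES customers(id),
                         total REAL NOT NULL);

    -- Hypothetical precomputed join: one row per order, customer name included.
    CREATE TABLE order_summary (order_id INTEGER PRIMARY KEY,
                                customer_name TEXT NOT NULL,
                                total REAL NOT NULL);

    -- Every insert into orders now also pays for the join, once, at write time.
    CREATE TRIGGER orders_after_insert AFTER INSERT ON orders
    BEGIN
        INSERT INTO order_summary (order_id, customer_name, total)
        SELECT NEW.id, c.name, NEW.total
        FROM customers AS c
        WHERE c.id = NEW.customer_id;
    END;
""")

conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (100, 1, 25.0)")

# The live join and the precomputed table should agree.
live_join = conn.execute("""
    SELECT o.id, c.name, o.total
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
""").fetchall()
precomputed = conn.execute(
    "SELECT order_id, customer_name, total FROM order_summary"
).fetchall()

print(live_join == precomputed)
```

This sketch only handles inserts; a real version would also need triggers for updates and deletes on both tables, which is exactly the bookkeeping a database-managed materialized view takes off your hands.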

If you have a slow join, you're probably not using your database correctly.

De-normalization should be done only after these other techniques have failed. And the only way you can truly judge "failure" is to set meaningful performance goals and measure against those goals. If you haven't measured, it's too soon to even think about de-normalization.
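A minimal way to put that into practice might look like the sketch below: time the join repeatedly and compare a percentile against a stated goal. The 50 ms target, the schema, and the `measure_query` helper are all assumptions made up for the illustration; real measurements should run against production-sized data on representative hardware.

```python
import sqlite3
import statistics
import time

def measure_query(connection, sql, params=(), runs=50):
    """Run the query `runs` times and return the per-run latencies in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        connection.execute(sql, params).fetchall()
        latencies.append(time.perf_counter() - start)
    return latencies

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER NOT NULL,
                         total REAL NOT NULL);
    CREATE INDEX idx_orders_customer_id ON orders (customer_id);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, f"customer-{i}") for i in range(1, 501)])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, (i % 500) + 1, float(i)) for i in range(1, 5001)])

TARGET_P95_SECONDS = 0.050  # assumed goal; choose one from your own requirements

latencies = measure_query(
    conn,
    "SELECT c.name, o.total FROM customers AS c "
    "JOIN orders AS o ON o.customer_id = c.id WHERE c.id = ?",
    (42,),
)
# With n=20, quantiles() returns 19 cut points; the last is the 95th percentile.
p95 = statistics.quantiles(latencies, n=20)[-1]
goal_met = p95 <= TARGET_P95_SECONDS

print(f"median={statistics.median(latencies):.6f}s p95={p95:.6f}s goal met: {goal_met}")
```

Only if a number like this misses the goal after indexing and the other techniques above does de-normalization become worth discussing.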

* That is, exist as an entity distinct from mere collections of tables. An additional reason for a real RDBMS is safe concurrent access.
