WHERE子句中的字段顺序是否会影响MySQL的性能? [英] Does the order of fields in a WHERE clause affect performance in MySQL?

查看:104
本文介绍了WHERE子句中的字段顺序是否会影响MySQL的性能?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在表中有两个索引字段-typeuserid(单个索引,不是复合索引).

I have two indexed fields in a table - type and userid (individual indexes, not a composite).

type的字段值非常有限(假设仅为0或1),因此50%的表记录具有相同的type.另一方面,userid值来自更大的集合,因此具有相同userid的记录数量很小.

types field values are very limited (let's say it is only 0 or 1), so 50% of table records have the same type. userid values, on the other hand, come from a much larger set, so the amount of records with the same userid is small.

这些查询中的任何一个都会比其他查询运行得更快吗?

Will any of these queries run faster than the other:

select * from table where type=1 and userid=5
select * from table where userid=5 and type=1

如果两个字段都没有索引,是否会改变行为?

Also if both fields were not indexed, would it change the behavior?

推荐答案

SQL被设计为一种声明性语言,而不是一种过程性语言.因此,查询优化器在确定如何应用它们时应该考虑where子句谓词的顺序.

SQL was designed to be a declarative language, not a procedural one. So the query optimizer should not consider the order of the where clause predicates in determining how to apply them.

我可能会过分简化有关SQL查询优化器的以下讨论.一年前,我按照这些思路写了东西(这真是太有趣了!).如果您真的想研究现代查询优化,请参阅Dan Tow的 SQL调整,来自O'Reilly.

I'm probably going to waaaay over-simplify the following discussion of an SQL query optimizer. I wrote one years ago, along these lines (it was tons of fun!). If you really want to dig into modern query optimization, see Dan Tow's SQL Tuning, from O'Reilly.

在简单的SQL查询优化器中,首先将SQL语句编译为关系代数操作的树.这些操作每个都将一个或多个表作为输入,并产生另一个表作为输出. Scan 是从数据库中读取表的顺序扫描. 排序生成一个排序表. Select 生成一个表,该表根据某些选择条件从另一个表中选择了行. Project 生成一个表,该表仅包含另一个表的某些列. 交叉产品包含两个表,并生成一个输出表,该输出表由它们的行的每个可能的配对组成.

In a simple SQL query optimizer, the SQL statement first gets compiled into a tree of relational algebra operations. These operations each take one or more tables as input and produce another table as output. Scan is a sequential scan that reads a table in from the database. Sort produces a sorted table. Select produces a table whose rows are selected from another table according to some selection condition. Project produces a table with only certain columns of another table. Cross Product takes two tables and produces an output table composed of every conceivable pairing of their rows.

令人困惑的是,SQL SELECT子句被编译成关系代数 Project ,而WHERE子句变成了关系代数 Select . FROM子句变成一个或多个 Joins ,每个将两个表引入并产生一个表.还有其他关系代数运算,涉及集合并集,交集,差和隶属关系,但让我们保持简单.

Confusingly, the SQL SELECT clause is compiled into a relational algebra Project, while the WHERE clause turns into a relational algebra Select. The FROM clause turns into one or more Joins, each taking two tables in and producing one table out. There are other relational algebra operations involving set union, intersection, difference, and membership, but let's keep this simple.

这棵树确实需要优化.例如,如果您有:

This tree really needs to be optimized. For example, if you have:

select E.name, D.name 
from Employee E, Department D 
where E.id = 123456 and E.dept_id = D.dept_id

在500个部门中有5,000名员工,执行未优化的树将盲目地产生一个雇员和一个部门的所有可能组合(一个跨产品),然后只选择 Select 所需的一种组合.员工的 Scan 将产生一个5,000条记录表,部门的 Scan 将产生一个500条记录表,这两个表的跨产品将产生一个2,500,000条记录表,而E.id上的 Select 将获取该2,500,000条记录表,并丢弃除一个记录之外的所有记录.

with 5,000 employees in 500 departments, executing an unoptimized tree will blindly produce all possible combinations of one Employee and one Department (a Cross Product) and then Select out just the one combination that was needed. The Scan of Employee will produce a 5,000 record table, the Scan of Department will produce a 500 record table, the Cross Product of those two tables will produce a 2,500,000 record table, and the Select on E.id will take that 2,500,000 record table and discard all but one, the record that was wanted.

[真正的查询处理器将尝试不在内存中实现所有这些中间表.]

[Real query processors will try not to materialize all of these intermediate tables in memory of course.]

因此,查询优化器遍历了树并应用了各种优化.一种是将每个 Select 分解为一连串的 Selects ,一个用于每个 Select 的顶级条件,条件和条件. -ed在一起. (这称为合取范式".)然后,将各个较小的 Selects 在树中四处移动,并与其他关系代数运算合并以形成更有效的运算.

So the query optimizer walks the tree and applies various optimizations. One is to break up each Select into a chain of Selects, one for each of the original Select's top level conditions, the ones and-ed together. (This is called "conjunctive normal form".) Then the individual smaller Selects are moved around in the tree and merged with other relational algebra operations to form more efficient ones.

在上面的示例中,优化程序首先将E.id = 123456上的 Select 推到昂贵的 Cross Product 操作下方.这意味着跨产品仅产生500行(该雇员和一个部门的每个组合一个).然后,顶级E.dept_id = D.dept_id的 Select 过滤掉499个不需要的行.不错.

In the above example, the optimizer first pushes the Select on E.id = 123456 down below the expensive Cross Product operation. This means the Cross Product just produces 500 rows (one for each combination of that employee and one department). Then the top level Select for E.dept_id = D.dept_id filters out the 499 unwanted rows. Not bad.

如果Employee ID字段上有索引,则优化程序可以将Employee的 Scan 与E.id = 123456上的 Select 结合起来以形成快速索引查找.这意味着从磁盘将只有一行雇员"行读入内存,而不是从5,000行.事情正在上升.

If there's an an index on Employee's id field, then the optimizer can combine the Scan of Employee with the Select on E.id = 123456 to form a fast index Lookup. This means that only one Employee row is read into memory from disk instead of 5,000. Things are looking up.

最后的主要优化方法是对E.dept_id = D.dept_id进行 Select 并将其与交叉产品组合.这将其转换为关系代数 Equijoin 运算.它本身并不能做很多事情.但是,如果Department.dept_id上有一个索引,则可以将提供 Equijoin 的Department的较低级别顺序 Scan 转换为非常快的 Lookup 索引. >我们一位员工的部门记录.

The final major optimization is to take the Select on E.dept_id = D.dept_id and combine it with the Cross Product. This turns it into a relational algebra Equijoin operation. This doesn't do much by itself. But if there's an index on Department.dept_id, then the lower level sequential Scan of Department feeding the Equijoin can be turned into a very fast index Lookup of our one employee's Department record.

较少的优化涉及下推 Project 操作.如果查询的顶层只需要E.name和D.name,而条件则需要E.id,E.dept_id和D.dept_id,则 Scan 操作不必与其他所有列一起构建中间表,从而在查询执行期间节省空间.我们已经将一个非常慢的查询变成了两个索引查询,而没有太多其他事情了.

Lesser optimizations involve pushing Project operations down. If the top level of your query just needs E.name and D.name, and the conditions need E.id, E.dept_id, and D.dept_id, then the Scan operations don't have to build intermediate tables with all the other columns, saving space during the query execution. We've turned a horribly slow query into two index lookups and not much else.

进一步了解原始问题,假设您已经:

Getting more towards the original question, let's say you've got:

select E.name 
from Employee E 
where E.age > 21 and E.state = 'Delaware'

未经优化的关系代数树在执行时会扫描5,000名员工,并产生特拉华州的126名年龄超过21岁的员工.查询优化器对数据库中的值也有一些粗略的了解.它可能知道E.state列包含公司所在的14个州,以及有关E.age分布的信息.因此,它首先查看是否对两个字段都建立了索引.如果是E.state,则使用该索引根据其最近计算的统计信息来挑选出查询处理器怀疑在特拉华州的少数雇员是有意义的.如果只有E.age,查询处理器可能会认为这不值得,因为96%的员工年龄在22岁以上.因此,如果对E.state进行了索引,则我们的查询处理器会破坏 Select 并将E.state ='Delaware'与 Scan 合并,以将其转化为效率更高的索引扫描.

The unoptimized relational algebra tree, when executed, would Scan in the 5,000 employees and produce, say, the 126 ones in Delaware who are older than 21. The query optimizer also has some rough idea of the values in the database. It might know that the E.state column has the 14 states that the company has locations in, and something about the E.age distributions. So first it sees if either field is indexed. If E.state is, it makes sense to use that index to just pick out the small number of employees the query processor suspects are in Delaware based on its last computed statistics. If only E.age is, the query processor likely decides that it's not worth it, since 96% of all employees are 22 and older. So if E.state is indexed, our query processor breaks the Select and merges the E.state = 'Delaware' with the Scan to turn it into a much more efficient Index Scan.

在此示例中,假设在E.state和E.age上没有索引.组合的 Select 操作在Employee的顺序扫描"之后进行.首先完成 Select 中的哪个条件是否有所不同?可能不是很多.查询处理器可能会将它们保留在SQL语句中的原始顺序中,或者可能更复杂一些,并查看预期的费用.从统计数据中,它会再次发现E.state =特拉华"条件应该具有更高的选择性,因此它将颠倒条件并首先执行该条件,因此只有126 E.age> 21个比较,而不是5,000个.或者它可能认识到字符串相等性比较比整数比较昂贵得多,并且不考虑顺序.

Let's say in this example that there are no indexes on E.state and E.age. The combined Select operation takes place after the sequential "Scan" of Employee. Does it make a difference which condition in the Select is done first? Probably not a great deal. The query processor might leave them in the original order in the SQL statement, or it might be a bit more sophisticated and look at the expected expense. From the statistics, it would again find that the E.state = 'Delaware' condition should be more highly selective, so it would reverse the conditions and do that first, so that there are only 126 E.age > 21 comparisons instead of 5,000. Or it might realize that string equality comparisons are much more expensive than integer compares and leave the order alone.

无论如何,这一切都是非常复杂的,您的句法条件顺序很难改变.除非您遇到实际的性能问题并且您的数据库供应商使用条件顺序作为提示,否则我不会担心.

At any rate, all this is very complex and your syntactic condition order is very unlikely to make a difference. I wouldn't worry about it unless you have a real performance problem and your database vendor uses the condition order as a hint.

这篇关于WHERE子句中的字段顺序是否会影响MySQL的性能?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆