SQL性能:使用OR时,使用OR的速度比IN慢 [英] SQL Performance: Using OR is slower than IN when using order by

查看:126
本文介绍了SQL性能:使用OR时,使用OR的速度比IN慢的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用MariaDB 10.0.21,并针对1200万行运行类似于以下查询的查询:

I am using MariaDB 10.0.21 and running a query similar to the following query against 12 Million Rows:

SELECT 
    `primary_key` 
FROM 
    `texas_parcels` 
WHERE 
    `zip_code`
         IN ('28461', '48227', '60411', '65802', '75215', '75440', '75773', '75783', '76501', '76502', '76504', '76511', '76513', '76518', '76519', '76520', '76522', '76525', '76527', '76528', '76530', '76537', '76539', '76541', '76542', '76543', '76548', '76549', '76550', '76556', '76567', '76571', '76574', '76577', '76578', '76642', '76704', '76853', '77418', '77434', '77474', '77833', '77835', '77836', '77845', '77853', '77879', '77964', '77975', '78002', '78003', '78006', '78013', '78028', '78056', '78064', '78070', '78114', '78123', '78130', '78132', '78133', '78154', '78155', '78359', '78382', '78602', '78605', '78606', '78607', '78608', '78609', '78610', '78611', '78612', '78613', '78614', '78615', '78616', '78617', '78619', '78620', '78621', '78623', '78624', '78626', '78628', '78629', '78632', '78633', '78634', '78636', '78638', '78639', '78640', '78641', '78642', '78643', '78644', '78645', '78648', '78650', '78652', '78653', '78654', '78655', '78656', '78657', '78659', '78660', '78662', '78663', '78664', '78665', '78666', '78669', '78672', '78676', '78681', '78699', '78701', '78702', '78703', '78704', '78705', '78717', '78719', '78721', '78722', '78723', '78724', '78725', '78726', '78727', '78728', '78729', '78730', '78731', '78732', '78733', '78734', '78735', '78736', '78737', '78738', '78739', '78741', '78744', '78745', '78746', '78747', '78748', '78749', '78750', '78751', '78752', '78753', '78754', '78756', '78757', '78758', '78759', '78828', '78934', '78940', '78941', '78942', '78945', '78946', '78947', '78948', '78953', '78954', '78956', '78957', '78963', '92536') 
ORDER BY 
    `timestamp_updated` ASC
LIMIT 1000;

我在(zip_code,timestamp_updated)上有一个复合索引,我在〜1.6秒内获得了结果.在下一个查询中,我仍在搜索相同的邮政编码,但我使用的是 OR 而不是 IN().

I have a compound index on (zip_code,timestamp_updated), I obtain results in ~1.6 Seconds. In the next query I am still seraching the same zipcodes but I am using OR instead of IN().

SELECT 
    `primary_key`
FROM 
    `texas_parcels` 
WHERE
(`zip_code` = '28461' OR `zip_code` = '48227' OR `zip_code` = '60411' OR `zip_code` = '65802' OR `zip_code` = '75215' OR `zip_code` = '75440' OR `zip_code` = '75773' OR `zip_code` = '75783' OR `zip_code` = '76501' OR `zip_code` = '76502' OR `zip_code` = '76504' OR `zip_code` = '76511' OR `zip_code` = '78957' OR `zip_code` = '78963' OR `zip_code` = '92536' OR `zip_code` = '76513' OR `zip_code` = '76518' OR `zip_code` = '76519' OR `zip_code` = '76520' OR `zip_code` = '76522' OR `zip_code` = '76525' OR `zip_code` = '76527' OR `zip_code` = '76528' OR `zip_code` = '76530' OR `zip_code` = '76537' OR `zip_code` = '76539' OR `zip_code` = '76541' OR `zip_code` = '76542' OR `zip_code` = '76543' OR `zip_code` = '76548' OR `zip_code` = '76549' OR `zip_code` = '76550' OR `zip_code` = '76556' OR `zip_code` = '76567' OR `zip_code` = '76571' OR `zip_code` = '76574' OR `zip_code` = '76577' OR `zip_code` = '76578' OR `zip_code` = '76642' OR `zip_code` = '76704' OR `zip_code` = '76853' OR `zip_code` = '77418' OR `zip_code` = '77434' OR `zip_code` = '77474' OR `zip_code` = '77833' OR `zip_code` = '77835' OR `zip_code` = '77836' OR `zip_code` = '77845' OR `zip_code` = '77853' OR `zip_code` = '77879' OR `zip_code` = '77964' OR `zip_code` = '77975' OR `zip_code` = '78002' OR `zip_code` = '78003' OR `zip_code` = '78006' OR `zip_code` = '78013' OR `zip_code` = '78028' OR `zip_code` = '78056' OR `zip_code` = '78064' OR `zip_code` = '78070' OR `zip_code` = '78114' OR `zip_code` = '78123' OR `zip_code` = '78130' OR `zip_code` = '78132' OR `zip_code` = '78133' OR `zip_code` = '78154' OR `zip_code` = '78155' OR `zip_code` = '78359' OR `zip_code` = '78382' OR `zip_code` = '78602' OR `zip_code` = '78605' OR `zip_code` = '78606' OR `zip_code` = '78607' OR `zip_code` = '78608' OR `zip_code` = '78609' OR `zip_code` = '78610' OR `zip_code` = '78611' OR `zip_code` = '78612' OR `zip_code` = '78613' OR `zip_code` = '78614' OR `zip_code` = '78615' OR `zip_code` = '78616' OR `zip_code` = '78617' OR `zip_code` = '78619' OR `zip_code` = '78620' OR `zip_code` = '78621' OR `zip_code` = '78623' OR `zip_code` = '78624' OR `zip_code` = '78626' OR `zip_code` = '78628' OR `zip_code` = '78629' OR `zip_code` = '78632' OR `zip_code` = '78633' OR `zip_code` = '78634' OR `zip_code` = '78636' OR `zip_code` = '78638' OR `zip_code` = '78639' OR `zip_code` = '78640' OR `zip_code` = '78641' OR `zip_code` = '78642' OR `zip_code` = '78643' OR `zip_code` = '78644' OR `zip_code` = '78645' OR `zip_code` = '78648' OR `zip_code` = '78650' OR `zip_code` = '78652' OR `zip_code` = '78653' OR `zip_code` = '78654' OR `zip_code` = '78655' OR `zip_code` = '78656' OR `zip_code` = '78657' OR `zip_code` = '78659' OR `zip_code` = '78660' OR `zip_code` = '78662' OR `zip_code` = '78663' OR `zip_code` = '78664' OR `zip_code` = '78665' OR `zip_code` = '78666' OR `zip_code` = '78669' OR `zip_code` = '78672' OR `zip_code` = '78676' OR `zip_code` = '78681' OR `zip_code` = '78699' OR `zip_code` = '78701' OR `zip_code` = '78702' OR `zip_code` = '78703' OR `zip_code` = '78704' OR `zip_code` = '78705' OR `zip_code` = '78717' OR `zip_code` = '78719' OR `zip_code` = '78721' OR `zip_code` = '78722' OR `zip_code` = '78723' OR `zip_code` = '78724' OR `zip_code` = '78725' OR `zip_code` = '78726' OR `zip_code` = '78727' OR `zip_code` = '78728' OR `zip_code` = '78729' OR `zip_code` = '78730' OR `zip_code` = '78731' OR `zip_code` = '78732' OR `zip_code` = '78733' OR `zip_code` = '78734' OR `zip_code` = '78735' OR `zip_code` = '78736' OR `zip_code` = '78737' OR `zip_code` = '78738' OR `zip_code` = '78739' OR `zip_code` = '78741' OR `zip_code` = '78744' OR `zip_code` = '78745' OR `zip_code` = '78746' OR `zip_code` = '78747' OR `zip_code` = '78748' OR `zip_code` = '78757' OR `zip_code` = '78758' OR `zip_code` = '78759' OR `zip_code` = '78828' OR `zip_code` = '78934' OR `zip_code` = '78940' OR `zip_code` = '78941' OR `zip_code` = '78942' OR `zip_code` = '78945' OR `zip_code` = '78946' OR `zip_code` = '78947' OR `zip_code` = '78948' OR `zip_code` = '78953' OR `zip_code` = '78954' OR `zip_code` = '78956')
ORDER BY 
    `timestamp_updated` ASC
LIMIT 1000;

第二个查询以相同的顺序在〜7.8秒中获得相同的结果.通过解释运行每个查询时,它们几乎相同,它们给我的rows金额略有不同.

This second query obtains the same results in the same order in ~7.8 Seconds. When running each query through an explain, they are nearly the same, they give me a slightly different rows amount.

id  select_type     table        type      possible_keys          key            key_len    ref        rows       filtered              Extra
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1     SIMPLE     TX_Property   range    Zip Code Search     Zip Code Search       15     (NULL)      2402699     99.88      Using where; Using index; Using filesort  
 2     SIMPLE     TX_Property   range    Zip Code Search     Zip Code Search       15     (NULL)      2321908     99.91      Using where; Using index; Using filesort 

对两个查询进行概要分析时,唯一的主要时间增加是Sorting Result时间,在第二个查询中,花费的时间最多为 7.2秒.

When Profiling the two queries the only major increase in time is the Sorting Result time, in the second query it took up to 7.2 Seconds.

我想我不明白的是,不同的运算符在定单方面如何产生如此巨大的差异,如果执行时间存在重大差异是否有意义?也许我不是很确定配置文件是如何工作的,实际上这只是执行where部分的时间,但是它只是以复杂的方式标记了吗?

I guess what I don't understand is how the different operator in the where could make such a huge difference when it comes to order, it would make sense if there was a major difference in time for executing? Maybe I am just not exactly sure how profiling works, and it actually is just the time of executing the where portion, but it is just labeled in a convoluted way?

我还想指出,当我在没有ORDER BY timestamp_updated ASC的情况下运行查询时,最上面的查询花费了〜0.106秒,而第二个查询花费了〜0.157秒.

I also wanted to note that when I ran the queries without the ORDER BY timestamp_updated ASC the top query took ~0.106 Seconds and the second query took ~0.157 Seconds.

推荐答案

删除ORDER BY的运行速度如此之快,因为它只能在1000行之后停止.多少行与该OR/IN相匹配?

Removing the ORDER BY runs so much faster because it can stop after only 1000 rows. How many rows match that OR/IN?

请注意,EXPLAINs说查询是Using index.这意味着您有一个覆盖"索引.那就是SELECT中的所有字段都在一个索引中.

Notice that the EXPLAINs say that the query is Using index. That means that you have a "covering" index. That is all the fields in the SELECT are in the one index.

在InnoDB中,每个辅助键都隐式包含PK,因此INDEX(zip_code, timestamp_updated)实际上是INDEX(zip_code, timestamp_updated, primaryKey)

In InnoDB, each secondary key implicitly includes the PK, so INDEX(zip_code, timestamp_updated) is effectively INDEX(zip_code, timestamp_updated, primaryKey)

索引不是很有效,因为您要进行两个非常重要的事情:(1)IN或OR,(2)ORDER BY.索引只能处理其中一个.您的索引允许它使用zip_code.它

The index is not very efficient since you have two non-trivial things going on: (1) IN or OR, (2) ORDER BY. Only one or the other can be handled by the index. Your index lets it use zip_code. It

  1. 在索引中找到与任何邮政编码匹配的行
  2. 收集时间戳和pk,将3列放入tmp表
  3. 排序
  4. 提供前1000个.
  1. finds the rows in the index that match any of those zipcodes,
  2. gathers the timestamp and pk, putting the 3 columns in a tmp table
  3. sorts
  4. delivers the first 1000.

相反,如果您说INDEX(timestamp_updated, zip_code),您仍然会有一个覆盖"索引,但是按照这种风格,该索引将(希望)避免对SORT的需要.哦,鉴于此,它可能能够在1000行之后停止.运作方式如下:

If, instead, you said INDEX(timestamp_updated, zip_code) you would still have a 'covering' index, but in this flavor, the index would (I hope) prevent the need for the SORT. Oh, given that, it might be able to stop after 1000 rows. Here's how it will work:

  1. 按时间戳顺序浏览索引.
  2. 检查每一行是否是其中的一个zip. (这里的测试可能以IN格式更快)
  3. 如果匹配,则发送行;如果是1000,则停止.
  1. Scan through the index in timestamp order.
  2. Check each row for being one of those zips. (Here the test might be faster in IN format)
  3. If match, deliver row; if 1000, stop.

但是等等...现在您处于1200万行的摆布位置. 如果带有这些zip的1000行出现得较早(旧时间戳记),它可能会很快停止. 如果您需要检查所有行以找到1000(或者甚至没有1000),则这是一次完整的索引扫描,并且索引的这种排列方式是错误的".

But wait... Now you are at the mercy of the 12M rows. If 1000 rows with those zips occur early (old timestamps), it can stop fast. If you need to check all the rows to find 1000 (or there aren't even 1000), then it is a full index scan, and this arrangement of the index is 'bad'.

如果同时给优化器INDEXes,则它会根据信息不足(值没有分配)来尽职地做出明智的选择,并且可能会选择更差的一个.

If you give the optimizer both INDEXes, it will dutifully make an intelligent choice based on inadequate information (no distribution of the values), and might pick the worse one.

您实际上需要一个二维索引.这样不存在. (好吧,也许Spatial可能会出错?)但是...

You effectively need a 2-dimensional index. Such don't exist. (Well, maybe Spatial could be kludged?) But...

PARTITION BY RANGE(timestamp)以及INDEX 起始和zip 可能的压缩效果更好.但是我怀疑优化器是否足够聪明,以至于如果它在第一个分区中发现1000行,就可以退出.如果没有1000个结果,它仍然会严重失败.

PARTITION BY RANGE(timestamp) together with the INDEX starting with zip might work better. But I doubt if the optimizer is smart enough to realize that if it found 1000 rows in the first partition it could quit. And it still fails badly if there aren't 1000 results.

PARTITION BY RANGE(zip)INDEX 以可能为时间戳的开始一起将无济于事,因为很多拉链不会做很多修剪.

PARTITION BY RANGE(zip) together with the INDEX starting with timestamp probably will not help, since that many zips won't do much pruning.

请为您的每次尝试提供EXPLAIN FORMAT=JSON SELECT...;.那里可能有一些微妙的线索来解释时间的宽泛变化.

Please provide EXPLAIN FORMAT=JSON SELECT...; for each of your attempts. There may be some subtle clues there to explain the wide time variations.

您每次计时运行两次吗? (否则,缓存可能会使结果变色.)

Did you run each timing twice? (Otherwise, caching may have colored the results.)

另一种方法

我不知道它的执行效果如何,但是这里有:

I do not know how well this will perform, but here goes:

SELECT  primary_key
    FROM  ( 
              ( SELECT  primary_key, timestamp_updated
                    FROM  texas_parcels
                    WHERE  zip_code = '28461'
                    ORDER BY  timestamp_updated
                    LIMIT  1000 
              )
            UNION  ALL (
                SELECT  primary_key, timestamp_updated
                    FROM  texas_parcels
                    WHERE  zip_code = '48227'
                    ORDER BY  timestamp_updated
                    LIMIT  1000 
                       )
            UNION  ALL (
                SELECT  primary_key, timestamp_updated
                    FROM  texas_parcels
                    WHERE  zip_code = '60411'
                    ORDER BY  timestamp_updated
                    LIMIT  1000 ) ... 
          ) x
    ORDER BY  timestamp_updated
    LIMIT  1000 

x似乎只有几十万行,而不是130万行.但是UNION有一些开销,等等.请注意每个子查询中和外部的LIMIT.如果您也需要OFFSET,它将变得更加棘手.

It seems like x will have only a few hundred thousand rows, not 1.3M. But UNION has some overhead, etc. Note the LIMIT in each subquery and on the outside. If you need OFFSET, too, it gets trickier.

这篇关于SQL性能:使用OR时,使用OR的速度比IN慢的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆