通过Join-Where-Group通过选择查询来正确索引,避免使用临时索引;使用文件排序 [英] Correct indexing by Join-Where-Group By select queries avoiding Using temporary; Using filesort

查看:84
本文介绍了通过Join-Where-Group通过选择查询来正确索引,避免使用临时索引;使用文件排序的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我已经搜索了很多有关下述情况的解决方案,但不幸的是,我还没有找到类似的情况.

I've searched a lot about a solution of the described below case, but unfortunately I haven't found a similar case.

我有以下情况: (作为新用户,该网站拒绝了我的图片,但是我可以通过邮件发送它.下面是它的文字表示)

I have the following scenario: (as a new user the site rejected my picture, but I can send it via mail. Below is a textual representation of it)

Table 1 "swap_plan"          Table 2 "cell"
ClusterName | SiteID         SiteID    | Cell      |  Time       | Counter
-----------------------      ---------------------------------------------
Cluster A   | SiteID A1      SiteID A1 | Cell A1-1 | day1        | 5
Cluster A   | SiteID A2      SiteID A1 | Cell A1-1 | day2        | 3
Cluster A   | SiteID A3      SiteID A1 | Cell A1-1 | day3        | 6
Cluster A   | SiteID A4      SiteID A1 | Cell A1-2 | day1        | 6
Cluster A   | SiteID A5      SiteID A1 | Cell A1-2 | day2        | 2
Cluster A   | SiteID A6      SiteID A1 | Cell A1-2 | day3        | 9
.......................      ..............................................
Cluster B   | .........      ..............................................

(Where No 1)      (ON Clause "SiteID")            (Where No 2)    Sum(Counter)

我必须显示一些性能指标(表2"cell"中的"Counter"),随时间(表2"cell"中的"Time")和群集(表1"swap_plan"中的"ClusterName")汇总的性能指标.

I have to display some performance indicators ("Counter" from table 2 "cell"), aggregated over time ("Time" from table 2 "cell") and cluster ("ClusterName" from table 1 "swap_plan").

通过两个表"SiteID"的公共列完成连接.请注意,在表2单元"中,每个SiteID均包含3个不同的对象(单元").因此,实际上我为每个单元格执行了"Counter"的SUM().

The join is done via the common column for both tables "SiteID". Please, note, that in Table 2 "cell" each SiteID consists of 3 different objects ("Cell"). So, in fact I do SUM() of "Counter" for each Cell.

查询如下:

SELECT ClusterName,Time,SUM(counter)
FROM cell
INNER JOIN swap_plan ON swap_plan.Siteid = cell.Siteid
WHERE ClusterName='Cluster A' AND Time>=day1 AND Time<=day2
GROUP BY Time

列类型如下:

表1交换计划":

  1. 群集名称-CHAR(30)
  2. SiteID-VARCHAR(10)

表2单元格":

  1. SiteID-VARCHAR(10)
  2. 时间-DATETIME
  3. 计数器-INT

说明"显示如下:

table          type    key           key_len      ref               rows  Extra

swap_plan      ref     Index 1       30           const             31    Using where; Using index; Using temporary; Using filesort
cell           ref     Index_siteid  13           swap_plan.SiteID  368   Using where

使用的索引如下:

swap_plan:索引1(1.ClusterName和2.SiteID)

swap_plan: Index 1 (1. ClusterName and 2. SiteID)

单元格:Index_siteid(SiteID)

cell: Index_siteid (SiteID)

优化器在其中查找的行数很少,这很好:

The number of rows, in which the optimizer looks is rathet low, which is good:

交换计划:6066中有31个,单元格:660万中有368个.

swap_plan: 31 out of 6066 and cell: 368 out of 6.6 mil.

我的问题是这些使用临时;使用文件排序".据我了解,这来自于分组依据所需的排序(如果删除它,则不会按照说明执行这些过程).我发现为了避免它们,您需要在分组依据的列上有一个索引.我有一个特殊的索引,仅包含时间"列,但是即使有提示使用索引用于GROUP BY()",也不会使用该索引.

My problem is these "Using temporary; Using filesort". As far as I understand this comes from the sorting needed for Group By (If I remove it, these processes are not executed according to Explain). I found that in order to avoid them you need to have an index on the columns by which you group by. I have a special index including only "Time" column, but this one is not used, even with a hint "USE INDEX FOR GROUP BY ()".

结果,我的查询运行速度不够快-大约需要15秒(例如15个SiteID和10个日期),因此我需要将持续时间至少缩短一半.

As a result my query runs not sufficiently fast - it takes about 15 seconds (for let's say 15 SiteIDs and 10 dates) and I need to reduce this duration to at least half of it.

我的主要问题是:

  • 完全可以删除使用临时;使用文件排序"或 减少执行它们所需的时间? (我试图增加 读取缓冲区大小为16MB,无效)
  • 在JOIN情况下,当我在WHERE子句中按不同表中的2列进行过滤,而在ON子句中按第三列进行过滤时,我需要哪种索引定义
  • 我可以应用哪种类型的分组优化(索引等)?
  • Is is possible at all to remove "Using temporary; Using filesort" or to reduce the needed time for their execution? (I tried to increase Read Buffer Size to 16MB, without effect)
  • What kind of index definitions I need in JOIN situations, when in WHERE clause I filter by 2 columns in different tables and in ON clause I filter by a 3rd column
  • What kind of Group By optimization I can apply (indexing, etc.)?

非常感谢您!

推荐答案

我会这样写查询:

SELECT c.time
     , SUM(c.counter)
     , MAX(p.clustername) AS clustername
  FROM cell c

  JOIN swap_plan p
    ON p.siteid      = c.siteid
   AND p.clustername = 'Cluster A'

 WHERE c.time  >=  'day1'
   AND c.time  <=  'day2'
 GROUP
    BY c.time

我一定要在cell上有一个索引,并以time作为前导列.

I'd be sure to have an index on cell with time as the leading column.

MySQL可以使用相同的索引来满足范围谓词(在WHERE子句中),并满足GROUP BY,而无需执行使用文件排序"操作.

MySQL can use the same index to satisfy the range predicate (in the WHERE clause), and to satisfy the GROUP BY without a "Using filesort" operation.

... ON cell (time)

根据列的大小,覆盖索引可能会提供最佳性能.覆盖索引包含查询中引用的表中的所有列,因此可以完全从索引页面满足查询,而无需查找基础表中的页面.

Depending on the sizes of the columns, a covering index might give optimal performance. A covering index includes all of the columns from the table that are referenced in the query, so the query can be satisfied entirely from index pages without lookup to pages in the underlying table.

... ON cell (time, siteid, counter)

对于swap_plan上的索引,我有一个以site_id作为前导列的索引,并且包括clustername列,其中之一:

For the index on swap_plan, I'd have an index with site_id as the leading column, and including the clustername column, either of:

... ON swap_plan (clustername, site_id)

... ON swap_plan (site_id, clustername)

这两个列的组合可能会有一个UNIQUE约束,即,对于给定的clusternamesite_id的值将是不同的. (如果不是这种情况,并且同一(site_id,clustername)元组出现多次,则可能总计counter的合计值会被夸大.

Looks likely there is going to be a UNIQUE constraint on the combination of those two columns, i.e. the values of site_id will be distinct for a given clustername. (If that isn't the case, and the same (site_id,clustername) tuple appears multiple times, there's potential for aggregate total of counter to be inflated.

我会寻找EXPLAIN输出,以显示c.siteid值和簇名称的const(字面"Cluster A")值来显示对swap_plan表的引用".

I'd be looking for the EXPLAIN output to show a 'ref' lookup to swap_plan table from the value of c.siteid and const (literal 'Cluster A') value for clustername.

对于具有31行和368行的表,我们不会在最佳执行计划和可怕的执行计划之间看到性能(经过的时间)的显着差异.

With tables sized at 31 rows and 368 rows, we aren't going to see a significant difference in performance (elapsed time) between an optimal execution plan and a horrible execution plan.

当两个表中的任何一个都可扩展到数百万行时,差异就会变得明显.执行计划的优化器选择受每个表的统计信息(大小,行数,列基数)的影响,因此执行计划可能会随着表大小的增加而改变.

When either of the tables scales up to millions of rows, that's when the differences will become apparent. The optimizers choice of execution plan is influenced by statistics (size, number of rows, column cardinality) of each table, so the execution plan could change with an increase in table sizes.

这篇关于通过Join-Where-Group通过选择查询来正确索引,避免使用临时索引;使用文件排序的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆