Very simple AVG() aggregation query on MySQL server takes ridiculously long time

Problem description

I am using MySQL server via an Amazon cloud service, with default settings. The table involved, mytable, is of InnoDB type and has about 1 billion rows. The query is:

select count(*), avg(`01`) from mytable where `date` = "2017-11-01";

It takes almost 10 minutes to execute. I have an index on date. The EXPLAIN of this query is:

+----+-------------+---------------+------+---------------+------+---------+-------+---------+-------+
| id | select_type | table         | type | possible_keys | key  | key_len | ref   | rows    | Extra |
+----+-------------+---------------+------+---------------+------+---------+-------+---------+-------+
|  1 | SIMPLE      | mytable       | ref  | date          | date | 3       | const | 1411576 | NULL  |
+----+-------------+---------------+------+---------------+------+---------+-------+---------+-------+

The indexes from this table are:

+---------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table         | Non_unique | Key_name  | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| mytable       |          0 | PRIMARY   |            1 | ESI         | A         |    60398679 |     NULL | NULL   |      | BTREE      |         |               |
| mytable       |          0 | PRIMARY   |            2 | date        | A         |  1026777555 |     NULL | NULL   |      | BTREE      |         |               |
| mytable       |          1 | lse_cd    |            1 | lse_cd      | A         |     1919210 |     NULL | NULL   | YES  | BTREE      |         |               |
| mytable       |          1 | zone      |            1 | zone        | A         |      732366 |     NULL | NULL   | YES  | BTREE      |         |               |
| mytable       |          1 | date      |            1 | date        | A         |    85564796 |     NULL | NULL   |      | BTREE      |         |               |
| mytable       |          1 | ESI_index |            1 | ESI         | A         |     6937686 |     NULL | NULL   |      | BTREE      |         |               |
+---------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+

If I remove AVG():

select count(*) from mytable where `date` = "2017-11-01";

It only takes 0.15 sec to return the count. The count for this specific query is 692792; the counts are similar for other dates.

I don't have an index on 01. Is that an issue? Why does AVG() take so long to compute? There must be something I didn't do properly.

Any suggestion is appreciated!

Solution

To count the number of rows with a specific date, MySQL has to locate that value in the index (which is pretty fast; after all, that is what indexes are made for) and then read the subsequent entries of the index until it finds the next date. Each entry in an InnoDB secondary index also stores the primary key columns, so, depending on the datatype of esi, this adds up to reading a few MB of data to count your 700k rows. Reading a few MB does not take much time (and that data might even already be cached in the buffer pool, depending on how often you use the index).

To calculate the average for a column that is not included in the index, MySQL will, again, use the index to find all rows for that date (the same as before). But additionally, for every row it finds, it has to read the actual table data for that row, which means using the primary key to locate the row, reading some bytes, and repeating this 700k times. This "random access" is a lot slower than the sequential read in the first case. (It is made worse by the fact that "some bytes" is really a whole page of innodb_page_size (16KB by default), so in the worst case you may have to read up to 700k * 16KB = about 11GB, compared to "some MB" for count(*); and depending on your memory configuration, some of this data might not be cached and has to be read from disk.)
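To put rough numbers on this for your own server, the statements below can help. This is only a sketch: the variables and status counters are standard InnoDB ones, but the 700000 figure is simply the row count from the question, and the worst case assumes every matching row sits on a different page.

```sql
-- Page size used by InnoDB (16384 bytes by default).
SHOW VARIABLES LIKE 'innodb_page_size';

-- Memory available for caching table and index pages.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Worst case from the paragraph above: one full page read per matching row.
-- Comes out at roughly 10.7 GiB, in line with the "about 11GB" estimate.
SELECT 700000 * 16384 / POW(1024, 3) AS worst_case_gib;

-- Innodb_buffer_pool_read_requests counts logical page reads;
-- Innodb_buffer_pool_reads counts those that had to go to disk.
-- Compare the values before and after running the slow query.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
```

If the buffer pool is much smaller than the worst-case figure, a large share of those page reads will probably hit the disk.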

A solution to this is to include all used columns in the index (a "covering index"), e.g. create an index on date, 01. Then MySQL does not need to access the table itself and can proceed, similar to the first case, by just reading the index. The index will be a bit larger, so MySQL will need to read "some more MB" (and compute the average), but it should still be a matter of seconds.
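A minimal sketch of such a covering index, using the table and column names from the question; the index name idx_date_01 is made up for illustration. On a table with a billion rows, creating it will take a while and needs additional disk space.

```sql
-- Hypothetical index name; `date` and `01` are the columns used by the query.
CREATE INDEX idx_date_01 ON mytable (`date`, `01`);

-- Re-check the plan. If the new index covers the query, the Extra column
-- of EXPLAIN should report "Using index" instead of NULL.
EXPLAIN
SELECT COUNT(*), AVG(`01`)
FROM mytable
WHERE `date` = '2017-11-01';
```

Once this index exists, the existing single-column index on date is probably redundant, since the new index starts with the same column, but it is worth checking your other queries before dropping it.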

In the comments, you mentioned that you need to calculate the average over 24 columns. If you want to calculate the avg for several columns at the same time, you would need a covering index on all of them, e.g. date, 01, 02, ..., 24, to prevent table access. Be aware that an index that contains all columns requires as much storage space as the table itself (and it will take a long time to create such an index), so whether it is worth those resources depends on how important this query is.

To stay within the MySQL limit of 16 columns per index, you could split it into two indexes (and two queries). Create e.g. the indexes date, 01, .., 12 and date, 13, .., 24, then use:

select * from (select `date`, avg(`01`), ..., avg(`12`) 
               from mytable where `date` = ...) as part1
cross join    (select avg(`13`), ..., avg(`24`) 
               from mytable where `date` = ...) as part2;

Make sure to document this well, as there is no obvious reason to write the query this way, but it might be worth it.
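The two indexes for this split could be created roughly as follows. The index names are hypothetical, and the column lists assume that the 24 value columns are named `01` through `24`, as the answer suggests; adjust them to your actual schema.

```sql
-- Hypothetical index names. Each index has 13 columns (date plus 12 value
-- columns), which stays below the limit of 16 columns per index.
CREATE INDEX idx_date_01_12 ON mytable
  (`date`, `01`, `02`, `03`, `04`, `05`, `06`, `07`, `08`, `09`, `10`, `11`, `12`);

CREATE INDEX idx_date_13_24 ON mytable
  (`date`, `13`, `14`, `15`, `16`, `17`, `18`, `19`, `20`, `21`, `22`, `23`, `24`);
```

Each half of the cross join can then be answered from one of these indexes without touching the table data.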

If you only ever average over a single column at a time, you could add 24 separate indexes (on date, 01; date, 02; ...). In total, they will require even more space, but each query might be a little bit faster (as the indexes are smaller individually). The buffer pool might still favour the full index, though, depending on factors like usage patterns and memory configuration, so you may have to test it.

Since date is part of your primary key, you could also consider changing the primary key to date, esi. If you find the dates by the primary key, you would not need an additional step to access the table data (as you already access the table), so the behaviour would be similar to the covering index. But this is a significant change to your table and can affect all other queries (that e.g. use esi to locate rows), so it has to be considered carefully.
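As a sketch, the primary key change could look like this. It assumes the existing ESI_index (on ESI alone) is kept so that lookups by ESI still have an index to use. The statement rebuilds the whole table, so on a billion rows it is best tried on a copy first.

```sql
-- Swap the column order of the primary key from (ESI, date) to (date, ESI).
-- InnoDB stores rows in primary key order, so all rows for one date
-- end up next to each other.
ALTER TABLE mytable
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (`date`, `ESI`);
```

With this key, the separate secondary index on date would most likely no longer be needed, because the primary key itself starts with date.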

As you mentioned, another option would be to build a summary table where you store precalculated values, especially if you do not add or modify rows for past dates (or can keep them up-to-date with a trigger).
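A summary table along these lines might look as follows. The table name, the column names row_count and avg_01, and the column types are all assumptions made for illustration; only mytable, date and `01` come from the question.

```sql
-- Hypothetical summary table: one row per date.
CREATE TABLE mytable_daily_summary (
  `date`    DATE   NOT NULL,
  row_count BIGINT NOT NULL,
  avg_01    DOUBLE NULL,
  PRIMARY KEY (`date`)
) ENGINE=InnoDB;

-- Compute (or recompute) the summary row for one date.
-- REPLACE overwrites an existing row with the same primary key.
REPLACE INTO mytable_daily_summary (`date`, row_count, avg_01)
SELECT `date`, COUNT(*), AVG(`01`)
FROM mytable
WHERE `date` = '2017-11-01'
GROUP BY `date`;

-- Reporting queries then read a single small row.
SELECT row_count, avg_01
FROM mytable_daily_summary
WHERE `date` = '2017-11-01';
```

Filling the summary still runs the expensive query once per date unless a covering index exists, but that cost is paid once rather than on every report.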
