Very simple AVG() aggregation query on MySQL server takes ridiculously long time
Question
I am using a MySQL server via an Amazon cloud service, with default settings. The table involved, mytable, is of InnoDB type and has about 1 billion rows.

The query is:

select count(*), avg(`01`) from mytable where `date` = "2017-11-01";

It takes almost 10 minutes to execute. I have an index on date. The EXPLAIN of this query is:
+----+-------------+---------------+------+---------------+------+---------+-------+---------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+------+---------------+------+---------+-------+---------+-------+
| 1 | SIMPLE | mytable | ref | date | date | 3 | const | 1411576 | NULL |
+----+-------------+---------------+------+---------------+------+---------+-------+---------+-------+
The indexes on this table are:
+---------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| mytable | 0 | PRIMARY | 1 | ESI | A | 60398679 | NULL | NULL | | BTREE | | |
| mytable | 0 | PRIMARY | 2 | date | A | 1026777555 | NULL | NULL | | BTREE | | |
| mytable | 1 | lse_cd | 1 | lse_cd | A | 1919210 | NULL | NULL | YES | BTREE | | |
| mytable | 1 | zone | 1 | zone | A | 732366 | NULL | NULL | YES | BTREE | | |
| mytable | 1 | date | 1 | date | A | 85564796 | NULL | NULL | | BTREE | | |
| mytable | 1 | ESI_index | 1 | ESI | A | 6937686 | NULL | NULL | | BTREE | | |
+---------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
If I remove the AVG():

select count(*) from mytable where `date` = "2017-11-01";

it only takes 0.15 sec to return the count. The count for this specific query is 692792; the counts are similar for other dates.

I don't have an index on `01`. Is that the issue? Why does AVG() take so long to compute? There must be something I didn't do properly.

Any suggestion is appreciated!
Answer

To count the number of rows with a specific date, MySQL has to locate that value in the index (which is pretty fast; after all, that is what indexes are made for) and then read the subsequent entries of the index until it finds the next date. Depending on the datatype of esi, this sums up to reading some MB of data to count your 700k rows. Reading some MB does not take much time (and that data might even already be cached in the buffer pool, depending on how often you use the index).

To calculate the average for a column that is not included in the index, MySQL will, again, use the index to find all rows for that date (the same as before). But additionally, for every row it finds, it has to read the actual table data for that row, which means using the primary key to locate the row, reading some bytes, and repeating this 700k times. This "random access" is a lot slower than the sequential read in the first case. (This is made worse by the fact that "some bytes" is actually the innodb_page_size (16KB by default), so you may have to read up to 700k * 16KB = 11GB, compared to "some MB" for count(*); and depending on your memory configuration, some of this data might not be cached and has to be read from disk.)
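This difference between an index-only scan and per-row table lookups can be made visible with a query plan. The following is a minimal sketch using SQLite from Python (chosen only because it runs anywhere; InnoDB shows the analogous difference in EXPLAIN). The table and column names here are invented stand-ins for the ones in the question:

```python
import sqlite3

# Miniature stand-in for the table in the question: `esi` as primary key,
# an indexed `date` column, and a value column `v01` that is NOT indexed.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mytable (esi INTEGER PRIMARY KEY, date TEXT, v01 REAL)")
con.execute("CREATE INDEX idx_date ON mytable(date)")
con.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [(i, "2017-11-01" if i % 10 == 0 else "2017-11-02", float(i))
     for i in range(1000)],
)

# count(*) can be answered from the index alone:
# the plan reports a COVERING INDEX search.
plan_count = con.execute(
    "EXPLAIN QUERY PLAN SELECT count(*) FROM mytable WHERE date = '2017-11-01'"
).fetchall()
print(plan_count)

# avg(v01) must fetch v01 from the table for every matching row,
# so the plan is a plain index search followed by row lookups.
plan_avg = con.execute(
    "EXPLAIN QUERY PLAN SELECT avg(v01) FROM mytable WHERE date = '2017-11-01'"
).fetchall()
print(plan_avg)
```

The "COVERING INDEX" marker in the first plan is exactly what is missing in the second, which is why the second query has to touch the table itself.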
A solution to this is to include all used columns in the index (a "covering index"), e.g. create an index on (date, 01). Then MySQL does not need to access the table itself and can proceed, similar to the first method, by just reading the index. The size of the index will increase a bit, so MySQL will need to read "some more MB" (and perform the avg operation), but it should still be a matter of seconds.

In the comments, you mentioned that you need to calculate the average over 24 columns. If you want to calculate the avg for several columns at the same time, you would need a covering index on all of them, e.g. (date, 01, 02, ..., 24), to prevent table access. Be aware that an index containing all columns requires as much storage space as the table itself (and creating such an index will take a long time), so whether it is worth those resources depends on how important this query is.

To avoid the MySQL limit of 16 columns per index, you could split it into two indexes (and two queries). Create e.g. the indexes (date, 01, .., 12) and (date, 13, .., 24), then use

select * from (select `date`, avg(`01`), ..., avg(`12`)
               from mytable where `date` = ...) as part1
   cross join (select avg(`13`), ..., avg(`24`)
               from mytable where `date` = ...) as part2;

Make sure to document this well, as there is no obvious reason to write the query this way, but it might be worth it.

If you only ever average over a single column, you could add 24 separate indexes (on (date, 01), (date, 02), ...). Although in total they will require even more space, they might be a little bit faster, as they are individually smaller. But the buffer pool might still favour the full index, depending on factors like usage patterns and memory configuration, so you may have to test it.

Since date is part of your primary key, you could also consider changing the primary key to (date, esi). If you find the dates via the primary key, you do not need an additional step to access the table data (as you are already accessing the table), so the behaviour would be similar to a covering index. But this is a significant change to your table and can affect all other queries (e.g. those that use esi to locate rows), so it has to be considered carefully.

As you mentioned, another option would be to build a summary table where you store precalculated values, especially if you do not add or modify rows for past dates (or can keep them up-to-date with a trigger).
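The covering-index fix can be sketched the same way. Again SQLite stands in for MySQL/InnoDB, and the table and index names are invented for the demo: after adding an index on (date, v01), the avg() query becomes an index-only scan and the plan reports "COVERING INDEX":

```python
import sqlite3

# After adding an index on (date, v01), avg(v01) filtered by date can be
# answered from the index alone, without touching the table rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mytable (esi INTEGER PRIMARY KEY, date TEXT, v01 REAL)")
con.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [(i, "2017-11-01", float(i)) for i in range(100)],
)
con.execute("CREATE INDEX idx_date_v01 ON mytable(date, v01)")

plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT count(*), avg(v01) "
    "FROM mytable WHERE date = '2017-11-01'"
).fetchall()
print(plan)  # the plan now reports a COVERING INDEX search

cnt, avg = con.execute(
    "SELECT count(*), avg(v01) FROM mytable WHERE date = '2017-11-01'"
).fetchone()
print(cnt, avg)  # 100 rows, average of 0..99 = 49.5
```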
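The summary-table option can be sketched as follows. This is a minimal demo with invented table and trigger names, again using SQLite from Python as a stand-in: a per-date (count, sum) table is kept up to date by an insert trigger, so the average becomes a single-row lookup instead of a 700k-row scan:

```python
import sqlite3

# A per-date summary table maintained by an AFTER INSERT trigger.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE mytable (esi INTEGER PRIMARY KEY, date TEXT, v01 REAL);
CREATE TABLE daily_summary (date TEXT PRIMARY KEY, cnt INTEGER, total REAL);
CREATE TRIGGER trg_sum AFTER INSERT ON mytable BEGIN
    -- create the row for a new date, then accumulate count and sum
    INSERT OR IGNORE INTO daily_summary VALUES (NEW.date, 0, 0.0);
    UPDATE daily_summary SET cnt = cnt + 1, total = total + NEW.v01
     WHERE date = NEW.date;
END;
""")
con.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [(i, "2017-11-01", float(i)) for i in range(10)],
)

# The average is now a primary-key lookup on the summary table.
cnt, avg = con.execute(
    "SELECT cnt, total / cnt FROM daily_summary WHERE date = '2017-11-01'"
).fetchone()
print(cnt, avg)  # 10 rows, average of 0..9 = 4.5
```

Note that a trigger adds a small cost to every insert, so this trades write speed for read speed; it pays off when the aggregation is read far more often than rows are written.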