Why is the performance of MySQL queries so bad when using a CHAR/VARCHAR index?

Question

First, I will describe a simplified version of the problem domain.

There is a table strings:

CREATE TABLE strings (
  value CHAR(3) COLLATE utf8_unicode_ci NOT NULL,
  INDEX(value)
) ENGINE=InnoDB;

As you can see, it has a non-unique index on a CHAR(3) column.

The table is populated using the following script:

CREATE TABLE a_variants (
  letter CHAR(1) COLLATE utf8_unicode_ci  NOT NULL
) ENGINE=MEMORY;

INSERT INTO a_variants VALUES -- 60 variants of letter 'A'
  ('A'),('a'),('À'),('Á'),('Â'),('Ã'),('Ä'),('Å'),('à'),('á'),('â'),('ã'),
  ('ä'),('å'),('Ā'),('ā'),('Ă'),('ă'),('Ą'),('ą'),('Ǎ'),('ǎ'),('Ǟ'),('ǟ'),
  ('Ǡ'),('ǡ'),('Ǻ'),('ǻ'),('Ȁ'),('ȁ'),('Ȃ'),('ȃ'),('Ȧ'),('ȧ'),('Ḁ'),('ḁ'),
  ('Ạ'),('ạ'),('Ả'),('ả'),('Ấ'),('ấ'),('Ầ'),('ầ'),('Ẩ'),('ẩ'),('Ẫ'),('ẫ'),
  ('Ậ'),('ậ'),('Ắ'),('ắ'),('Ằ'),('ằ'),('Ẳ'),('ẳ'),('Ẵ'),('ẵ'),('Ặ'),('ặ');

INSERT INTO strings
  SELECT CONCAT(a.letter, b.letter, c.letter) -- 60^3 variants of string 'AAA'
    FROM a_variants a, a_variants b, a_variants c
  UNION ALL SELECT 'BBB'; -- one variant of string 'BBB'

So, it contains 216000 indistinguishable (in terms of the utf8_unicode_ci collation) variants of string "AAA" and one variant of string "BBB":

SELECT value, COUNT(*) FROM strings GROUP BY value;

+-------+----------+
| value | COUNT(*) |
+-------+----------+
| AAA   |   216000 |
| BBB   |        1 |
+-------+----------+
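
The grouping above can be imitated outside MySQL. The following is a minimal Python sketch, not MySQL's actual algorithm: the hypothetical helper fold() approximates utf8_unicode_ci equality by stripping combining marks and case, whereas MySQL compares collation weights. It uses only six of the 60 variants to stay small, and shows strings that are all distinct byte sequences collapsing into two groups.

```python
import unicodedata
from collections import Counter
from itertools import product

def fold(s: str) -> str:
    """Rough stand-in for utf8_unicode_ci equality (an assumption, not
    MySQL's real algorithm): drop accents, then ignore case."""
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()

# Six of the 60 variants from the script above, to keep the example small.
variants = ["A", "a", "À", "á", "Ǟ", "ặ"]
strings = ["".join(p) for p in product(variants, repeat=3)] + ["BBB"]

# Group by the folded key, like GROUP BY value under the collation.
groups = Counter(fold(s) for s in strings)

# Count how many different byte sequences are actually stored.
distinct_bytes = len({s.encode("utf-8") for s in strings})

print(dict(groups))
print(distinct_bytes)
```

Every stored string is a different byte sequence, yet only two groups come out, which is the situation the index on value has to represent.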

As value is indexed, I expect the following two queries to have similar performance:

SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'AAA';
SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'BBB';

But in practice the first one is more than 300 times slower than the second. See:

+----------+------------+---------------------------------------------------------------+
| Query_ID | Duration   | Query                                                         |
+----------+------------+---------------------------------------------------------------+
|        1 | 0.11749275 | SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'AAA' |
|        2 | 0.00033325 | SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'BBB' |
|        3 | 0.11718050 | SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'AAA' |
+----------+------------+---------------------------------------------------------------+

-- I ran the 'AAA' query twice here just to be sure.

If I change the size of the indexed column or change its type to VARCHAR, the performance problem still appears. Meanwhile, in analogous situations where the non-unique index is not on a CHAR/VARCHAR column (e.g. INT), queries are as fast as expected.

So, the question is: why is the performance of MySQL queries so bad when using a CHAR/VARCHAR index?

I have a strong feeling that MySQL performs a full linear scan of all the values matched by the index key. But why does it do so when it could just return the count of the matched rows? Am I missing something that makes the scan really necessary? Or is it a sad shortcoming of the MySQL optimizer?

Recommended answer

Clearly, the issue is that the query is doing an index scan. The alternative approach would be to do two index lookups, for the first and last values that are the same, and then use meta information in the index for the calculation. Based on your observations, MySQL does both.
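
To make the two strategies concrete, here is a minimal Python sketch in which a sorted list stands in for the index. It is an illustration only, not how InnoDB stores or reads an index: the list, count_by_scan and count_by_two_lookups are hypothetical names, and a single 'AAA' key stands in for all the collation-equal variants.

```python
from bisect import bisect_left, bisect_right

# Sorted list standing in for the index (an assumption for illustration).
# 'AAA' represents every variant that the collation treats as equal.
index = ["AAA"] * 216000 + ["BBB"]

def count_by_scan(index, key):
    """Find the first match, then walk every matching entry.
    The work grows with the number of matches."""
    start = bisect_left(index, key)
    n = 0
    while start + n < len(index) and index[start + n] == key:
        n += 1
    return n

def count_by_two_lookups(index, key):
    """Locate both ends of the run of equal keys and subtract.
    Two binary searches, no walk over the matches."""
    return bisect_right(index, key) - bisect_left(index, key)

print(count_by_scan(index, "AAA"), count_by_two_lookups(index, "AAA"))
print(count_by_scan(index, "BBB"), count_by_two_lookups(index, "BBB"))
```

A flat array gives the positions of both ends for free. A B-tree would need extra per-node counts to turn two lookups into a row count, which is presumably the "meta information" mentioned above.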

The rest of this answer is speculation.

The performance is "only" 300 times slower, rather than 200,000 times slower, because of the fixed overhead in reading the index. Actually scanning the entries is quite fast compared to the other operations that are needed.
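
As a rough check of that reasoning, the timings from the question can be plugged into a few lines of Python. This is back-of-the-envelope arithmetic on the two measured durations only. It assumes the 'BBB' query approximates the fixed overhead and that the rest of the 'AAA' time is spent on the 216000 matching entries.

```python
# Durations taken from the profiling table in the question.
aaa_seconds = 0.11749275   # Query_ID 1, value = 'AAA'
bbb_seconds = 0.00033325   # Query_ID 2, value = 'BBB'
matching_rows = 216000     # entries matched by 'AAA'

# How many times slower the 'AAA' query is.
slowdown = aaa_seconds / bbb_seconds

# Extra time attributed to each matching entry, in microseconds,
# assuming the 'BBB' duration approximates the fixed overhead.
per_entry_us = (aaa_seconds - bbb_seconds) / matching_rows * 1e6

print(f"slowdown: {slowdown:.0f}x")
print(f"extra time per matching entry: {per_entry_us:.2f} microseconds")
```

Under those assumptions the slowdown is about 350x and each matching entry costs roughly half a microsecond, which is consistent with a scan whose per-entry cost is small next to the fixed cost of a lookup.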

There is a fundamental difference between numbers and strings when it comes to comparisons. The engine can just look at the bit representations of two numbers and recognize whether they are the same or different. Unfortunately, for strings, you need to take encoding/collation into account. I think that is why it needs to look at the values.

It is possible that if you had 216,000 copies of exactly the same string, then MySQL would be able to do the count using metadata in the index. In other words, the indexer is smart enough to use metadata for exact equality comparisons. But, it is not smart enough to take encoding into account.
