How to speed up "select count(*)" with "group by" and "where"?


Question

How to speed up select count(*) with group by?
It's too slow and is used very frequently.
I'm having big trouble using select count(*) and group by with a table that has more than 3,000,000 rows.

select object_title,count(*) as hot_num   
from  relations 
where relation_title='XXXX'   
group by object_title  

relation_title and object_title are varchar. The condition where relation_title='XXXX' returns more than 1,000,000 rows, which keeps the index on object_title from working well.

Answer


Here are several things I'd try, in order of increasing difficulty:

(easier) - make sure you have the right covering index

CREATE INDEX ix_temp ON relations (relation_title, object_title);

This should maximize performance given your existing schema, since (unless your version of MySQL's optimizer is really dumb!) it will minimize the amount of I/O needed to satisfy the query (unlike the reverse column order, where the whole index would have to be scanned), and it will cover the query so you won't have to touch the clustered index.
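A quick way to check that the optimizer actually uses the covering index is to run EXPLAIN on the original query; a covered scan shows up in the `Extra` column (the index name `ix_temp` comes from the statement above):

```sql
-- Sanity check: with the covering index in place, EXPLAIN should report
-- key = ix_temp and Extra containing 'Using index' (no table rows touched).
EXPLAIN
SELECT object_title, COUNT(*) AS hot_num
FROM relations
WHERE relation_title = 'XXXX'
GROUP BY object_title;
```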


(a little harder) - make sure your varchar fields are as small as possible

One of the performance challenges with varchar indexes on MySQL is that, when processing a query, the full declared size of the field is pulled into RAM. So if you have a varchar(256) but are only using 4 characters, you still pay the 256-byte RAM cost while the query is being processed. Ouch! So if you can easily shrink your varchar limits, this should speed up your queries.
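As a sketch of that change (the sizes below are illustrative, not from the question), measure the real maximum lengths first so nothing gets truncated, then tighten the declarations:

```sql
-- Find the actual maximum lengths before shrinking anything.
SELECT MAX(CHAR_LENGTH(relation_title)) AS max_rel,
       MAX(CHAR_LENGTH(object_title))   AS max_obj
FROM relations;

-- If, say, nothing exceeds 100 characters (hypothetical figure):
ALTER TABLE relations
  MODIFY relation_title VARCHAR(100) NOT NULL,
  MODIFY object_title   VARCHAR(100) NOT NULL;
```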

(harder) - normalize


30% of your rows having a single string value is a clear cry for normalizing into another table so you're not duplicating strings millions of times. Consider normalizing into three tables and using integer IDs to join them.


In some cases, you can normalize under the covers and hide the normalization with views which match the name of the current table... then you only need to make your INSERT/UPDATE/DELETE queries aware of the normalization but can leave your SELECTs alone.
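A minimal sketch of that three-table idea (all table and column names here are illustrative, not from the question):

```sql
-- Each distinct string is stored once and referenced by an integer ID.
CREATE TABLE relation_titles (
  id    INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(100) NOT NULL UNIQUE
);

CREATE TABLE object_titles (
  id    INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(100) NOT NULL UNIQUE
);

CREATE TABLE relations_norm (
  relation_title_id INT UNSIGNED NOT NULL,
  object_title_id   INT UNSIGNED NOT NULL,
  KEY ix_rel_obj (relation_title_id, object_title_id)
);

-- A view with the old table's shape keeps existing SELECTs working;
-- after renaming the old table away, the view could take its name.
CREATE VIEW relations_v AS
SELECT rt.title AS relation_title, ot.title AS object_title
FROM relations_norm rn
JOIN relation_titles rt ON rt.id = rn.relation_title_id
JOIN object_titles   ot ON ot.id = rn.object_title_id;
```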

(hardest) - hash your string columns and index the hashes


If normalizing means changing too much code, but you can change your schema a little bit, you may want to consider creating 128-bit hashes for your string columns (using the MD5 function). In this case (unlike normalization) you don't have to change all your queries, only the INSERTs and some of the SELECTs. Anyway, you'll want to hash your string fields, and then create an index on the hashes, e.g.

CREATE INDEX ix_temp ON relations (relation_title_hash, object_title_hash);


Note that you'll need to play around with the SELECT to make sure you are doing the computation via the hash index and not pulling in the clustered index (required to resolve the actual text value of object_title in order to satisfy the query).
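Under that schema (the `_hash` columns are an assumption introduced by this approach, not present in the original table), the grouped count might be rewritten along these lines:

```sql
-- Filter and group on the hash columns so only the hash index is scanned.
-- MD5() here must match however the hashes were computed at INSERT time.
SELECT object_title_hash, COUNT(*) AS hot_num
FROM relations
WHERE relation_title_hash = MD5('XXXX')
GROUP BY object_title_hash;
```

Note the trade-off: the result now carries hashes instead of readable titles, so if the caller needs the text it has to be resolved afterwards, which is exactly the clustered-index touch you're trying to avoid in the hot path.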

Also, if relation_title has a small varchar size but object_title is long, then you can potentially hash only object_title and create the index on (relation_title, object_title_hash).


Note that this solution only helps if one or both of these fields is very long relative to the size of the hashes.

Also note that there are interesting case-sensitivity/collation impacts from hashing, since the hash of a lowercase string is not the same as the hash of an uppercase one. So you'll need to make sure you apply canonicalization to the strings before hashing them; in other words, only hash the lowercased form if you're in a case-insensitive DB. You may also want to trim spaces from the beginning or end, depending on how your DB handles leading/trailing spaces.
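For instance, assuming the hypothetical `_hash` columns from above, the canonicalization could be baked into both the write path and the lookup so the two always agree:

```sql
-- Canonicalize before hashing: lowercase + trim, so that
-- 'XXXX ', 'xxxx' and 'XXXX' all map to the same hash value.
UPDATE relations
SET relation_title_hash = MD5(LOWER(TRIM(relation_title))),
    object_title_hash   = MD5(LOWER(TRIM(object_title)));

-- Lookups must apply the exact same canonicalization:
SELECT object_title_hash, COUNT(*) AS hot_num
FROM relations
WHERE relation_title_hash = MD5(LOWER(TRIM('XXXX')))
GROUP BY object_title_hash;
```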

