MySQL performance: multiple tables vs. index on single table and partitions

Question

I am wondering what is more efficient and faster in performance:

Having an index on one big table, or multiple smaller tables without indexes?

Since this is a pretty abstract problem let me make it more practical:
I have one table with statistics about users (20,000 users and about 30 million rows overall). The table has about 10 columns including the user_id, actions, timestamps, etc.
Most common applications are: Inserting data by user_id and retrieving data by user_id (SELECT statements never include multiple user_id's).

So far I have an INDEX on user_id, and the query looks something like this:

SELECT * FROM statistics WHERE user_id = 1

Now, with more and more rows, the table gets slower and slower. INSERT statements slow down because the INDEX gets bigger and bigger; SELECT statements slow down because there are more rows to search through.

Now I was wondering: why not have one statistics table for each user and change the query to something like this instead:

SELECT * FROM statistics_1

where 1 is obviously the user_id.

This way, no INDEX is needed and there is far less data in each table, so INSERT and SELECT statements should be much faster.

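Now my questions:

Are there any real-world disadvantages to handling so many tables (20,000 in my case) instead of using one table with an INDEX?

Would my approach actually speed things up, or might the lookup of the right table eventually slow things down more than anything else?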

Recommended answer

Creating 20,000 tables is a bad idea. You'll need 40,000 tables before long, and then more.

I called this syndrome Metadata Tribbles in my book SQL Antipatterns. You see it happen every time you plan to create a "table per X" or a "column per X".

This does cause real performance problems when you have tens of thousands of tables. Each table requires MySQL to maintain internal data structures, file descriptors, a data dictionary, etc.

There are also practical operational consequences. Do you really want to create a system that requires you to create a new table every time a new user signs up?

Instead, I recommend you use MySQL partitioning.

Here's an example of partitioning the table:

CREATE TABLE statistics (
  id INT AUTO_INCREMENT NOT NULL,
  user_id INT NOT NULL,
  PRIMARY KEY (id, user_id)
) PARTITION BY HASH(user_id) PARTITIONS 101;

This gives you the benefit of defining one logical table, while also dividing the table into many physical tables for faster access when you query for a specific value of the partition key.

For example, when you run a query like the one in your question, MySQL accesses only the partition that contains the specific user_id:

mysql> EXPLAIN PARTITIONS SELECT * FROM statistics WHERE user_id = 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: statistics
   partitions: p1    <--- this shows it touches only one partition 
         type: index
possible_keys: NULL
          key: PRIMARY
      key_len: 8
          ref: NULL
         rows: 2
        Extra: Using where; Using index

The HASH method of partitioning means that rows are placed in a partition by a modulus of the integer partition key. This does mean that many user_id values map to the same partition, but each partition would have only 1/Nth as many rows on average (where N is the number of partitions). And you define the table with a constant number of partitions, so you don't have to expand it every time you get a new user.
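
To make the modulus mapping concrete, here is a small sketch that can be run in any MySQL session. The sample user_id values are hypothetical and chosen only for illustration; 101 matches the PARTITIONS 101 clause in the CREATE TABLE above.

```sql
-- Illustration only: the ids below are made-up sample values.
-- For PARTITION BY HASH(user_id) PARTITIONS 101, a row is stored in the
-- partition whose number equals MOD(user_id, 101).
SELECT user_id,
       MOD(user_id, 101) AS partition_number
FROM (
  SELECT 1 AS user_id
  UNION ALL SELECT 102
  UNION ALL SELECT 203
  UNION ALL SELECT 20000
) AS sample_ids;
```

The ids 1, 102 and 203 all leave a remainder of 1, so they share one partition, which is the "many user_id values map to the same partition" behaviour described above.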

You can choose any number of partitions up to 1024 (or 8192 in MySQL 5.6), but some people have reported performance problems when they go that high.

It is recommended to use a prime number of partitions. If your user_id values follow a pattern (like using only even numbers), a prime number of partitions helps distribute the data more evenly.
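
One way to check this against real data before settling on a partition count is to count how many rows would land in each bucket. The sketch below assumes the statistics table from the question and a candidate count of 101; swap in another number (for example 100) to compare. It reads the whole table, so it is probably best run off-peak or on a replica.

```sql
-- Preview how existing rows would spread across 101 hash partitions.
-- 101 is the candidate partition count; change it to test other values.
SELECT MOD(user_id, 101) AS bucket,
       COUNT(*)          AS row_count
FROM statistics
GROUP BY bucket
ORDER BY row_count DESC;
```

A result with far fewer distinct buckets than the candidate count, or with a few buckets much larger than the rest, suggests the ids follow a pattern that the chosen count does not spread well.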

Re your questions in the comments:


How could I determine a reasonable number of partitions?

For HASH partitioning, if you use 101 partitions like I show in the example above, then any given partition has about 1% of your rows on average. You said your statistics table has 30 million rows, so with this partitioning you would have only about 300k rows per partition. That is much easier for MySQL to read through. You can (and should) use indexes as well. Each partition has its own index, and it will be only about 1% as large as the index on the whole unpartitioned table would be.

So the answer to how you can determine a reasonable number of partitions is: how big is your whole table, and how big do you want the partitions to be on average?
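
As a rough sketch of that arithmetic, the query below divides the table's row count by a target partition size. The 300000 target is taken from the 300k figure above and is only an assumption about what "small enough" means for your workload; table_rows is an estimate for InnoDB tables, not an exact count.

```sql
-- Estimate a lower bound on the partition count for a target of
-- roughly 300,000 rows per partition (an assumed target, adjust freely).
SELECT table_rows,
       CEIL(table_rows / 300000) AS min_partitions
FROM INFORMATION_SCHEMA.TABLES
WHERE table_schema = DATABASE()
  AND table_name = 'statistics';
```

With about 30 million rows this gives roughly 100, and rounding up to the next prime yields the 101 used in the example.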


Shouldn't the number of partitions grow over time? If so, how can I automate that?

The number of partitions doesn't necessarily need to grow if you use HASH partitioning. Eventually you may have 30 billion rows total, but I have found that when your data volume grows by orders of magnitude, that demands a new architecture anyway. If your data grows that large, you probably need sharding over multiple servers as well as partitioning into multiple tables.

That said, you can re-partition a table with ALTER TABLE:

ALTER TABLE statistics PARTITION BY HASH(user_id) PARTITIONS 401;

This has to restructure the table (like most ALTER TABLE changes), so expect it to take a while.

You may want to monitor the size of data and indexes in partitions:

-- partition_name identifies which partition each row of the result describes
SELECT table_schema, table_name, partition_name, table_rows, data_length, index_length
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE partition_method IS NOT NULL;

As with any table, you want the total size of active indexes to fit in your buffer pool, because if MySQL has to swap parts of indexes in and out of the buffer pool during SELECT queries, performance suffers.
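
A quick way to compare the two numbers is sketched below. It assumes the table uses InnoDB (so innodb_buffer_pool_size is the relevant setting) and that it is named statistics in the current schema. It sums all index bytes for the table, which overstates the "active" portion if only some partitions are queried regularly.

```sql
-- Compare the table's total index size with the InnoDB buffer pool size.
-- Both values are in bytes; index_length is an approximate figure.
SELECT SUM(index_length)         AS total_index_bytes,
       @@innodb_buffer_pool_size AS buffer_pool_bytes
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE table_schema = DATABASE()
  AND table_name = 'statistics';
```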

If you use RANGE or LIST partitioning, then adding, dropping, merging, and splitting partitions is much more common. See http://dev.mysql.com/doc/refman/5.6/en/partitioning-management-range-list.html

I encourage you to read the manual section on partitioning, and also check out this nice presentation: Boost Performance With MySQL 5.1 Partitions.
