MySQL: Splitting a large table into partitions or separate tables?

Problem description

I have a MySQL database with over 20 tables, but one of them is significantly large because it collects measurement data from different sensors. Its size is around 145 GB on disk and it contains over 1 billion records. All this data is also being replicated to another MySQL server.

I'd like to separate the data to smaller "shards", so my question is which of the below solutions would be better. I'd use the record's "timestamp" for dividing the data by years. Almost all SELECT queries that are executed on this table contain the "timestamp" field in the "where" part of the query.

So below are the solutions that I cannot decide on:

  1. Use MySQL partitioning and divide the data by year (e.g. partition1 - 2010, partition2 - 2011, etc.)
  2. Create separate tables and divide the data by year (e.g. tables named measurement_2010, measurement_2011, etc.)
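Option 1 could be sketched as follows; the table and column names are assumptions based on the schema shown later in the question. Note that MySQL requires the partitioning column to appear in every unique key of the table, which is why `timestamp` is part of the primary key here:

```sql
-- Sketch of option 1: yearly RANGE partitions on a hypothetical
-- `measurement` table (names are illustrative, not from the question).
CREATE TABLE measurement (
    device_id SMALLINT UNSIGNED NOT NULL,
    `timestamp` DATETIME NOT NULL,
    sensor_1_temp FLOAT,
    sensor_2_temp FLOAT,
    -- ... ~30 more sensor columns ...
    PRIMARY KEY (device_id, `timestamp`)
) ENGINE=InnoDB
PARTITION BY RANGE (YEAR(`timestamp`)) (
    PARTITION p2010 VALUES LESS THAN (2011),
    PARTITION p2011 VALUES LESS THAN (2012),
    PARTITION p2012 VALUES LESS THAN (2013),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);
```

A query with a `WHERE` clause on `timestamp` can then be answered from only the matching partitions (partition pruning).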

Are there any other (newer) possible options that I'm not aware of?

I know that in the first case MySQL itself would get the data from the 'shards', and in the second case I'd have to write a kind of wrapper for it and do it myself. Is there any way, in the second case, to make all the separate tables appear as 'one big table' to fetch data from?
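One common way to present separate per-year tables as a single queryable object is a `UNION ALL` view; the table names below are assumptions. Be aware that MySQL gets no partition pruning here, so how much of the `WHERE` clause is pushed into each branch depends on the optimizer:

```sql
-- Sketch: a view over hypothetical per-year tables with identical schemas.
CREATE VIEW measurement_all AS
    SELECT * FROM measurement_2010
    UNION ALL
    SELECT * FROM measurement_2011;

-- Queries can then target the view as if it were one table:
SELECT * FROM measurement_all
WHERE `timestamp` >= '2011-03-01'
  AND `timestamp` <  '2011-04-01';
```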

我知道这个问题过去已经提出过,但是也许有人想出了一些新的解决方案(我不知道),或者最佳实践解决方案现在已经改变了.:)

I know this question has been asked in the past, but maybe somebody has come up with a new solution (that I'm not aware of), or the best-practice answer has changed by now. :)

Thank you very much for your help.

The schema is similar to this:

device_id (INT)
timestamp (DATETIME)
sensor_1_temp (FLOAT)
sensor_2_temp (FLOAT)
etc. (30 more for instance)

All sensor temperatures are written at the same moment, once a minute. Note that there are around 30 different sensor measurements written in each row. This data is mostly used for displaying graphs and for some other statistical purposes.

Accepted answer

Well, if you are hoping for a new answer, that means you have probably read my answers, and I sound like a broken record. See Partitioning blog for the few use cases where partitioning can help performance. Yours does not sound like any of the 4 cases.

Shrink device_id. INT is 4 bytes; do you really have millions of devices? TINYINT UNSIGNED is 1 byte and a range of 0..255. SMALLINT UNSIGNED is 2 bytes and a range of 0..64K. That will shrink the table a little.
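The shrink is a one-line `ALTER TABLE`; the table name is an assumption. On a 145 GB table this rewrites every row, so it will take a long time and should be tried on a copy first:

```sql
-- Sketch: shrink device_id from INT (4 bytes) to SMALLINT UNSIGNED
-- (2 bytes, range 0..65535). Table name is hypothetical.
ALTER TABLE measurement
    MODIFY device_id SMALLINT UNSIGNED NOT NULL;
```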

If your real question is about how to manage so much data, then let's "think outside the box". Read on.

Graphing... What date ranges are you graphing?

  • The "last" hour/day/week/month/year?
  • An arbitrary hour/day/week/month/year?
  • An arbitrary range, not bound to day/week/month/year boundaries?

What are you graphing?

  • The average value over a day?
  • The max/min over a day?
  • Candlesticks (etc.) for a day or week?

Regardless of the case, you should build (and incrementally maintain) a Summary Table of the data. A row would contain the summary info for one hour. I would suggest:

CREATE TABLE Summary (
    device_id SMALLINT UNSIGNED NOT NULL,
    sensor_id TINYINT UNSIGNED NOT NULL,
    hr TIMESTAMP NOT NULL,
    avg_val FLOAT NOT NULL,
    min_val FLOAT NOT NULL,
    max_val FLOAT NOT NULL,
    PRIMARY KEY (device_id, sensor_id, hr)
) ENGINE=InnoDB;

The one Summary table might be 9GB (for current amount of data).

SELECT hr,
       avg_val,
       min_val,
       max_val
    FROM Summary
    WHERE device_id = ?
      AND sensor_id = ?
      AND hr >= ?
      AND hr  < ? + INTERVAL 20 DAY;

Would give you the hi/lo/avg values for 480 hours; enough to graph? Grabbing 480 rows from the summary table is a lot faster than grabbing 60*480 rows from the raw data table.

Getting similar data for a year would probably choke a graphing package, so it may be worth building a summary of the summary -- with resolution of a day. It would be about 0.4GB.

There are a few different ways to build the Summary table(s); we can discuss that after you have pondered its beauty and read the Summary tables blog. It may be that gathering one hour's worth of data, then augmenting the Summary table, is the best way. That would be somewhat like the flip-flop discussed in my Staging table blog.
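The hourly augmentation step could look like the sketch below, assuming the raw data lives in a hypothetical `measurement` table with one FLOAT column per sensor. One such statement per sensor column, run shortly after each hour closes:

```sql
-- Sketch: fold one closed hour of raw minute-by-minute data into Summary.
-- `measurement`, sensor numbering, and the literal hour are assumptions;
-- in practice the hour boundaries would be computed, not hard-coded.
INSERT INTO Summary (device_id, sensor_id, hr, avg_val, min_val, max_val)
SELECT device_id,
       1,                       -- sensor_id assigned to sensor_1_temp
       '2023-05-01 12:00:00',   -- the hour being summarized
       AVG(sensor_1_temp),
       MIN(sensor_1_temp),
       MAX(sensor_1_temp)
    FROM measurement
    WHERE `timestamp` >= '2023-05-01 12:00:00'
      AND `timestamp` <  '2023-05-01 13:00:00'
    GROUP BY device_id
ON DUPLICATE KEY UPDATE       -- makes the job safe to re-run
    avg_val = VALUES(avg_val),
    min_val = VALUES(min_val),
    max_val = VALUES(max_val);
```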

And, if you had the hourly summaries, do you really need the minute-by-minute data? Consider throwing it away, or maybe the data older than, say, one month. That leads to using partitioning, but only for its benefit in deleting old data, as discussed in "Case 1" of the Partitioning blog. That is, you would have daily partitions, using DROP PARTITION and REORGANIZE PARTITION every night to shift the time range of the "Fact" table. This would decrease your 145GB footprint without losing much data. New footprint: about 12GB (hourly summary + the last 30 days' minute-by-minute details).
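The nightly rotation could be sketched as below, assuming a hypothetical `measurement` table with daily partitions on `TO_DAYS(timestamp)`; partition names and dates are illustrative. Dropping a partition is nearly instant, unlike a `DELETE` over millions of rows:

```sql
-- Sketch: drop the oldest day and open a partition for the next day.
ALTER TABLE measurement DROP PARTITION p20230401;

ALTER TABLE measurement REORGANIZE PARTITION pmax INTO (
    PARTITION p20230502 VALUES LESS THAN (TO_DAYS('2023-05-03')),
    PARTITION pmax      VALUES LESS THAN MAXVALUE
);
```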

PS: The Summary Table blog shows how to get standard deviation.
