如何提高此数据分析的速度? [英] How do I increase the speed of this data analysis?

查看:45
本文介绍了如何提高此数据分析的速度?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要优化如何分析相当大的数据集,但是我不确定下一步是什么.我已经做了相当多的MySQL配置调整.

I need to optimize how I analyze a rather large data set, and I'm unsure of what the next steps are. I have done a fair bit of MySQL configuration tuning.

我有这个InnoDB表:

I have this InnoDB table:

+----------------+--------------+------+-----+---------+----------------+
| Field          | Type         | Null | Key | Default | Extra          |
+----------------+--------------+------+-----+---------+----------------+
| id             | int(250)     | NO   | PRI | NULL    | auto_increment |
| memory         | int(15)      | YES  | MUL | NULL    |                |
| q              | varchar(250) | YES  | MUL | NULL    |                |
| created        | datetime     | YES  |     | NULL    |                |
| modified       | datetime     | YES  |     | NULL    |                |
| dt             | datetime     | YES  | MUL | NULL    |                |
| site_id        | int(250)     | NO   | MUL | NULL    |                |
| execution_time | int(11)      | YES  | MUL | NULL    |                |
+----------------+--------------+------+-----+---------+----------------+

这里是10行的示例:

+-----------+----------+-----------------+---------------------+---------------------+---------------------+---------+----------------+
| id        | memory   | q               | created             | modified            | dt                  | site_id | execution_time |
+-----------+----------+-----------------+---------------------+---------------------+---------------------+---------+----------------+
| 266864867 | 38011080 | node/16432/edit | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:04:44 |     890 |           1534 |
| 266864868 | 46090184 | node/16432      | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:04:46 |     890 |            840 |
| 266864869 | 50329248 | node/16432/edit | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:05:16 |     890 |           2500 |
| 266864870 | 38011272 | node/16432/edit | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:07:01 |     890 |           1494 |
| 266864871 | 46087732 | node/16432      | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:07:03 |     890 |            850 |
| 266864872 | 30304428 | node/303        | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:07:12 |     890 |            113 |
| 266864873 | 50329412 | node/16432/edit | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:07:25 |     890 |           2465 |
| 266864874 | 28253112 | front_page      | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:07:25 |     890 |             86 |
| 266864875 | 28256044 | front_page      | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:08:32 |     890 |             81 |
| 266864876 | 38021072 | node/16432/edit | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:08:55 |     890 |           1458 |
+-----------+----------+-----------------+---------------------+---------------------+---------------------+---------+----------------+

这是表索引:

+----------+------------+----------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+
| Table    | Non_unique | Key_name             | Seq_in_index | Column_name    | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+----------+------------+----------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+
| memories |          0 | PRIMARY              |            1 | id             | A         |     8473766 |     NULL | NULL   |      | BTREE      |         |
| memories |          1 | index_dt             |            1 | dt             | A         |     1210538 |     NULL | NULL   | YES  | BTREE      |         |
| memories |          1 | index_execution_time |            1 | execution_time | A         |        2344 |     NULL | NULL   | YES  | BTREE      |         |
| memories |          1 | index_memory         |            1 | memory         | A         |     8473766 |     NULL | NULL   | YES  | BTREE      |         |
| memories |          1 | index_site_id        |            1 | site_id        | A         |          16 |     NULL | NULL   |      | BTREE      |         |
| memories |          1 | index_q              |            1 | q              | A         |      338950 |     NULL | NULL   | YES  | BTREE      |         |
+----------+------------+----------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+

它为许多不同的站点(site_id)存储了超过一百万条记录.对于给定的站点,可能有20,000行.存储的信息是各个页面请求的性能指标.如果重要的话,请使用非显而易见的字段:内存"字段是脚本使用的内存量,"q"是路径,"site_id"是对表站点"的引用.

It stores over a million records for many different sites (site_id). For a given site, there might be 20,000 rows. The information stored is performance metrics for individual page requests. If it matters, non-obvious fields: The memory field is how much memory the script used, q is the path, site_id is a reference to a table Sites.

我对此数据运行了两个慢查询.第一个获得25个记忆最多的页面:

I have two slow queries that I run on this data. The first gets the 25 most most memory-hog pages:

Select 
  Memory.q, count(*) as count, 
  AVG(Memory.memory) as average_memory, 
  MAX(Memory.memory) as peak_memory,
  AVG(Memory.execution_time) as average_execution_time,
  MAX(Memory.execution_time) as peak_execution_time 
FROM Memory 
WHERE site_id = $some_site_id 
ORDER BY average_memory DESC 
GROUP BY Memory.q
LIMIT 25

第二个查询获取给定网站的平均最慢25页:

The second query gets the the slowest average 25 pages for a given site:

Select 
  Memory.q, count(*) as count, 
  AVG(Memory.memory) as average_memory, 
  MAX(Memory.memory) as peak_memory,
  AVG(Memory.execution_time) as average_execution_time,
  MAX(Memory.execution_time) as peak_execution_time 
FROM Memory 
WHERE site_id = $some_site_id 
ORDER BY average_execution_time DESC 
GROUP BY Memory.q
LIMIT 25

我最近将表从MyISAM转换为InnoDB,以便这些读取不会锁定表.这导致更新此表的操作排队和滞后.

I recently converted the table from MyISAM to InnoDB, so that these reads would not lock the table. This was causing operations that update this table to queue up and lag.

除了要解决这个问题(增加InnoDB缓存大小)外,我还想看看是否还有其他选择.我从来没有使用过NoSQL数据库,但是据我了解,由于我使用聚合函数和查询,因此对它们没有太大帮助.

Beyond throwing more ram at the problem (to increase the InnoDB cache size), I want to see if there are other options. I've never worked with a NoSQL database, but from what I understand they won't be of much help here because I use aggregate functions and queries.

如果需要的话,该应用程序是用PHP编写的.

The app is written in PHP, if it matters.

有什么更好的方法来存储和分析这些数据的想法吗?

Any ideas for a better way to approach the storage and analysis of this data?

更新:

对查询进行概要分析显示,慢度全部在复制到临时表中.我将研究如何使这一步骤更快.

Profiling the query shows the slowness is all in the copying to temp table. I will research how to make this step faster.

+--------------------------------+----------+
| Status                         | Duration |
+--------------------------------+----------+
| starting                       | 0.000030 |
| checking query cache for query | 0.000065 |
| Opening tables                 | 0.000013 |
| System lock                    | 0.000004 |
| Table lock                     | 0.000014 |
| init                           | 0.000032 |
| optimizing                     | 0.000010 |
| statistics                     | 0.008119 |
| preparing                      | 0.000042 |
| Creating tmp table             | 0.000317 |
| executing                      | 0.000005 |
| Copying to tmp table           | 5.349280 |
| Sorting result                 | 0.006511 |
| Sending data                   | 0.000092 |
| end                            | 0.000005 |
| removing tmp table             | 0.001510 |
| end                            | 0.000007 |
| query end                      | 0.000004 |
| freeing items                  | 0.001163 |
| logging slow query             | 0.000006 |
| cleaning up                    | 0.000006 |
+--------------------------------+----------+
21 rows in set (0.01 sec)

mysql> show profile cpu for query 4;
+--------------------------------+----------+----------+------------+
| Status                         | Duration | CPU_user | CPU_system |
+--------------------------------+----------+----------+------------+
| starting                       | 0.000030 | 0.000000 |   0.000000 |
| checking query cache for query | 0.000065 | 0.000000 |   0.000000 |
| Opening tables                 | 0.000013 | 0.000000 |   0.000000 |
| System lock                    | 0.000004 | 0.000000 |   0.000000 |
| Table lock                     | 0.000014 | 0.000000 |   0.000000 |
| init                           | 0.000032 | 0.000000 |   0.000000 |
| optimizing                     | 0.000010 | 0.000000 |   0.000000 |
| statistics                     | 0.008119 | 0.001000 |   0.000000 |
| preparing                      | 0.000042 | 0.000000 |   0.000000 |
| Creating tmp table             | 0.000317 | 0.000000 |   0.000000 |
| executing                      | 0.000005 | 0.000000 |   0.000000 |
| Copying to tmp table           | 5.349280 | 0.687896 |   0.412937 |
| Sorting result                 | 0.006511 | 0.004999 |   0.001999 |
| Sending data                   | 0.000092 | 0.000000 |   0.000000 |
| end                            | 0.000005 | 0.000000 |   0.000000 |
| removing tmp table             | 0.001510 | 0.000000 |   0.001000 |
| end                            | 0.000007 | 0.000000 |   0.000000 |
| query end                      | 0.000004 | 0.000000 |   0.000000 |
| freeing items                  | 0.001163 | 0.000000 |   0.001000 |
| logging slow query             | 0.000006 | 0.000000 |   0.000000 |
| cleaning up                    | 0.000006 | 0.000000 |   0.000000 |
+--------------------------------+----------+----------+------------+

推荐答案

尽管确实显示site_id是多部分键(MUL)的一部分,但是您没有显示键结构.请注意,如果它不是该多部分键中的FIRST字段,则该键不能用于where子句.例如,如果您有

You don't show your key structure, though it does show that site_id is part of a multi-part key (MUL). note that if it's not the FIRST field in that multi-part key, then the key cannot be used for that where clause. e.g if you have

KEY somekey (field1, site_id, field3, ...)

然后,您的where子句必须同时包含fieldsite_id,该键才能在查询中使用.您不必按键中列出的顺序使用字段(where site_id=.. and field1=...where field1=... and site_id=...相同),但是由于field1出现在键定义中的site_id之前,因此您必须将其用作整个密钥都可以使用.

then your where clause must include BOTH field and site_id for that key to be usable in the query. You don't have to use the fields in the same order they're listed in the key (where site_id=.. and field1=... will work the same as where field1=... and site_id=...), but since field1 appears before site_id in the key's definition, you must use it as well for the entire key to be usable.

q字段也是如此.在覆盖的键中也必须首先使用它,否则这些键将不可用.

The same holds for your q field. It must also be first in the keys being covered, or those keys are unusable.

这篇关于如何提高此数据分析的速度?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆