Subqueries and MySQL Cache for 18M+ row table


Question

As this is my first post it seems I can only post one link, so I have listed the sites I'm referring to at the bottom. In a nutshell, my goal is to make the database return results faster; I have tried to include as much relevant information as I could think of to help frame the questions at the bottom of the post.

8 processors
model name      : Intel(R) Xeon(R) CPU           E5440  @ 2.83GHz
cache size      : 6144 KB
cpu cores       : 4 

top - 17:11:48 up 35 days, 22:22, 10 users,  load average: 1.35, 4.89, 7.80
Tasks: 329 total,   1 running, 328 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 87.4%id, 12.5%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8173980k total,  5374348k used,  2799632k free,    30148k buffers
Swap: 16777208k total,  6385312k used, 10391896k free,  2615836k cached

However, we are looking at moving the MySQL installation to a different machine in the cluster that has 256 GB of RAM.

My MySQL table looks like:

CREATE TABLE ClusterMatches 
(
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    cluster_index INT, 
    matches LONGTEXT,
    tfidf FLOAT,
    INDEX(cluster_index)   
);

It has approximately 18M rows, with 1M unique cluster_index values and 6K unique matches. The SQL query I am generating in PHP looks like:

$sql_query="SELECT `matches`,sum(`tfidf`) FROM 
(SELECT * FROM Test2_ClusterMatches WHERE `cluster_index` in (".$clusters.")) 
AS result GROUP BY `matches` ORDER BY sum(`tfidf`) DESC LIMIT 0, 10;";

where $clusters contains a string of approximately 3,000 comma-separated cluster_index values. This query touches approximately 50,000 rows and takes approximately 15s to run; when the same query is run again it takes approximately 1s.

  1. The contents of the table can be assumed to be static.
  2. The number of concurrent users is low.
  3. The query above is currently the only query that will be run against the table.

Subquery


Based on this post [stackoverflow: Cache/Re-Use a Subquery in MySQL][1] and the improvement in query time I believe my subquery can be indexed.

mysql> EXPLAIN EXTENDED SELECT `matches`,sum(`tfidf`) FROM 
(SELECT * FROM ClusterMatches WHERE `cluster_index` in (1,2,...,3000)) 
AS result GROUP BY `matches` ORDER BY sum(`tfidf`) ASC LIMIT 0, 10;

+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+
| id | select_type | table                | type  | possible_keys | key           | key_len | ref  | rows  | Extra                           |
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+
|  1 | PRIMARY     | <derived2>           | ALL   | NULL          | NULL          | NULL    | NULL | 48528 | Using temporary; Using filesort | 
|  2 | DERIVED     | ClusterMatches       | range | cluster_index | cluster_index | 5       | NULL | 53689 | Using where                     | 
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+

According to this older article, [Optimizing MySQL: Queries and Indexes][2], the bad ones to see in the Extra info are "using temporary" and "using filesort".
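
To see how often these actually spill to disk on the server, the standard status counters can be checked (just a quick sketch, not something taken from the article):

SHOW GLOBAL STATUS LIKE 'Created_tmp%tables';  -- in-memory vs. on-disk temporary tables
SHOW GLOBAL STATUS LIKE 'Sort_merge_passes';   -- filesorts that needed more than one merge pass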

Query cache is available, but effectively turned off as the size is currently set to zero


mysqladmin variables;
+---------------------------------+----------------------+
| Variable_name                   | Value                |
+---------------------------------+----------------------+
| bdb_cache_size                  | 8384512              | 
| binlog_cache_size               | 32768                | 
| expire_logs_days                | 0                    |
| have_query_cache                | YES                  | 
| flush                           | OFF                  |
| flush_time                      | 0                    |
| innodb_additional_mem_pool_size | 1048576              |
| innodb_autoextend_increment     | 8                    |
| innodb_buffer_pool_awe_mem_mb   | 0                    |
| innodb_buffer_pool_size         | 8388608              |
| join_buffer_size                | 131072               |
| key_buffer_size                 | 8384512              |
| key_cache_age_threshold         | 300                  |
| key_cache_block_size            | 1024                 |
| key_cache_division_limit        | 100                  |
| max_binlog_cache_size           | 18446744073709547520 | 
| sort_buffer_size                | 2097144              |
| table_cache                     | 64                   | 
| thread_cache_size               | 0                    | 
| query_cache_limit               | 1048576              |
| query_cache_min_res_unit        | 4096                 |
| query_cache_size                | 0                    |
| query_cache_type                | ON                   |
| query_cache_wlock_invalidate    | OFF                  |
| read_rnd_buffer_size            | 262144               |
+---------------------------------+----------------------+
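
Since query_cache_type is already ON, presumably the cache can be switched on at runtime just by giving it some memory; a sketch (the 64M figure is purely illustrative, not a recommendation from any of the articles):

SET GLOBAL query_cache_size = 64 * 1024 * 1024;   -- currently 0, i.e. the cache is disabled
SHOW VARIABLES LIKE 'query_cache_size';
SHOW STATUS LIKE 'Qcache%';                        -- hit/insert counters to confirm it is being used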

Based on this article on [MySQL Database Performance Tuning][3] I believe that the values I need to tweak are the following (a sketch of possible settings follows the list):

  1. table_cache
  2. key_buffer
  3. sort_buffer
  4. read_buffer_size
  5. record_rnd_buffer (used for GROUP BY and ORDER BY clauses)
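
As a starting point these can be set at runtime; a sketch only, with placeholder values I would still have to benchmark rather than recommendations from the article (record_rnd_buffer corresponds to read_rnd_buffer_size in this MySQL version):

SET GLOBAL table_cache          = 512;                 -- currently 64
SET GLOBAL key_buffer_size      = 256 * 1024 * 1024;   -- MyISAM index cache
SET GLOBAL sort_buffer_size     = 4 * 1024 * 1024;     -- per connection, used by filesort
SET GLOBAL read_buffer_size     = 2 * 1024 * 1024;     -- per connection, sequential scans
SET GLOBAL read_rnd_buffer_size = 8 * 1024 * 1024;     -- per connection, helps ORDER BY/GROUP BY reads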

Identified areas for improvement - MySQL query tuning


  1. Changing the datatype for matches to an int index pointing to another table, since [MySQL will indeed use a dynamic row format if it contains variable length fields like TEXT or BLOB, which, in this case, means sorting needs to be done on disk. The solution is not to eschew these datatypes, but rather to split off such fields into an associated table.][4]
  2. Indexing the new match_index field so that the GROUP BY matches occurs faster, based on the statement ["You should probably create indices for any field on which you are selecting, grouping, ordering, or joining."][5]

Tools


To tweak performance I plan to use:

  1. [EXPLAIN][6], with reference to its [output format][7]
  2. [ab - Apache HTTP server benchmarking tool][8]
  3. [Profiling][9] and [log data][10] (checked with the sketch below)
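
For the log data side, a quick check of whether the slow query log is already enabled and what threshold it uses (a sketch; as far as I know, on this MySQL version the log itself has to be turned on at startup with log-slow-queries in my.cnf):

SHOW VARIABLES LIKE 'log_slow_queries';   -- OFF unless the server was started with --log-slow-queries
SHOW VARIABLES LIKE 'long_query_time';    -- queries slower than this many seconds get logged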

Future Database Size


The goal is to build a system that can have 1M unique cluster_index values and 1M unique match values, approximately 3,000,000,000 table rows, with a response time to the query of around 0.5s (we can add more RAM as necessary and distribute the database across the cluster).

  1. I assume we want to keep the entire recordset in RAM so the query never touches disk - if we keep the whole database in the MySQL cache, does this eliminate the need for memcachedb?
  2. Is trying to keep the whole database in the MySQL cache a flawed strategy, given that it isn't designed to be persistent? Would something like memcachedb or Redis be a better approach?
  3. Is the temporary table "result" created by the query automatically destroyed once the query completes?
  4. Should we switch from InnoDB to MyISAM (since MyISAM is generally faster for read-heavy data, while InnoDB handles concurrent writes better)[11]?
  5. My [Query Cache Configuration][12] appears to have the cache size set to zero, so why does the query currently run faster the second time it is run?
  6. Can I restructure the query to eliminate "using temporary" and "using filesort"? Should I use a join instead of a subquery? (a possible rewrite is sketched after this list)
  7. How do you view the size of the MySQL [Data Cache][13]?
  8. What sizes would you suggest for the table_cache, key_buffer, sort_buffer, read_buffer_size and record_rnd_buffer values?
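
For question 6, one possible rewrite might look like the following (a sketch of my own, not taken from the referenced posts); it flattens away the derived table, although with matches still a LONGTEXT column the GROUP BY presumably still needs a temporary table and a filesort:

SELECT `matches`, SUM(`tfidf`) AS total
FROM ClusterMatches
WHERE `cluster_index` IN (1,2,...,3000)
GROUP BY `matches`
ORDER BY total DESC
LIMIT 0, 10;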

Links


  • 1:stackoverflow.com/questions/658937/cache-re-use-a-subquery-in-mysql
  • 2:databasejournal.com/features/mysql/article.php/10897_1382791_4/Optimizing-MySQL-Queries-and-Indexes.htm
  • 3:debianhelp.co.uk/mysqlperformance.htm
  • 4:20bits.com/articles/10-tips-for-optimizing-mysql-queries-that-dont-suck/
  • 5:20bits.com/articles/10-tips-for-optimizing-mysql-queries-that-dont-suck/
  • 6:dev.mysql.com/doc/refman/5.0/en/explain.html
  • 7:dev.mysql.com/doc/refman/5.0/en/explain-output.html
  • 8:httpd.apache.org/docs/2.2/programs/ab.html
  • 9:mtop.sourceforge.net/
  • 10:dev.mysql.com/doc/refman/5.0/en/slow-query-log.html
  • 11:20bits.com/articles/10-tips-for-optimizing-mysql-queries-that-dont-suck/
  • 12:dev.mysql.com/doc/refman/5.0/en/query-cache-configuration.html
  • 13:dev.mysql.com/tech-resources/articles/mysql-query-cache.html
Answer

      Changing the table


      Based on the advice in this post on How to pick indexes for order by and group by queries, the table now looks like:

      CREATE TABLE ClusterMatches 
      (
          cluster_index INT UNSIGNED, 
          match_index INT UNSIGNED,
          id INT NOT NULL AUTO_INCREMENT,
          tfidf FLOAT,
          PRIMARY KEY (match_index,cluster_index,id,tfidf)
      );
      CREATE TABLE MatchLookup 
      (
          match_index INT UNSIGNED NOT NULL PRIMARY KEY,
          image_match TINYTEXT
      );
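
      To move the existing data into this layout, something along these lines should work (purely a sketch: the old LONGTEXT table is assumed to have been renamed to ClusterMatchesOld first, a name made up here, and each match string is assumed to fit within TINYTEXT):

      -- 1. Number the distinct match strings.
      SET @n := 0;
      INSERT INTO MatchLookup (match_index, image_match)
      SELECT (@n := @n + 1), m.matches
      FROM (SELECT DISTINCT matches FROM ClusterMatchesOld) AS m;

      -- 2. Rewrite each row with the integer match_index in place of the LONGTEXT value.
      INSERT INTO ClusterMatches (cluster_index, match_index, tfidf)
      SELECT o.cluster_index, l.match_index, o.tfidf
      FROM ClusterMatchesOld o
      JOIN MatchLookup l ON l.image_match = o.matches;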
      

      The query without sorting the results by the SUM(tfidf) looks like

      SELECT match_index, SUM(tfidf) FROM ClusterMatches 
      WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index LIMIT 10;

      Which eliminates using temporary and using filesort

      explain extended SELECT match_index, SUM(tfidf) FROM ClusterMatches 
      WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index LIMIT 10;
      +----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+
      | id | select_type | table                | type  | possible_keys | key     | key_len | ref  | rows  | Extra                    |
      +----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+
      |  1 | SIMPLE      | ClusterMatches       | range | PRIMARY       | PRIMARY | 4       | NULL | 14938 | Using where; Using index | 
      +----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+

      However if I add the ORDER BY SUM(tfidf) in

      SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches
      WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index 
      ORDER BY total DESC LIMIT 0,10;
      +-------------+--------------------+
      | match_index | total              |
      +-------------+--------------------+
      |         868 |   0.11126546561718 | 
      |        4182 | 0.0238558370620012 | 
      |        2162 | 0.0216601379215717 | 
      |        1406 | 0.0191618576645851 | 
      |        4239 | 0.0168981291353703 | 
      |        1437 | 0.0160425212234259 | 
      |        2599 | 0.0156466849148273 | 
      |         394 | 0.0155945559963584 | 
      |        3116 | 0.0151005545631051 | 
      |        4028 | 0.0149106932803988 | 
      +-------------+--------------------+
      10 rows in set (0.03 sec)

      The result is suitably fast at this scale BUT having the ORDER BY SUM(tfidf) means it uses temporary and filesort

      explain extended SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches 
      WHERE cluster_index IN (1,2,3 ... 3000) GROUP BY match_index 
      ORDER BY total DESC LIMIT 0,10;
      +----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
      | id | select_type | table                | type  | possible_keys | key     | key_len | ref  | rows  | Extra                                                     |
      +----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
      |  1 | SIMPLE      | ClusterMatches       | range | PRIMARY       | PRIMARY | 4       | NULL | 65369 | Using where; Using index; Using temporary; Using filesort | 
      +----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+

      I'm looking for a solution that doesn't use temporary or filesort, along the lines of

      SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches 
      WHERE cluster_index IN (1,2,3 ... 3000) GROUP BY cluster_index, match_index 
      HAVING total>0.01 ORDER BY cluster_index;

      where I don't need to hardcode a threshold for total. Any ideas?

