Optimising MySQL queries across hierarchical data
Question
I have a fairly stable directed graph of order ~100k vertices and size ~1k edges. It is two-dimensional insofar as its vertices can be identified by a pair of integers (x, y) (of cardinality ~100 x ~1000) and all edges are strictly increasing in x.
There is furthermore a dictionary of ~1k (key, val) pairs associated with each vertex.
I am currently storing the graph in a MySQL database across three (InnoDB) tables: a table of vertices (which I don't think is relevant to my question, so I have omitted both it and the foreign key constraints that refer to it from my extracts below); a table which holds the dictionaries; and a 'closure table' of connected vertices as described so eloquently by Bill Karwin.
The table of vertex dictionaries is defined as:
CREATE TABLE `VertexDictionary` (
`x` smallint(6) unsigned NOT NULL,
`y` smallint(6) unsigned NOT NULL,
`key` varchar(50) NOT NULL DEFAULT '',
`val` smallint(1) DEFAULT NULL,
PRIMARY KEY (`x`, `y` , `key`),
KEY `dict` (`x`, `key`, `val`)
);
and the closure table of connected vertices as:
CREATE TABLE `ConnectedVertices` (
`tail_x` smallint(6) unsigned NOT NULL,
`tail_y` smallint(6) unsigned NOT NULL,
`head_x` smallint(6) unsigned NOT NULL,
`head_y` smallint(6) unsigned NOT NULL,
PRIMARY KEY (`tail_x`, `tail_y`, `head_x`),
KEY `reverse` (`head_x`, `head_y`, `tail_x`),
KEY `fx` (`tail_x`, `head_x`),
KEY `rx` (`head_x`, `tail_x`)
);
There is also a dictionary of (x, key) pairs such that for each such pair, all vertices identified with that x have within their dictionaries a value for that key. This dictionary is stored in a fourth table:
CREATE TABLE `SpecialKeys` (
`x` smallint(6) unsigned NOT NULL,
`key` varchar(50) NOT NULL DEFAULT '',
PRIMARY KEY (`x`),
KEY `xkey` (`x`, `key`)
);
I often wish to extract the set of keys used in the dictionaries of all vertices having a particular x=X, together with the associated value of any SpecialKeys connected to the left:
SELECT DISTINCT
`v`.`key`,
`u`.`val`
FROM
`ConnectedVertices` AS `c`
JOIN `VertexDictionary` AS `u` ON (`u`.`x`, `u`.`y` ) = (`c`.`tail_x`, `c`.`tail_y`)
JOIN `VertexDictionary` AS `v` ON (`v`.`x`, `v`.`y` ) = (`c`.`head_x`, `c`.`head_y`)
JOIN `SpecialKeys` AS `k` ON (`k`.`x`, `k`.`key`) = (`u`.`x`, `u`.`key`)
WHERE
`v`.`x` = X
;
for which the EXPLAIN output is:
id  select_type  table  type    possible_keys          key      key_len  ref                               rows  Extra
1   SIMPLE       k      index   PRIMARY,xkey           xkey     154      NULL                              40    Using index; Using temporary
1   SIMPLE       c      ref     PRIMARY,reverse,fx,rx  PRIMARY  2        db.k.x                            1     Using where
1   SIMPLE       v      ref     PRIMARY,dict           PRIMARY  4        const,db.c.head_y                 136   Using index
1   SIMPLE       u      eq_ref  PRIMARY,dict           PRIMARY  156      db.c.tail_x,db.c.tail_y,db.k.key  1     Using where
But this query takes ~10s to complete. I've been banging my head against a brick wall trying to improve matters, but to no avail.
Can the query be improved, or should I consider a different data structure? Extremely grateful for your thoughts!
UPDATE
I'm still getting nowhere with this, although I did rebuild the tables and found the EXPLAIN output to be slightly different (as now shown above, the number of rows fetched from v had increased from 1 to 136!); the query is still taking ~10s to execute.
I really don't understand what's going on here. Queries to obtain all (x, y, SpecialValue) and all (x, y, key) tuples are both very fast (~30ms and ~150ms respectively), yet essentially joining the two takes over fifty times longer than their combined time. How can I improve the time taken to perform that join?
The output of SHOW VARIABLES LIKE '%innodb%'; is as follows:
Variable_name Value
------------------------------------------------------------
have_innodb YES
ignore_builtin_innodb ON
innodb_adaptive_flushing ON
innodb_adaptive_hash_index ON
innodb_additional_mem_pool_size 2097152
innodb_autoextend_increment 8
innodb_autoinc_lock_mode 1
innodb_buffer_pool_size 1179648000
innodb_change_buffering inserts
innodb_checksums ON
innodb_commit_concurrency 0
innodb_concurrency_tickets 500
innodb_data_file_path ibdata1:10M:autoextend
innodb_data_home_dir /rdsdbdata/db/innodb
innodb_doublewrite ON
innodb_fast_shutdown 1
innodb_file_format Antelope
innodb_file_format_check Barracuda
innodb_file_per_table ON
innodb_flush_log_at_trx_commit 1
innodb_flush_method O_DIRECT
innodb_force_recovery 0
innodb_io_capacity 200
innodb_lock_wait_timeout 50
innodb_locks_unsafe_for_binlog OFF
innodb_log_buffer_size 8388608
innodb_log_file_size 134217728
innodb_log_files_in_group 2
innodb_log_group_home_dir /rdsdbdata/log/innodb
innodb_max_dirty_pages_pct 75
innodb_max_purge_lag 0
innodb_mirrored_log_groups 1
innodb_old_blocks_pct 37
innodb_old_blocks_time 0
innodb_open_files 300
innodb_read_ahead_threshold 56
innodb_read_io_threads 4
innodb_replication_delay 0
innodb_rollback_on_timeout OFF
innodb_spin_wait_delay 6
innodb_stats_method nulls_equal
innodb_stats_on_metadata ON
innodb_stats_sample_pages 8
innodb_strict_mode OFF
innodb_support_xa ON
innodb_sync_spin_loops 30
innodb_table_locks ON
innodb_thread_concurrency 0
innodb_thread_sleep_delay 10000
innodb_use_sys_malloc ON
innodb_version 1.0.16
innodb_write_io_threads 4
Recommended answer
I haven't spent time testing this, and the example you provided is incomplete, but you should definitely try reordering the joined tables. The EXPLAIN output provides some information: as a heuristic, ordering by key_len should be fastest. I believe the first table to be filtered on should be listed last, in case the optimizer is not able to figure that out.
So, let's say the 'c, v, k, u' order is the best.
SELECT DISTINCT
`v`.`key`,
`u`.`val`
FROM
`VertexDictionary` AS `u`
JOIN `SpecialKeys` AS `k` ON (`k`.`x`, `k`.`key`) = (`u`.`x`, `u`.`key`)
JOIN `VertexDictionary` AS `v`
JOIN `ConnectedVertices` AS `c` ON (`u`.`x`, `u`.`y` ) = (`c`.`tail_x`, `c`.`tail_y`)
AND (`v`.`x`, `v`.`y` ) = (`c`.`head_x`, `c`.`head_y`)
WHERE
`v`.`x` = X
;
The 'rows' column would suggest the 'c/u, k, v' order, but that depends on the data:
SELECT DISTINCT
`v`.`key`,
`u`.`val`
FROM
`VertexDictionary` AS `u`
JOIN `VertexDictionary` AS `v`
JOIN `SpecialKeys` AS `k` ON (`k`.`x`, `k`.`key`) = (`u`.`x`, `u`.`key`)
JOIN `ConnectedVertices` AS `c` ON (`u`.`x`, `u`.`y` ) = (`c`.`tail_x`, `c`.`tail_y`)
AND (`v`.`x`, `v`.`y` ) = (`c`.`head_x`, `c`.`head_y`)
WHERE
`v`.`x` = X
;
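MySQL's optimizer is generally free to reorder inner joins, so rewriting the FROM clause alone may leave the plan unchanged. To actually test a particular order, the STRAIGHT_JOIN modifier makes MySQL join the tables in the order in which they are listed. The following is a minimal, untested sketch for the 'c, v, k, u' order; filtering on `c`.`head_x` instead of `v`.`x` is my own substitution (the two are equal under the join condition), and X is the same placeholder as above:

```sql
-- Untested sketch: STRAIGHT_JOIN forces the join order c, v, k, u as written.
-- X is the same placeholder used in the queries above.
SELECT DISTINCT STRAIGHT_JOIN
  `v`.`key`,
  `u`.`val`
FROM
       `ConnectedVertices` AS `c`
  JOIN `VertexDictionary`  AS `v` ON (`v`.`x`, `v`.`y`) = (`c`.`head_x`, `c`.`head_y`)
  JOIN `SpecialKeys`       AS `k` ON `k`.`x` = `c`.`tail_x`
  JOIN `VertexDictionary`  AS `u` ON (`u`.`x`, `u`.`y`, `u`.`key`) = (`c`.`tail_x`, `c`.`tail_y`, `k`.`key`)
WHERE
  `c`.`head_x` = X
;
```

Prefixing this statement with EXPLAIN and comparing the table order against the original output shows whether the forced order took effect and whether the row estimates improve.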
Hope this helps.
UPDATE (avoiding the varchar join):
SELECT DISTINCT
`v`.`key`,
`u`.`val`
FROM
`ConnectedVertices` AS `c`
JOIN `VertexDictionary` AS `u` ON (`u`.`x`, `u`.`y` ) = (`c`.`tail_x`, `c`.`tail_y`)
JOIN `VertexDictionary` AS `v` ON (`v`.`x`, `v`.`y` ) = (`c`.`head_x`, `c`.`head_y`)
WHERE
(`u`.`x`, `u`.`key`) IN (SELECT `k`.`x`, `k`.`key` FROM `SpecialKeys` AS `k`)
AND
`v`.`x` = X
;
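Another variant worth trying, given that the question reports the (x, y, SpecialValue) lookup to be fast on its own, is to compute the special values first in a derived table and only then fan out to the keys of the head vertices. This is a sketch under assumptions, not something tested against the asker's data: the alias `s` is hypothetical, and whether it beats the queries above depends on how the MySQL version in use materialises derived tables.

```sql
-- Untested sketch: resolve the special values per head vertex first (derived table s),
-- then join to the dictionary keys of the vertices at x = X.
-- X is the same placeholder used in the queries above; the alias s is hypothetical.
SELECT DISTINCT
  `v`.`key`,
  `s`.`val`
FROM
  (
    SELECT DISTINCT
      `c`.`head_y`,
      `u`.`val`
    FROM
           `ConnectedVertices` AS `c`
      JOIN `SpecialKeys`       AS `k` ON `k`.`x` = `c`.`tail_x`
      JOIN `VertexDictionary`  AS `u` ON (`u`.`x`, `u`.`y`, `u`.`key`) = (`c`.`tail_x`, `c`.`tail_y`, `k`.`key`)
    WHERE
      `c`.`head_x` = X
  ) AS `s`
  JOIN `VertexDictionary` AS `v` ON `v`.`x` = X AND `v`.`y` = `s`.`head_y`
;
```

The inner DISTINCT shrinks the intermediate result before it is multiplied by the keys of each head vertex, so the varchar comparison against `SpecialKeys` is done once per connected pair rather than once per (pair, key) combination.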