How to optimize painfully slow MySQL query that finds correlations


Question

I have a very slow (usually close to 60 seconds) MySQL query that tries to find correlations between how users voted on one poll and how they voted on all previous polls.

Basically, we gather the user IDs of everyone who voted for one particular option in a given poll.

Then we see how that subgroup voted on each previous poll, and compare those results to how EVERYONE (not just the subgroup) voted on that poll. The difference between the subgroup results and the total results is the deviation, and this query sorts by deviation to determine the strongest correlation.
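
For example (illustrative numbers only, not taken from the actual data): if 80% of the subgroup chose a given option in an earlier poll but only 55% of all voters did, that option's deviation is ABS(0.55 - 0.80) = 0.25, which would rank it near the top of the correlation results.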

The query is a bit of a mess:

(SELECT p_id as poll_id, o_id AS option_id, description, optCount AS option_count, subgroup_percent, total_percent, ABS(total_percent - subgroup_percent) AS deviation
FROM(
   SELECT poll_id AS p_id, 
       option_id AS o_id, 
       (SELECT description FROM `option` WHERE id = o_id) AS description,
       COUNT(*) AS optCount, 
       -- fraction of the subgroup's votes in this poll that went to this option
       (SELECT COUNT(*) FROM response INNER JOIN user_ids_122 ON response.user_id = user_ids_122.user_id WHERE option_id = o_id ) / 
       (SELECT COUNT(*) FROM response INNER JOIN user_ids_122 ON response.user_id = user_ids_122.user_id WHERE poll_id = p_id) AS subgroup_percent,
       -- fraction of ALL votes in this poll that went to this option
       (SELECT COUNT(*) FROM response WHERE option_id = o_id) / 
       (SELECT COUNT(*) FROM response WHERE poll_id = p_id) AS total_percent
   FROM response 
   INNER JOIN user_ids_122 ON response.user_id = user_ids_122.user_id 
   WHERE poll_id < '61'
   GROUP BY option_id DESC
   ) AS derived_table_122
)
ORDER BY deviation DESC, option_count DESC

Note that user_ids_122 is a previously created temporary table containing the IDs of all users who voted for option ID 122.
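
The question does not show how that table is created; a minimal sketch of one way to build it (hypothetical, assuming the response table layout shown further down) might be:

-- Hypothetical setup, not shown in the question: collect the subgroup's user IDs once
CREATE TEMPORARY TABLE user_ids_122 AS
    SELECT DISTINCT user_id FROM response WHERE option_id = 122;

-- An index on user_id speeds up the repeated joins against this table
ALTER TABLE user_ids_122 ADD INDEX (user_id);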

响应"表大约有65,000行,用户"表大约有7,000行,选项"表大约有130行.

The "response" table has about 65,000 rows, the "user" table has about 7,000 rows, and the "option" table has about 130 rows.

Update:

Here's the EXPLAIN table ...

id  select_type         table           type    possible_keys       key         key_len ref                                 rows    Extra
1   PRIMARY     <derived2>  ALL     NULL    NULL    NULL    NULL    121     Using filesort
2   DERIVED     user_ids_122    ALL     NULL    NULL    NULL    NULL    74  Using temporary; Using filesort
2   DERIVED     response    ref     poll_id,user_id     user_id     4   correlated.user_ids_122.user_id     780     Using where
7   DEPENDENT SUBQUERY  response    ref     poll_id     poll_id     4   func    7800    Using index
6   DEPENDENT SUBQUERY  response    ref     option_id   option_id   4   func    7800    Using index
5   DEPENDENT SUBQUERY  user_ids_122    ALL     NULL    NULL    NULL    NULL    74   
5   DEPENDENT SUBQUERY  response    ref     poll_id,user_id     poll_id     4   func    7800    Using where
4   DEPENDENT SUBQUERY  user_ids_122    ALL     NULL    NULL    NULL    NULL    74   
4   DEPENDENT SUBQUERY  response    ref     user_id,option_id   user_id     4   correlated.user_ids_122.user_id     780     Using where
3   DEPENDENT SUBQUERY  option  eq_ref  PRIMARY     PRIMARY     4   func    1 

Update 2:

响应"表中的每一行都看起来像这样:

Every row in the "response" table looks like this:

id (INT)   poll_id (INT)   user_id (INT)   option_id (INT)   created (DATETIME)
7          7               1               14                2011-03-17 09:25:10

选项"表中的每一行都看起来像这样:

Every row in the "option" table looks like this:

id (INT)   poll_id (INT)   text (TEXT)     description (TEXT)
14         7               No              people who dislike country music 

用户"表中的每一行都看起来像这样:

Every row in the "user" table looks like this:

id (INT)   email (TEXT)         created (DATETIME)
1          user@example.com     2011-02-15 11:16:03

Answer

3 things:

  • You are recomputing the same things over and over (in fact, everything depends only on a few parameters that are identical across many rows)
  • Aggregating in big chunks (a JOIN) is more efficient than in small chunks (subqueries); see the sketch after this list
  • MySQL is very slow with subqueries.
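
As a rough sketch of the second point, using the question's response and user_ids_122 tables (an illustration only, not the answer's final query):

-- One grouped join computes the subgroup's per-option vote counts in a single pass,
-- instead of re-counting them with a correlated subquery for every output row.
SELECT r.option_id, COUNT(*) AS subgroup_votes
FROM response AS r
INNER JOIN user_ids_122 AS u ON u.user_id = r.user_id
GROUP BY r.option_id;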

So, when you compute "vote counts by option_id" (which needs scanning the big table), and then you need to compute "vote counts by poll_id", well, do not start the big table again, just use the previous results !

You could do that with a ROLLUP.
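
A minimal illustration of the ROLLUP idea, written against the question's response table (a sketch, not the answer's final query):

-- One scan of response: per-(poll, option) counts, plus ROLLUP rows that give
-- per-poll totals (option_id IS NULL) and a grand total (poll_id IS NULL).
SELECT poll_id, option_id, COUNT(*) AS votes
FROM response
GROUP BY poll_id, option_id WITH ROLLUP;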

Here's a query that will do what you need, running on Postgres.

In order to make MySQL do this, you are going to need to replace all "WITH foo AS (SELECT...)" statements with temporary tables. That's easy. MySQL in-memory temp tables are fast, don't be afraid to use them, since that will allow you to reuse results from the previous steps and save a lot of computations.
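
For example, the uids CTE in the query below could become an in-memory temporary table along these lines (a sketch that reuses the answer's options/responses naming; adjust column types to your schema):

-- MySQL equivalent of the "uids" CTE: an in-memory temp table built once and reused
CREATE TEMPORARY TABLE uids ENGINE=MEMORY
    SELECT DISTINCT user_id
    FROM options
    JOIN responses USING (option_id)
    WHERE poll_id = 22;

-- Index it so the later joins on user_id stay fast
ALTER TABLE uids ADD INDEX (user_id);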

I've generated random-ish test data, seems to work. Executes in 0.3s...

WITH 
-- users of interest : target group
uids AS (
    SELECT DISTINCT user_id 
        FROM    options 
        JOIN    responses USING (option_id)
        WHERE   poll_id=22
    ),
-- votes of everyone and target group
votes AS (
    SELECT poll_id, option_id, sum(all_votes) AS all_votes, sum(target_votes) AS target_votes
        FROM (
            SELECT option_id, count(*) AS all_votes, count(uids.user_id) AS target_votes
                FROM        responses 
                LEFT JOIN   uids USING (user_id)
                GROUP BY option_id
        ) v
        JOIN    options     USING (option_id)
        GROUP BY poll_id, option_id
    ),
-- totals for all polls (reuse previous result)
totals AS (
    SELECT poll_id, sum(all_votes) AS all_votes, sum(target_votes) AS target_votes
        FROM votes
        GROUP BY poll_id
    ),
poll_options AS (
    SELECT poll_id, count(*) AS poll_option_count
        FROM options 
        GROUP BY poll_id
    )
-- reuse previous tables to get some stats
SELECT  *, ABS(total_percent - subgroup_percent) AS deviation
    FROM (
        SELECT
            poll_id,
            option_id,
            v.target_votes / v.all_votes AS subgroup_percent,
            t.target_votes / t.all_votes AS total_percent,
            poll_option_count
        FROM votes  v
        JOIN totals t           USING (poll_id)
        JOIN poll_options po    USING (poll_id)
    ) AS foo
    ORDER BY deviation DESC, poll_option_count DESC;

                                                                                  QUERY PLAN                                                                                
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=14910.46..14910.56 rows=40 width=144) (actual time=299.844..299.862 rows=200 loops=1)
   Sort Key: (abs(((t.target_votes / t.all_votes) - (v.target_votes / v.all_votes)))), po.poll_option_count
   Sort Method:  quicksort  Memory: 52kB
   CTE uids
     ->  HashAggregate  (cost=1801.43..1850.52 rows=4909 width=4) (actual time=3.935..4.793 rows=4860 loops=1)
           ->  Nested Loop  (cost=0.00..1789.16 rows=4909 width=4) (actual time=0.029..2.555 rows=4860 loops=1)
                 ->  Seq Scan on options  (cost=0.00..3.50 rows=5 width=4) (actual time=0.008..0.032 rows=5 loops=1)
                       Filter: (poll_id = 22)
                 ->  Index Scan using responses_option_id_key on responses  (cost=0.00..344.86 rows=982 width=8) (actual time=0.012..0.298 rows=972 loops=5)
                       Index Cond: (public.responses.option_id = public.options.option_id)
   CTE votes
     ->  HashAggregate  (cost=13029.43..13032.43 rows=200 width=24) (actual time=298.255..298.317 rows=200 loops=1)
           ->  Hash Join  (cost=13019.68..13027.43 rows=200 width=24) (actual time=297.953..298.103 rows=200 loops=1)
                 Hash Cond: (public.responses.option_id = public.options.option_id)
                 ->  HashAggregate  (cost=13014.18..13017.18 rows=200 width=8) (actual time=297.839..297.879 rows=200 loops=1)
                       ->  Merge Left Join  (cost=399.13..11541.43 rows=196366 width=8) (actual time=9.301..230.467 rows=196366 loops=1)
                             Merge Cond: (public.responses.user_id = uids.user_id)
                             ->  Index Scan using responses_pkey on responses  (cost=0.00..8585.75 rows=196366 width=8) (actual time=0.015..121.971 rows=196366 loops=1)
                             ->  Sort  (cost=399.13..411.40 rows=4909 width=4) (actual time=9.281..22.044 rows=137645 loops=1)
                                   Sort Key: uids.user_id
                                   Sort Method:  quicksort  Memory: 420kB
                                   ->  CTE Scan on uids  (cost=0.00..98.18 rows=4909 width=4) (actual time=3.937..6.549 rows=4860 loops=1)
                 ->  Hash  (cost=3.00..3.00 rows=200 width=8) (actual time=0.095..0.095 rows=200 loops=1)
                       ->  Seq Scan on options  (cost=0.00..3.00 rows=200 width=8) (actual time=0.007..0.043 rows=200 loops=1)
   CTE totals
     ->  HashAggregate  (cost=5.50..8.50 rows=200 width=68) (actual time=298.629..298.640 rows=40 loops=1)
           ->  CTE Scan on votes  (cost=0.00..4.00 rows=200 width=68) (actual time=298.257..298.425 rows=200 loops=1)
   CTE poll_options
     ->  HashAggregate  (cost=4.00..4.50 rows=40 width=4) (actual time=0.091..0.101 rows=40 loops=1)
           ->  Seq Scan on options  (cost=0.00..3.00 rows=200 width=4) (actual time=0.005..0.020 rows=200 loops=1)
   ->  Hash Join  (cost=6.95..13.45 rows=40 width=144) (actual time=298.994..299.554 rows=200 loops=1)
         Hash Cond: (t.poll_id = v.poll_id)
         ->  CTE Scan on totals t  (cost=0.00..4.00 rows=200 width=68) (actual time=298.632..298.669 rows=40 loops=1)
         ->  Hash  (cost=6.45..6.45 rows=40 width=84) (actual time=0.335..0.335 rows=200 loops=1)
               ->  Hash Join  (cost=1.30..6.45 rows=40 width=84) (actual time=0.140..0.263 rows=200 loops=1)
                     Hash Cond: (v.poll_id = po.poll_id)
                     ->  CTE Scan on votes v  (cost=0.00..4.00 rows=200 width=72) (actual time=0.001..0.030 rows=200 loops=1)
                     ->  Hash  (cost=0.80..0.80 rows=40 width=12) (actual time=0.130..0.130 rows=40 loops=1)
                           ->  CTE Scan on poll_options po  (cost=0.00..0.80 rows=40 width=12) (actual time=0.093..0.119 rows=40 loops=1)
 Total runtime: 300.132 ms
