Optimize MySQL Full outer join for massive amount of data


Question

We have the following MySQL tables (simplified to get straight to the point):

CREATE TABLE `MONTH_RAW_EVENTS` (
  `idEvent` int(11) unsigned NOT NULL,
  `city` varchar(45) NOT NULL,
  `country` varchar(45) NOT NULL,
  `ts` datetime NOT NULL,
  `idClient` varchar(45) NOT NULL,
  `event_category` varchar(45) NOT NULL,
  -- ... bunch of other fields
  PRIMARY KEY (`idEvent`),
  KEY `idx_city` (`city`),
  KEY `idx_country` (`country`),
  KEY `idClient` (`idClient`)
) ENGINE=InnoDB;

CREATE TABLE `compilation_table` (
  `idClient` int(11) unsigned DEFAULT NULL,
  `city` varchar(200) DEFAULT NULL,
  `month` int(2) DEFAULT NULL,
  `year` int(4) DEFAULT NULL,
  `events_profile` int(10) unsigned NOT NULL DEFAULT '0',
  `events_others` int(10) unsigned NOT NULL DEFAULT '0',
  `events_total` int(10) unsigned NOT NULL DEFAULT '0',
  KEY `idx_month` (`month`),
  KEY `idx_year` (`year`),
  KEY `idx_idClient` (`idClient`),
  KEY `idx_city` (`city`)
) ENGINE=InnoDB;

MONTH_RAW_EVENTS contains almost 20M rows of actions performed by users on a website; it is almost 4 GB in size.

compilation_table holds a per-month summary by client and city; we use it to display stats on a website in real time.

We process the statistics (from the first table into the second) once per month, and we're trying to optimize the query that performs this operation (until now we've been processing everything in PHP, which takes a very long time).

Here's the query we came up with. It seems to do the job on small subsets of data; the problem is that it takes more than 6 hours to process the full data set:

INSERT INTO compilation_table (idClient,city,month,year,events_profile,events_others)

    SELECT  IFNULL(OTHERS.idClient,CLIPROFILE.idClient) as idClient,
            IF(IFNULL(OTHERS.city,CLIPROFILE.city)='','Others',IFNULL(OTHERS.city,CLIPROFILE.city)) as city,
            01,2014,
            IFNULL(CLIPROFILE.cnt,0) as events_profile,
            IFNULL(OTHERS.cnt,0) as events_others
    FROM
    (
        SELECT idClient,CONCAT(city,', ',country) as city,count(*) as cnt 
        FROM `MONTH_RAW_EVENTS` WHERE `ts`>'2014-01-01 00:00:00' AND `ts`<='2014-01-31 23:59:59'
        AND `event_category`!='CLIENT PROFILE'
        GROUP BY idClient,city
    ) as OTHERS
 LEFT JOIN 
    (
        SELECT idClient,CONCAT(city,', ',country) as city,count(*) as cnt 
        FROM `MONTH_RAW_EVENTS` WHERE `ts`>'2014-01-01 00:00:00' AND `ts`<='2014-01-31 23:59:59'
        AND `event_category`='CLIENT PROFILE'
        GROUP BY idClient,city
    ) as CLIPROFILE 
    ON CLIPROFILE.city=OTHERS.city and CLIPROFILE.idClient=OTHERS.idClient

 UNION

    SELECT  IFNULL(OTHERS.idClient,CLIPROFILE.idClient) as idClient,
            IF(IFNULL(OTHERS.city,CLIPROFILE.city)='','Others',IFNULL(OTHERS.city,CLIPROFILE.city)) as city,
            01,2014,
            IFNULL(CLIPROFILE.cnt,0) as events_profile,
            IFNULL(OTHERS.cnt,0) as events_others           
    FROM
    (
        SELECT idClient,CONCAT(city,', ',country) as city,count(*) as cnt 
        FROM `MONTH_RAW_EVENTS` WHERE `ts`>'2014-01-01 00:00:00' AND `ts`<='2014-01-31 23:59:59'
        AND `event_category`!='CLIENT PROFILE'
        GROUP BY idClient,city
    ) as OTHERS
 RIGHT JOIN 
    (
        SELECT idClient,CONCAT(city,', ',country) as city,count(*) as cnt 
        FROM `MONTH_RAW_EVENTS` WHERE `ts`>'2014-01-01 00:00:00' AND `ts`<='2014-01-31 23:59:59'
        AND `event_category`='CLIENT PROFILE'
        GROUP BY idClient,city
    ) as CLIPROFILE 
    ON CLIPROFILE.city=OTHERS.city and CLIPROFILE.idClient=OTHERS.idClient

What we're trying to do is a FULL OUTER JOIN in MySQL, so the basic pattern of the query follows the one proposed here.

How can we optimize the query? We've been trying different indexes and switching things around, but after 8 hours it still hadn't finished running.

The MySQL server is a dedicated Percona MySQL 5.5 machine with 2 CPUs, 2 GB of RAM, and an SSD disk. We optimized the server's configuration using Percona tools.

Any help would be greatly appreciated.

Thanks

Recommended answer

You're doing a UNION, which results in DISTINCT processing: MySQL has to de-duplicate the combined result set.
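To make the difference concrete, here is a minimal illustration that needs no tables at all; it only shows the duplicate-removal behaviour, not its cost on a large result set:

```sql
-- UNION removes duplicate rows from the combined result: returns one row.
SELECT 1 AS n UNION SELECT 1;

-- UNION ALL keeps every row from both branches: returns two rows.
SELECT 1 AS n UNION ALL SELECT 1;
```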

It's usually better to rewrite a full join as a left join plus the non-matching rows of a right join, combined with UNION ALL (this works if it's a proper 1:n join):

OTHERS LEFT JOIN CLIPROFILE 
ON CLIPROFILE.city=OTHERS.city and CLIPROFILE.idClient=OTHERS.idClient
union all
OTHERS RIGHT JOIN CLIPROFILE 
ON CLIPROFILE.city=OTHERS.city and CLIPROFILE.idClient=OTHERS.idClient
WHERE OTHERS.idClient IS NULL 
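As a small runnable sketch of this pattern, the following uses two toy tables standing in for the derived tables. The table names `demo_others` and `demo_cliprofile` and their rows are made up for illustration; they are regular tables because MySQL cannot refer to a TEMPORARY table more than once in the same query.

```sql
-- Hypothetical stand-ins for the OTHERS and CLIPROFILE derived tables.
CREATE TABLE demo_others     (idClient INT, city VARCHAR(100), cnt INT);
CREATE TABLE demo_cliprofile (idClient INT, city VARCHAR(100), cnt INT);

INSERT INTO demo_others     VALUES (1, 'Paris, FR', 5), (2, 'Rome, IT', 3);
INSERT INTO demo_cliprofile VALUES (1, 'Paris, FR', 2), (3, 'Oslo, NO', 7);

-- Branch 1: every row of demo_others, with its match if there is one.
SELECT o.idClient, o.city,
       IFNULL(p.cnt, 0) AS events_profile,
       o.cnt            AS events_others
FROM demo_others AS o
LEFT JOIN demo_cliprofile AS p
       ON p.idClient = o.idClient AND p.city = o.city
UNION ALL
-- Branch 2: only the demo_cliprofile rows that found no partner,
-- so no row can appear in both branches and no DISTINCT is needed.
SELECT p.idClient, p.city, p.cnt, 0
FROM demo_others AS o
RIGHT JOIN demo_cliprofile AS p
        ON p.idClient = o.idClient AND p.city = o.city
WHERE o.idClient IS NULL;

DROP TABLE demo_others, demo_cliprofile;
```

With these rows the result should contain three rows (in no guaranteed order): client 1 with both counts, client 2 with only `events_others`, and client 3 with only `events_profile`.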

Additionally, you might materialize the results of the derived tables in temp tables before joining them, so that each aggregation is computed only once (I don't know whether MySQL's optimizer is smart enough to do that automatically).
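One way this materialization could look is sketched below. The names `tmp_others` and `tmp_cliprofile` and the index name are assumptions, and the filters are copied from the question's query. Because MySQL cannot refer to a TEMPORARY table more than once in the same query, the two halves of the full join are written here as two separate INSERT statements instead of one UNION ALL.

```sql
-- Each aggregation runs once; the index is meant to help the later join.
CREATE TEMPORARY TABLE tmp_others (KEY idx_client_city (idClient, city))
SELECT idClient, CONCAT(city, ', ', country) AS city, COUNT(*) AS cnt
FROM MONTH_RAW_EVENTS
WHERE ts > '2014-01-01 00:00:00' AND ts <= '2014-01-31 23:59:59'
  AND event_category != 'CLIENT PROFILE'
GROUP BY idClient, city;

CREATE TEMPORARY TABLE tmp_cliprofile (KEY idx_client_city (idClient, city))
SELECT idClient, CONCAT(city, ', ', country) AS city, COUNT(*) AS cnt
FROM MONTH_RAW_EVENTS
WHERE ts > '2014-01-01 00:00:00' AND ts <= '2014-01-31 23:59:59'
  AND event_category = 'CLIENT PROFILE'
GROUP BY idClient, city;

-- Half 1: all rows of tmp_others, matched or not.
INSERT INTO compilation_table (idClient, city, month, year, events_profile, events_others)
SELECT o.idClient, IF(o.city = '', 'Others', o.city), 1, 2014,
       IFNULL(p.cnt, 0), o.cnt
FROM tmp_others AS o
LEFT JOIN tmp_cliprofile AS p
       ON p.idClient = o.idClient AND p.city = o.city;

-- Half 2: rows of tmp_cliprofile that have no partner in tmp_others.
INSERT INTO compilation_table (idClient, city, month, year, events_profile, events_others)
SELECT p.idClient, IF(p.city = '', 'Others', p.city), 1, 2014,
       p.cnt, 0
FROM tmp_cliprofile AS p
LEFT JOIN tmp_others AS o
       ON o.idClient = p.idClient AND o.city = p.city
WHERE o.idClient IS NULL;

DROP TEMPORARY TABLE tmp_others, tmp_cliprofile;
```

Whether this is actually faster on the 20M-row table would need to be measured; the sketch only shows the shape of the approach.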

Plus, it might be more efficient to group by and join on city and country as separate columns, and do the CONCAT(city,', ',country) as city in the outer step.
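A hedged sketch of that idea, shown for the left-join branch only (the aliases `o` and `p` are assumptions; the non-matching-rows branch would change in the same way):

```sql
-- Group and join on the raw columns; build the display string once per output row.
SELECT o.idClient,
       CONCAT(o.city, ', ', o.country) AS city,
       IFNULL(p.cnt, 0) AS events_profile,
       o.cnt            AS events_others
FROM (
    SELECT idClient, city, country, COUNT(*) AS cnt
    FROM MONTH_RAW_EVENTS
    WHERE ts > '2014-01-01 00:00:00' AND ts <= '2014-01-31 23:59:59'
      AND event_category != 'CLIENT PROFILE'
    GROUP BY idClient, city, country
) AS o
LEFT JOIN (
    SELECT idClient, city, country, COUNT(*) AS cnt
    FROM MONTH_RAW_EVENTS
    WHERE ts > '2014-01-01 00:00:00' AND ts <= '2014-01-31 23:59:59'
      AND event_category = 'CLIENT PROFILE'
    GROUP BY idClient, city, country
) AS p
  ON  p.idClient = o.idClient
  AND p.city     = o.city
  AND p.country  = o.country;
```

Grouping by `country` as well also keeps same-named cities in different countries apart, which a GROUP BY on `idClient, city` alone would not do.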
