Partition a very large INNER JOIN SQL query

Views: 217. Published: 2018/6/1 12:39:56. Tags: mysql, hadoop, join, bigdata, spring-xd

Problem description

The SQL query is a fairly standard inner join. For example, comparing n tables to see which customerId values exist in all n tables would be a basic WHERE ... AND type of query.

The problem is that the tables each hold more than 10 million records. The database is denormalized, and normalization is not an option. The query either takes too long to complete or never completes.

I'm not sure if it's relevant, but we are using Spring XD job modules for other types of queries.

I'm not sure how to partition this sort of job so that it can run in parallel and take less time, and so that if a step or subsection fails it can continue from where it left off.

Other posts about similar problems suggest alternatives to the database engine, such as implementing a loop join in code or using MapReduce or Hadoop. Having never used either, I'm unsure whether they are worth looking into for this use case.

What is the standard approach to this sort of operation? I'd expect it to be fairly common. I might be using the wrong search terms, because I haven't come across any stock-standard solutions or clear directions.

The rather cryptic original requirement was:

Compare party_id column in the three very large tables to identify the customer available in three table i.e if it is AND operation between three. SAMPLE1.PARTY_ID AND SAMPLE2.PARTY_ID AND SAMPLE3.PARTY_ID

If the operation is OR then pick all the customers available in the three tables. SAMPLE1.PARTY_ID OR SAMPLE2.PARTY_ID OR SAMPLE3.PARTY_ID

AND / OR are used between tables then performed the comparison as required. SAMPLE1.PARTY_ID AND SAMPLE2.PARTY_ID OR SAMPLE3.PARTY_ID

I set up four test tables, each with this definition:
CREATE TABLE `TABLE1` (
  `CREATED` datetime DEFAULT NULL,
  `PARTY_ID` varchar(45) NOT NULL,
  `GROUP_ID` varchar(45) NOT NULL,
  `SEQUENCE_ID` int(11) NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`SEQUENCE_ID`)
) ENGINE=InnoDB AUTO_INCREMENT=978536 DEFAULT CHARSET=latin1;
I then added 1,000,000 records to each table, using random numbers in a range that should produce join matches.

I used the following test query:
SELECT
  `TABLE1`.`PARTY_ID` AS `pi1`,
  `TABLE2`.`PARTY_ID` AS `pi2`,
  `TABLE3`.`PARTY_ID` AS `pi3`,
  `TABLE4`.`PARTY_ID` AS `pi4`
FROM
  `devt1`.`TABLE2` AS `TABLE2`,
  `devt1`.`TABLE1` AS `TABLE1`,
  `devt1`.`TABLE3` AS `TABLE3`,
  `devt1`.`TABLE4` AS `TABLE4`
WHERE
  `TABLE2`.`PARTY_ID` = `TABLE1`.`PARTY_ID`
  AND `TABLE3`.`PARTY_ID` = `TABLE2`.`PARTY_ID`
  AND `TABLE4`.`PARTY_ID` = `TABLE3`.`PARTY_ID`
The query is supposed to complete in under 10 minutes, and for tables 10 times larger. My test query has been running for 15 minutes and still hasn't completed.

Solution
The following may perform better than the existing join-based query:
select party_id
from (
  select distinct party_id from SAMPLE1
  union all
  select distinct party_id from SAMPLE2
  union all
  select distinct party_id from SAMPLE3
) as ilv
group by party_id
having count(*) = 3
Amend the count(*) condition to match the number of tables being queried.
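
As a small illustration of the query above, the following sketch runs it against three tiny in-memory tables using Python's built-in sqlite3 module. SQLite stands in for MySQL here only so that the example is self-contained. The sample rows and the helper names `ids_in_all_tables` and `build_sample_db` are assumptions made for illustration and are not part of the original question or answer.

```python
import sqlite3


def ids_in_all_tables(conn, tables):
    """Return party_id values present in every table listed in `tables`.

    Table names are interpolated into the SQL text, so they must come
    from trusted code, never from user input.
    """
    # One de-duplicated SELECT per table, stacked with UNION ALL.
    union = " UNION ALL ".join(
        f"SELECT DISTINCT party_id FROM {name}" for name in tables
    )
    sql = (
        f"SELECT party_id FROM ({union}) AS ilv "
        "GROUP BY party_id HAVING COUNT(*) = ? "
        "ORDER BY party_id"
    )
    # The HAVING threshold equals the number of tables being compared.
    return [row[0] for row in conn.execute(sql, (len(tables),))]


def build_sample_db():
    """Create three small hypothetical tables in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    sample_rows = {
        "SAMPLE1": ["A", "B", "B", "C"],  # "B" is duplicated on purpose
        "SAMPLE2": ["B", "C", "D"],
        "SAMPLE3": ["C", "B", "E"],
    }
    for name, ids in sample_rows.items():
        conn.execute(f"CREATE TABLE {name} (party_id TEXT NOT NULL)")
        conn.executemany(f"INSERT INTO {name} VALUES (?)", [(i,) for i in ids])
    return conn


if __name__ == "__main__":
    conn = build_sample_db()
    print(ids_in_all_tables(conn, ["SAMPLE1", "SAMPLE2", "SAMPLE3"]))
```

With these sample rows the call returns B and C. The duplicate "B" in SAMPLE1 is counted once because each branch uses SELECT DISTINCT, which is what lets COUNT(*) act as a count of tables rather than a count of rows.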

If you want to return party_id values that are present in any table rather than all, then omit the final having clause.
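
The mixed AND / OR case from the original requirement could be handled with the same union by tagging each row with the table it came from and testing those tags in the HAVING clause. The sketch below is one possible way to do that, again using Python's sqlite3 module in place of MySQL. The flag columns `in1`, `in2`, `in3`, the helper names, the sample rows, and the reading of the requirement as (SAMPLE1 AND SAMPLE2) OR SAMPLE3 are all assumptions made for illustration.

```python
import sqlite3

# Each branch tags its rows with a 0/1 flag per source table (hypothetical
# column names in1, in2, in3).
TAGGED_UNION = """
    SELECT DISTINCT party_id, 1 AS in1, 0 AS in2, 0 AS in3 FROM SAMPLE1
    UNION ALL
    SELECT DISTINCT party_id, 0, 1, 0 FROM SAMPLE2
    UNION ALL
    SELECT DISTINCT party_id, 0, 0, 1 FROM SAMPLE3
"""


def select_ids(conn, having=None):
    """Group the tagged union by party_id, optionally filtering with HAVING.

    `having` is interpolated into the SQL text, so it must be trusted input.
    """
    sql = f"SELECT party_id FROM ({TAGGED_UNION}) AS ilv GROUP BY party_id"
    if having:
        sql += f" HAVING {having}"
    sql += " ORDER BY party_id"
    return [row[0] for row in conn.execute(sql)]


def make_db():
    """Create three small hypothetical tables in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    rows = {"SAMPLE1": "ABC", "SAMPLE2": "BCD", "SAMPLE3": "CE"}
    for name, ids in rows.items():
        conn.execute(f"CREATE TABLE {name} (party_id TEXT NOT NULL)")
        conn.executemany(f"INSERT INTO {name} VALUES (?)", [(i,) for i in ids])
    return conn


if __name__ == "__main__":
    conn = make_db()
    # OR across all tables: no HAVING clause, GROUP BY removes duplicates.
    print(select_ids(conn))
    # (SAMPLE1 AND SAMPLE2) OR SAMPLE3, expressed on the per-table flags.
    print(select_ids(conn, "(MAX(in1) = 1 AND MAX(in2) = 1) OR MAX(in3) = 1"))
```

With these sample rows the first call returns A, B, C, D, E and the second returns B, C, E. MAX(inN) = 1 is true for a party_id exactly when at least one of its rows came from table N, so any combination of AND and OR across tables becomes an ordinary boolean expression in HAVING.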
