SQL 父/子递归调用还是联合? [英] SQL Parent/Child recursive call or union?

查看:32
本文介绍了SQL 父/子递归调用还是联合?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我似乎找不到相关的例子.

我正在尝试返回一个表的子集,对于该表中的每一行,我想检查它有多少个子集,并将该数字作为结果集的一部分返回.

父表列:PK_ID, Column1, Column2, FK1

对于结果集中的每个 FK1,从 child_table 中选择 count(*).

最终结果集

3, col1text, col2text, 1(child)
5, col1texta, col2texta, 2(child)
6, col1textb, col2textb, 0(child)
9, col1textc, col2textc, 4(child)

我正在努力寻找在另一个查询中引用结果集中的列的最佳方法,然后再次将它们连接在一起.使用 T-sql

解决方案

好吧,显然,基于对另一个答案的支持,这需要进一步解释.示例(使用 MySQL 完成,因为我很方便,但该原则对任何 SQL 方言都是通用的):

创建表废话(ID INT 主键,SomeText VARCHAR(30),家长 ID INT)INSERT INTO Blah VALUES (1, 'One', 0);INSERT INTO Blah VALUES (2, 'Two', 0);插入废话值 (3, '三', 1);INSERT INTO Blah VALUES (4, 'Four', 1);INSERT INTO Blah VALUES (5, 'Five', 4);

左连接版本:

SELECT a.ID, a.SomeText, COUNT(1)FROM 废话 aJOIN Blah b ON a.ID= b.ParentIDGROUP BY a.ID, a.SomeText

错了.忽略没有孩子的情况.

左外连接:

SELECT a.ID, a.SomeText, COUNT(1)FROM 废话 aLEFT OUTER JOIN Blah b ON a.ID= b.ParentIDGROUP BY a.ID, a.SomeText

错误,原因有点微妙.COUNT(1) 计算 NULL 行而 COUNT(b.ID) 不会.所以以上是错误的,但这是正确的:

SELECT a.ID, a.SomeText, COUNT(b.ID)FROM 废话 aLEFT OUTER JOIN Blah b ON a.ID= b.ParentIDGROUP BY a.ID, a.SomeText

相关子查询:

SELECT ID, SomeText, (SELECT COUNT(1) FROM Blah WHERE ParentID= a.ID) ChildCountFROM 废话 a

也正确.

好的,那么使用哪个?计划只能告诉你这么多.子查询与left-joins 是一个旧的,没有基准测试就没有明确的答案.所以我们需要一些数据:

\n";mysql_connect('localhost', 'scratch', 'scratch');如果(mysql_error()){回声 mysql_error();出口();}mysql_select_db('scratch');如果(mysql_error()){回声 mysql_error();出口();}$count = 0;$limit = 1000000;$this_level = array(0);$next_level = array();而 ($count <$limit) {foreach ($this_level 作为 $parent) {$child_count = rand(0, 3);for ($i=0; $i<$child_count; $i++) {$count++;query("INSERT INTO Blah (ID, SomeText, ParentID) VALUES ($count, 'Text $count', $parent)");$next_level[] = $count;}}$this_level = $next_level;$next_level = array();}$stop = microtime(true);$duration = $stop - $start;$inserttime = $duration/$count;echo "已添加 $count 个用户.\n";echo "程序运行了 $duration 秒.\n";echo "插入时间 $inserttime 秒.\n";echo "</pre>\n";功能查询($查询){mysql_query($query);如果(mysql_error()){回声 mysql_error();出口();}}?>

我在这次运行期间内存不足 (32M),所以最终只有 876,109 条记录,但嘿,它会做的.后来,当我测试 Oracle 和 SQL Server 时,我将完全相同的数据集导入到 Oracle XE 和 SQL Server Express 2005 中.

现在另一张海报提出了我在查询周围使用计数包装器的问题.他正确地指出在这种情况下优化器可能不会执行子查询.MySQL 似乎没有那么聪明.甲骨文是.SQL Server 似乎也是如此.

所以我将为每个数据库查询组合引用两个数字:第一个包含在 SELECT COUNT(1) FROM ( ... ) 中,第二个是原始数据.

设置:

  • MySQL 5.0 使用 PremiumSoft Navicat(LIMIT 10000 in query);
  • 使用 Microsoft SQL Server Management Studio Express 的 SQL Server Express 2005;
  • 使用 PL/SQL Developer 7 的 Oracle XE(限制为 10,000 行).

左外连接:

SELECT a.ID, a.SomeText, COUNT(b.ID)FROM 废话 aLEFT OUTER JOIN Blah b ON a.ID= b.ParentIDGROUP BY a.ID, a.SomeText

  • MySQL: 5.0:51.469s/49.907s
  • SQL Server: 0(1)/9s(2)
  • Oracle XE:1.297 秒/2.656 秒

(1) 几乎瞬时(确认不同的执行路径)
(2) 令人印象深刻的是它正在返回所有行,而不是 10,000

只是去展示一个真实数据库的价值.此外,删除 SomeText 字段对 MySQL 的性能有重大影响.此外,10000 的限制与 MySQL 没有限制(将性能提高 4-5 倍)之间也没有太大区别.Oracle 之所以拥有它,只是因为 PL/SQL Developer 在达到 100M 内存使用量时就吐了.

相关子查询:

SELECT ID, SomeText, (SELECT COUNT(1) FROM Blah WHERE ParentID= a.ID) ChildCountFROM 废话 a

  • MySQL: 8.844s/11.10s
  • SQL Server: 0s/6s
  • 甲骨文: 0.046s/1.563s

因此 MySQL 的性能提高了 4-5 倍,Oracle 的速度大约是其两倍,而 SQL Server 可以说只是快了一点.

重点仍然是:相关子查询版本在所有情况下都更快.

相关子查询的另一个优点是它们在语法上更清晰且更易于扩展.我的意思是,如果您想在一堆其他表格中进行计数,则每个表格都可以干净而轻松地作为另一个选择项包含在内.例如:想象一下客户开具发票的记录,其中这些发票未付、逾期或已付.使用简单的子查询:

SELECT id,(SELECT COUNT(1) FROM invoices WHERE customer_id = c.id AND status = 'UNPAID') unpaid_invoices,(SELECT COUNT(1) FROM invoices WHERE customer_id = c.id AND status = 'OVERDUE') overdue_invoices,(SELECT COUNT(1) FROM invoices WHERE customer_id = c.id AND status = 'PAID')paid_invoices来自客户 c

聚合版本要丑得多.

现在我并不是说子查询总是优于聚合连接,但通常情况下,您必须对其进行测试.根据您的数据、该数据的大小以及您的 RDBMS 供应商,差异可能非常显着.

I can't seem to find a relevant example out there.

I'm trying to return a sub-set of a table, and for each row in that table, I want to check how many children it has, and return that number as part of the result set.

Parent Table Columns: PK_ID, Column1, Column2, FK1

For each FK1 in result set, select count(*) from child_table.

Final result set

3, col1text, col2text, 1(child)
5, col1texta, col2texta, 2(child)
6, col1textb, col2textb, 0(child)
9, col1textc, col2textc, 4(child)

I'm struggling with the best way to reference a column in the result set in another query, and then join them together again. Using T-sql

解决方案

Ok, apparently, based on the upvotes for the other answer, this needs further explanation. Example (done with MySQL because I have it handy but the principle is universal to any SQL dialect):

CREATE TABLE Blah (
  ID INT PRIMARY KEY,
  SomeText VARCHAR(30),
  ParentID INT
)

INSERT INTO Blah VALUES (1, 'One', 0);
INSERT INTO Blah VALUES (2, 'Two', 0);
INSERT INTO Blah VALUES (3, 'Three', 1);
INSERT INTO Blah VALUES (4, 'Four', 1);
INSERT INTO Blah VALUES (5, 'Five', 4);

Left join version:

SELECT a.ID, a.SomeText, COUNT(1)
FROM Blah a
JOIN Blah b ON a.ID= b.ParentID
GROUP BY a.ID, a.SomeText

Wrong. Ignores the case with no children.

Left outer join:

SELECT a.ID, a.SomeText, COUNT(1)
FROM Blah a
LEFT OUTER JOIN Blah b ON a.ID= b.ParentID
GROUP BY a.ID, a.SomeText

Wrong and the reason why is somewhat subtle. COUNT(1) counts NULL rows whereas COUNT(b.ID) doesn't. So the above is wrong but this is correct:

SELECT a.ID, a.SomeText, COUNT(b.ID)
FROM Blah a
LEFT OUTER JOIN Blah b ON a.ID= b.ParentID
GROUP BY a.ID, a.SomeText

Correlated subquery:

SELECT ID, SomeText, (SELECT COUNT(1) FROM Blah WHERE ParentID= a.ID) ChildCount
FROM Blah a

Also correct.

Ok, so which to use? Plans only tell you so much. The issue of subqueries vs left-joins is an old one and there's no clear answer without benchmarking it. So we need some data:

<?php
ini_set('max_execution_time', 180);

$start = microtime(true);

echo "<pre>\n";

mysql_connect('localhost', 'scratch', 'scratch');
if (mysql_error()) {
    echo mysql_error();
    exit();
}
mysql_select_db('scratch');
if (mysql_error()) {
    echo mysql_error();
    exit();
}

$count = 0;
$limit = 1000000;
$this_level = array(0);
$next_level = array();

while ($count < $limit) {
    foreach ($this_level as $parent) {
        $child_count = rand(0, 3);
        for ($i=0; $i<$child_count; $i++) {
            $count++;
            query("INSERT INTO Blah (ID, SomeText, ParentID) VALUES ($count, 'Text $count', $parent)");
            $next_level[] = $count;
        }
    }
    $this_level = $next_level;
    $next_level = array();
}

$stop = microtime(true);
$duration = $stop - $start;
$inserttime = $duration / $count;

echo "$count users added.\n";
echo "Program ran for $duration seconds.\n";
echo "Insert time $inserttime seconds.\n";
echo "</pre>\n";

function query($query) {
    mysql_query($query);
    if (mysql_error()) {
        echo mysql_error();
        exit();
    }
}
?>

I ran out of memory (32M) during this run so only ended up with 876,109 records but hey it will do. Later, when I test Oracle and SQL Server I take the exact same set of data and import it into Oracle XE and SQL Server Express 2005.

Now another poster raised the issue of my using a count wrapper around the queries. He correctly pointed out that the optimizer may not execute the subqueries in that case. MySQL doesn't seem to be that smart. Oracle is. SQL Server seems to be as well.

So I'll quote two figures for each database-query combination: the first is wrapped in SELECT COUNT(1) FROM ( ... ), the second is raw.

Setup:

  • MySQL 5.0 using PremiumSoft Navicat (LIMIT 10000 in query);
  • SQL Server Express 2005 using Microsoft SQL Server Management Studio Express;
  • Oracle XE using PL/SQL Developer 7 (limited to 10,000 rows).

Left outer join:

SELECT a.ID, a.SomeText, COUNT(b.ID)
FROM Blah a
LEFT OUTER JOIN Blah b ON a.ID= b.ParentID
GROUP BY a.ID, a.SomeText

  • MySQL: 5.0: 51.469s / 49.907s
  • SQL Server: 0(1) / 9s(2)
  • Oracle XE: 1.297s / 2.656s

(1) Virtually instantaneous (confirming the different execution path)
(2) Impressive considering it is returning all the rows, not 10,000

Just goes to show the value of a real database. Also, removing the SomeText field had a significant impact on MySQL's performance. Also there wasn't much difference between the limit of 10000 and not having it with MySQL (improving performance by a factor of 4-5). Oracle had it just because PL/SQL Developer barfed when it hit 100M memory usage.

Correlated Subquery:

SELECT ID, SomeText, (SELECT COUNT(1) FROM Blah WHERE ParentID= a.ID) ChildCount
FROM Blah a

  • MySQL: 8.844s / 11.10s
  • SQL Server: 0s / 6s
  • Oracle: 0.046s / 1.563s

So MySQL is better by a factor of 4-5, Oracle is about twice as fast and SQL Server is arguably only a little faster.

The point remains: the correlated subquery version is faster in all cases.

The other advantage of correlated subqueries is that they are syntactically cleaner and easier to extend. By this I mean that if you want to do a count in a bunch of other tables, each can be included as another select item cleanly and easily. For example: imagine a record of customers to invoices where those invoices were either unpaid, overdue or paid. With a subquery that is easy:

SELECT id,
  (SELECT COUNT(1) FROM invoices WHERE customer_id = c.id AND status = 'UNPAID') unpaid_invoices,
  (SELECT COUNT(1) FROM invoices WHERE customer_id = c.id AND status = 'OVERDUE') overdue_invoices,
  (SELECT COUNT(1) FROM invoices WHERE customer_id = c.id AND status = 'PAID') paid_invoices
FROM customers c

The aggregate version is a lot uglier.

Now I'm not saying that subqueries are always superior to aggregate joins but often enough they are that you have to test it. Depending on your data, the size of that data and your RDBMS vendor the difference can be hugely significant.

这篇关于SQL 父/子递归调用还是联合?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆