SQL Server-具有关联的条件聚合 [英] SQL Server - conditional aggregation with correlation
问题描述
背景:
原始案例非常简单。计算从最高收入到最低收入的每位用户总运行量:
创建表t(客户整数非空主键
, 用户 VARCHAR(5)非空
,收入整数不为空);
插入t(Customer, User,Revenue)值
(001,'James',500),(002,'James',750),(003,'James ',450),
(004,'Sarah',100),(005,'Sarah',500),(006,'Sarah',150),
(007,'Sarah', 600),(008,'James',150),(009,'James',100);
查询:
SELECT *,
1.0 *收入/总和(收入)超过(按用户划分)AS百分比,
1.0 *总和(收入)超过(按(用户)划分)ORDER BY收入DESC)
/总和(收入)超过(按用户划分的部分)AS running_percentage
FROM t;
输出:
╔════╦═══════╦═════════╦══ ══════════╦════════════════════╗
║ID║用户║收入║百分比║running_percentage║$ b $ ╠════╬═══════╬═════════╬════════════╬════════════ ════════╣
║2║詹姆斯║750║0.38║0.38║
║1║詹姆斯║500║0.26║0.64║
║3║詹姆斯║450║ 0.23║0.87║
║8║James║150║0.08║0.95║
║9║James║100║0.05║1║
║7║Sarah║600║0.44║0.44║
║5║Sarah║500║0.37║0.81║
║6║Sarah║150║0.11║0.93║
║4║Sarah║100║0.07║1║
╚═ ═══牛皮═══════牛皮═ b牛皮════════════牛皮════════════════════╝
可以使用特定的窗口函数来进行不同的计算。
现在让我们假设我们不能使用窗口化的 SUM
并重写它:
SELECT c。客户,c。用户,c。收入
,1.0 *收入/ NULLIF(c3.s,0)AS百分比
,1.0 * c2.s / NULLIF(c3.s,0)AS running_percentage
从tc
交叉应用
(SELECT c。用户 = c2。用户
AND c2。收入> = c。收入)AS c2
从t c2
的SUM(Revenue)AS s
交叉应用
(选择SUM(Revenue)AS s
,从t c2
c。 User = c2。 User)作为c3
ORDER BY User,收入DESC;
我有使用交叉应用
是因为我不喜欢 SELECT
列列表和 c3 中的相关子查询code>使用两次。
一切正常。但是当我们仔细观察时, c2
和 c3
非常相似。那么,为什么不合并它们并使用简单的条件聚合呢?
SELECT c.Customer,c。 User,c。 Revenue
,1.0 *收入/ NULLIF(c2.sum_total,0)AS百分比
,1.0 * c2.sum_running / NULLIF(c2.sum_total,0)AS running_percentage
FROM tc
交叉应用
(选择总和(收入)AS sum_total,
总和(以c2.Revenue> = c.Revenue THESE Revenue ELSE 0 END为例)
AS sum_running
FROM t c2
在哪里c。用户 = c2。用户)AS c2
按用户订购,收入DESC;
不幸的是,这是不可能的。
在包含外部引用的聚合表达式中指定了多个列。如果要聚合的表达式包含外部引用,则该外部引用必须是表达式中引用的唯一列。
可能会绕过它与另一个子查询的包装,但是会变得有些丑陋:
SELECT c.Customer,c。用户,c。收入
,1.0 *收入/ NULLIF(c2.sum_total,0)AS百分比
,1.0 * c2.sum_running / NULLIF(c2.sum_total,0)AS running_percentage
FROM tc
交叉应用
(SELECT SUM(Revenue)AS sum_total,
SUM(running_revenue)AS sum_running
FROM(SELECT Revenue,
在c2.Revenue的情况下> = c。收入然后其他收入0结束
AS running_revenue
从t c2
在哪里c。用户 = c2。用户)AS sub
)AS c2
按用户订购,收入DESC
PostgreSQL
版本。唯一的区别是 LATERAL
而不是 CROSS APPLY
。
SELECT c。客户,c。用户,c。收入
,1.0 *收入/ NULLIF(c2.sum_total,0)AS百分比
,1.0 * c2.running_sum / NULLIF(c2.sum_total,0)AS running_percentage
从tc
,横向(SELECT SUM(Revenue)AS sum_total,
SUM(在c2.Revenue> = c的情况下。收入然后c2。收入其他0结束)
AS running_sum
从t c2
在哪里c。用户 = c2。用户)c2
按用户订购,收益DESC;
它非常好用。
SQLite
/ MySQL
版本(这就是为什么我更喜欢横向/交叉应用
)的原因:
选择c.Customer,c。 User,c.Revenue,
1.0 *收入/(SELECT SUM(Revenue)
FROM t c2
W. c。用户 = c2。用户)AS百分比,
1.0 *(选择总和(在c2.Revenue> = c.Revenue THEN c2.Revenue ELSE 0 END的情况下) )
从t c2
在哪里c。用户 = c2。用户)/
(SELECT SUM(c2.Revenue)
从t c2
在哪里c。用户 = c2。用户)AS running_percentage
从tc
按用户订购,收益DESC;
SQLFiddleDemo-SQLite
< a href = http://sqlfiddle.com/#!9/c85e83/5/0 rel = nofollow noreferrer> SQLFiddleDemo-MySQL
我已阅读具有外部引用的聚合:
限制在
SQL-92
标准中,并且SQL Server
从Sybase继承
代码库。问题是SQL Server需要弄清楚哪个查询将计算聚合。
我不搜索仅显示如何规避它。
问题是:
- 标准的哪个部分不允许或干扰它?
- 为什么其他RDBMS对此外部依赖项没有问题?
- 它们扩展
SQL Standard
和SQL Server
的行为或SQL Server
不能完全实现(正确吗?)?。
我非常感谢您引用:
-
ISO标准
(92或更高版本) - SQL Server标准支持
- 任何解释它的RDBMS的官方文档(
SQL Server / Postgresql / Oracle /...
)。
编辑:
我知道 SQL-92
没有 LATERAL
的概念。但是带有子查询的版本(例如在 SQLite / MySQL
中)也不起作用。
LiveDemo
p>
编辑2:
为了简化一点,让我们仅检查相关子查询:
SELECT c.Customer,c。 User,c.Revenue,
1.0 *(SELECT SUM(在c2.Revenue> = c.Revenue THEN c2.Revenue ELSE 0 END的情况下c。 User = c2。 User)
/(SELECT SUM(c2.Revenue)
FROM t c2
c。 User = c2。 User)AS running_percentage
FROM tc
ORDER BY User,收益DESC ;
以上版本在 MySQL / SQLite / Postgresql $ c中工作正常$ c>。
在 SQL Server
中,我们会收到错误消息。将其与子查询一起包装以将其展平到一个级别后,它就会起作用:
SELECT c.Customer,c。用户 ,c.Revenue,
1.0 *(
SELECT SUM(r1> = r2然后r1 ELSE 0 END的情况)
FROM(SELECT c2.Revenue AS r1,c.Revenue r2
从t c2
在哪里c。用户 = c2。用户)AS S)/
(SELECT SUM(c2.Revenue)
从t c2
c。用户 = c2。用户)AS running_percentage
从tc
按用户订购,收入DESC;
这个问题的重点是 SQL标准调节它。
LATERAL
的SQL标准。 CROSS APPLY
是Microsoft的特定于供应商的扩展(Oracle后来为了兼容性而采用了该扩展),并且由于MS功能,其限制显然不是由于ISO / IEC SQL标准引起的早于标准。
LATERAL
基本上只是联接的修饰符,允许联接树中的横向引用。可以引用的列数没有限制。
我看不出出现奇数限制的原因。也许是因为 CROSS APPLY
最初旨在允许使用表值函数,后来又扩展为允许子 SELECT
s。
Postgres 手册解释 LATERAL
像这样:
LATERAL
关键字可以在子<SELECT FROM
项之前。这允许
子SELECT
引用出现在$ b $之前的FROM
项的列b在FROM
列表中。 (在没有LATERAL
的情况下,每个子<SELECT
都会被独立评估
,因此不能交叉引用任何其他FROM
项目。)
查询的Postgres版本(没有更优雅的样式)窗口函数)可以更简单:
SELECT c。*
,round(revenue / c2.sum_total,2) END AS百分比
,round(c2.running_sum / c2.sum_total,2)END AS running_percentage
从tc,横向(
SELECT NULLIF(SUM(revenue),0):: numeric AS sum_total-NULLIF,强制转换一次
,SUM(收入)过滤器(收入> c =收入)AS running_sum
FROM t
WHERE User = c。 User
)c2
订单由c。用户,c.revenue DESC;
-
Postgres 9.4+具有更优雅的总体
FILTER
表示条件聚合。 -
NULLIF
很有意义,我只建议稍作简化。 -
将
sum_total
转换为数字
一次。 -
舍入结果与您想要的结果相匹配。
Background:
The original case was very simple. Calculate running total per user from highest revenue to lowest:
CREATE TABLE t(Customer INTEGER NOT NULL PRIMARY KEY
,"User" VARCHAR(5) NOT NULL
,Revenue INTEGER NOT NULL);
INSERT INTO t(Customer,"User",Revenue) VALUES
(001,'James',500),(002,'James',750),(003,'James',450),
(004,'Sarah',100),(005,'Sarah',500),(006,'Sarah',150),
(007,'Sarah',600),(008,'James',150),(009,'James',100);
Query:
SELECT *,
1.0 * Revenue/SUM(Revenue) OVER(PARTITION BY "User") AS percentage,
1.0 * SUM(Revenue) OVER(PARTITION BY "User" ORDER BY Revenue DESC)
/SUM(Revenue) OVER(PARTITION BY "User") AS running_percentage
FROM t;
Output:
╔════╦═══════╦═════════╦════════════╦════════════════════╗
║ ID ║ User ║ Revenue ║ percentage ║ running_percentage ║
╠════╬═══════╬═════════╬════════════╬════════════════════╣
║ 2 ║ James ║ 750 ║ 0.38 ║ 0.38 ║
║ 1 ║ James ║ 500 ║ 0.26 ║ 0.64 ║
║ 3 ║ James ║ 450 ║ 0.23 ║ 0.87 ║
║ 8 ║ James ║ 150 ║ 0.08 ║ 0.95 ║
║ 9 ║ James ║ 100 ║ 0.05 ║ 1 ║
║ 7 ║ Sarah ║ 600 ║ 0.44 ║ 0.44 ║
║ 5 ║ Sarah ║ 500 ║ 0.37 ║ 0.81 ║
║ 6 ║ Sarah ║ 150 ║ 0.11 ║ 0.93 ║
║ 4 ║ Sarah ║ 100 ║ 0.07 ║ 1 ║
╚════╩═══════╩═════════╩════════════╩════════════════════╝
It could be calculated differently using specific windowed functions.
Now let's assume that we cannot use windowed SUM
and rewrite it:
SELECT c.Customer, c."User", c."Revenue"
,1.0 * Revenue / NULLIF(c3.s,0) AS percentage
,1.0 * c2.s / NULLIF(c3.s,0) AS running_percentage
FROM t c
CROSS APPLY
(SELECT SUM(Revenue) AS s
FROM t c2
WHERE c."User" = c2."User"
AND c2.Revenue >= c.Revenue) AS c2
CROSS APPLY
(SELECT SUM(Revenue) AS s
FROM t c2
WHERE c."User" = c2."User") AS c3
ORDER BY "User", Revenue DESC;
I have used CROSS APPLY
because I do not like correlated subqueries in SELECT
colums list and c3
is used twice.
Everything work as it should. But when we look closer c2
and c3
are very similiar. So why not combine them and use simple conditional aggregation:
SELECT c.Customer, c."User", c."Revenue"
,1.0 * Revenue / NULLIF(c2.sum_total,0) AS percentage
,1.0 * c2.sum_running / NULLIF(c2.sum_total,0) AS running_percentage
FROM t c
CROSS APPLY
(SELECT SUM(Revenue) AS sum_total,
SUM(CASE WHEN c2.Revenue >= c.Revenue THEN Revenue ELSE 0 END)
AS sum_running
FROM t c2
WHERE c."User" = c2."User") AS c2
ORDER BY "User", Revenue DESC;
Unfortunately it is not possible.
Multiple columns are specified in an aggregated expression containing an outer reference. If an expression being aggregated contains an outer reference, then that outer reference must be the only column referenced in the expression.
Of course I could circumvent it wrapping with another subquery, but it becomes a bit "ugly":
SELECT c.Customer, c."User", c."Revenue"
,1.0 * Revenue / NULLIF(c2.sum_total,0) AS percentage
,1.0 * c2.sum_running / NULLIF(c2.sum_total,0) AS running_percentage
FROM t c
CROSS APPLY
( SELECT SUM(Revenue) AS sum_total,
SUM(running_revenue) AS sum_running
FROM (SELECT Revenue,
CASE WHEN c2.Revenue >= c.Revenue THEN Revenue ELSE 0 END
AS running_revenue
FROM t c2
WHERE c."User" = c2."User") AS sub
) AS c2
ORDER BY "User", Revenue DESC
Postgresql
version. The only difference is LATERAL
instead of CROSS APPLY
.
SELECT c.Customer, c."User", c.Revenue
,1.0 * Revenue / NULLIF(c2.sum_total,0) AS percentage
,1.0 * c2.running_sum / NULLIF(c2.sum_total,0) AS running_percentage
FROM t c
,LATERAL (SELECT SUM(Revenue) AS sum_total,
SUM(CASE WHEN c2.Revenue >= c.Revenue THEN c2.Revenue ELSE 0 END)
AS running_sum
FROM t c2
WHERE c."User" = c2."User") c2
ORDER BY "User", Revenue DESC;
It works very nice.
SQLite
/MySQL
version (that is why I preferLATERAL/CROSS APPLY
):
SELECT c.Customer, c."User", c.Revenue,
1.0 * Revenue / (SELECT SUM(Revenue)
FROM t c2
WHERE c."User" = c2."User") AS percentage,
1.0 * (SELECT SUM(CASE WHEN c2.Revenue >= c.Revenue THEN c2.Revenue ELSE 0 END)
FROM t c2
WHERE c."User" = c2."User") /
(SELECT SUM(c2.Revenue)
FROM t c2
WHERE c."User" = c2."User") AS running_percentage
FROM t c
ORDER BY "User", Revenue DESC;
SQLFiddleDemo-SQLite
SQLFiddleDemo-MySQL
I've read Aggregates with an Outer Reference:
The source for the restriction is in the
SQL-92
standard, andSQL Server
inherited it from theSybase
codebase. The problem is that SQL Server needs to figure out which query will compute the aggregate.
I do not search for answers that only show how to circumvent it.
The questions are:
- Which part of standard disallow or interfere with it?
- Why other RDBMSes do not have problem with this kind of outer dependency?
- Do they extend
SQL Standard
andSQL Server
behaves as it should orSQL Server
does not implement it fully(correctly?)?.
I would be very grateful for references to:
ISO standard
(92 or newer)- SQL Server Standards Support
- official documenation from any RDBMS that explains it (
SQL Server/Postgresql/Oracle/...
).
EDIT:
I know that SQL-92
does not have concept of LATERAL
. But version with subqueries (like in SQLite/MySQL
) does not work too.
EDIT 2:
To simplify it a bit, let's check only correlated subquery only:
SELECT c.Customer, c."User", c.Revenue,
1.0*(SELECT SUM(CASE WHEN c2.Revenue >= c.Revenue THEN c2.Revenue ELSE 0 END)
FROM t c2
WHERE c."User" = c2."User")
/ (SELECT SUM(c2.Revenue)
FROM t c2
WHERE c."User" = c2."User") AS running_percentage
FROM t c
ORDER BY "User", Revenue DESC;
The version above works fine in MySQL/SQLite/Postgresql
.
In SQL Server
we get error. After wraping it with subquery to "flatten" it to one level it works:
SELECT c.Customer, c."User", c.Revenue,
1.0 * (
SELECT SUM(CASE WHEN r1 >= r2 THEN r1 ELSE 0 END)
FROM (SELECT c2.Revenue AS r1, c.Revenue r2
FROM t c2
WHERE c."User" = c2."User") AS S) /
(SELECT SUM(c2.Revenue)
FROM t c2
WHERE c."User" = c2."User") AS running_percentage
FROM t c
ORDER BY "User", Revenue DESC;
The point of this question is how does SQL standard
regulate it.
There is no such limitation in the SQL standard for LATERAL
. CROSS APPLY
is a vendor-specific extension from Microsoft (Oracle adopted it later for compatibility) and its limitations are obviously not owed to the ISO/IEC SQL standard, since the MS feature pre-dates the standard.
LATERAL
in standard SQL is basically just a modifier for joins to allow lateral references in the join tree. There is no limit to the number of columns that can be referenced.
I wouldn't see a reason for the odd restriction to begin with. Maybe it's because CROSS APPLY
was originally intended to allow table-valued functions, which was later extended to allow sub-SELECT
s.
The Postgres manual explains LATERAL
like this:
The
LATERAL
key word can precede a sub-SELECT FROM
item. This allows the sub-SELECT
to refer to columns ofFROM
items that appear before it in theFROM
list. (WithoutLATERAL
, each sub-SELECT
is evaluated independently and so cannot cross-reference any otherFROM
item.)
The Postgres version of your query (without the more elegant window functions) can be simpler:
SELECT c.*
, round(revenue / c2.sum_total, 2) END AS percentage
, round(c2.running_sum / c2.sum_total, 2) END AS running_percentage
FROM t c, LATERAL (
SELECT NULLIF(SUM(revenue), 0)::numeric AS sum_total -- NULLIF, cast once
, SUM(revenue) FILTER (WHERE revenue >= c.revenue) AS running_sum
FROM t
WHERE "User" = c."User"
) c2
ORDER BY c."User", c.revenue DESC;
Postgres 9.4+ has the more elegant aggregate
FILTER
for conditional aggregates.NULLIF
makes sense, I only suggest a minor simplification.Cast
sum_total
tonumeric
once.Round result to match your desired result.
这篇关于SQL Server-具有关联的条件聚合的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!