选择超过总值百分比的行的子集 [英] Selecting a subset of rows that exceed a percentage of total values
问题描述
我有一张与客户,用户和收入类似的表(实际上成千上万条记录):
I have a table with customers, users and revenue similar to below (in reality thousands of records):
Customer User Revenue
001 James 500
002 James 750
003 James 450
004 Sarah 100
005 Sarah 500
006 Sarah 150
007 Sarah 600
008 James 150
009 James 100
我想做的是只返回支出最高的客户,这些用户占用户总收入的80%.
What I want to do is to return only the highest spending customers that make up 80% of the total revenue for the user.
要手动执行此操作,我将按James的客户的收入顺序对其进行排序,计算出总计的百分比和运行的总计百分比,然后仅返回记录,直到运行的总计达到80%:
To do this manually I would order James' customers by their revenue, work out the percentage of total and a running total percentage, then only return records up to the point that the running total hits 80%:
Customer User Revenue % of total Running Total %
002 James 750 0.38 0.38
001 James 500 0.26 0.64
003 James 450 0.23 0.87 <- Greater than 80%, last record
008 James 150 0.08 0.95
009 James 100 0.05 1.00
我已经尝试过使用CTE,但到目前为止还是空白.有什么方法可以通过单个查询而不是在Excel工作表中手动完成此操作?
I've tried using a CTE but so far have come up blank. Is there any way to do this via a single query rather than manually in an Excel sheet?
推荐答案
仅SQL Server 2012 +
SQL Server 2012+
only
您可以使用窗口化的 SUM
:
WITH cte AS
(
SELECT *,
1.0 * Revenue/SUM(Revenue) OVER(PARTITION BY [User]) AS percentile,
1.0 * SUM(Revenue) OVER(PARTITION BY [User] ORDER BY [Revenue] DESC)
/SUM(Revenue) OVER(PARTITION BY [User]) AS running_percentile
FROM tab
)
SELECT *
FROM cte
WHERE running_percentile <= 0.8;
SQL Server 2008:
SQL Server 2008:
WITH cte AS
(
SELECT *, ROW_NUMBER() OVER(PARTITION BY [User] ORDER BY Revenue DESC) AS rn
FROM t
), cte2 AS
(
SELECT c.Customer, c.[User], c.[Revenue]
,percentile = 1.0 * Revenue / NULLIF(c3.s,0)
,running_percentile = 1.0 * c2.s / NULLIF(c3.s,0)
FROM cte c
CROSS APPLY
(SELECT SUM(Revenue) AS s
FROM cte c2
WHERE c.[User] = c2.[User]
AND c2.rn <= c.rn) c2
CROSS APPLY
(SELECT SUM(Revenue) AS s
FROM cte c2
WHERE c.[User] = c2.[User]) AS c3
)
SELECT *
FROM cte2
WHERE running_percentile <= 0.8;
输出:
╔══════════╦═══════╦═════════╦════════════════╦════════════════════╗
║ Customer ║ User ║ Revenue ║ percentile ║ running_percentile ║
╠══════════╬═══════╬═════════╬════════════════╬════════════════════╣
║ 2 ║ James ║ 750 ║ 0,384615384615 ║ 0,384615384615 ║
║ 1 ║ James ║ 500 ║ 0,256410256410 ║ 0,641025641025 ║
║ 7 ║ Sarah ║ 600 ║ 0,444444444444 ║ 0,444444444444 ║
╚══════════╩═══════╩═════════╩════════════════╩════════════════════╝
看上去差不多,唯一的麻烦是缺少最后一行,詹姆斯的第三排让他超过0.80,但需要包括在内.
That looks nearly there, the only niggle is it's missing the last row, the third row for James takes him over 0.80 but needs to be included.
WITH cte AS
(
SELECT *, ROW_NUMBER() OVER(PARTITION BY [User] ORDER BY Revenue DESC) AS rn
FROM t
), cte2 AS
(
SELECT c.Customer, c.[User], c.[Revenue]
,percentile = 1.0 * Revenue / NULLIF(c3.s,0)
,running_percentile = 1.0 * c2.s / NULLIF(c3.s,0)
FROM cte c
CROSS APPLY
(SELECT SUM(Revenue) AS s
FROM cte c2
WHERE c.[User] = c2.[User]
AND c2.rn <= c.rn) c2
CROSS APPLY
(SELECT SUM(Revenue) AS s
FROM cte c2
WHERE c.[User] = c2.[User]) AS c3
)
SELECT a.*
FROM cte2 a
CROSS APPLY (SELECT MIN(running_percentile) AS rp
FROM cte2
WHERE running_percentile >= 0.8
AND cte2.[User] = a.[User]) AS s
WHERE a.running_percentile <= s.rp;
输出:
╔══════════╦═══════╦═════════╦════════════════╦════════════════════╗
║ Customer ║ User ║ Revenue ║ percentile ║ running_percentile ║
╠══════════╬═══════╬═════════╬════════════════╬════════════════════╣
║ 2 ║ James ║ 750 ║ 0,384615384615 ║ 0,384615384615 ║
║ 1 ║ James ║ 500 ║ 0,256410256410 ║ 0,641025641025 ║
║ 3 ║ James ║ 450 ║ 0,230769230769 ║ 0,871794871794 ║
║ 7 ║ Sarah ║ 600 ║ 0,444444444444 ║ 0,444444444444 ║
║ 5 ║ Sarah ║ 500 ║ 0,370370370370 ║ 0,814814814814 ║
╚══════════╩═══════╩═════════╩════════════════╩════════════════════╝
看上去很完美,已翻译成我的大桌子并返回了我需要的东西,花了5分钟时间完成它,仍然无法跟随你所做的事情!
SQL Server 2008
不支持 OVER()
子句中的所有内容,但 ROW_NUMBER
则支持.
SQL Server 2008
does not support everything in OVER()
clause, but ROW_NUMBER
does.
第一个cte只是计算组内的位置:
First cte just calculate position within a group:
╔═══════════╦════════╦══════════╦════╗
║ Customer ║ User ║ Revenue ║ rn ║
╠═══════════╬════════╬══════════╬════╣
║ 2 ║ James ║ 750 ║ 1 ║
║ 1 ║ James ║ 500 ║ 2 ║
║ 3 ║ James ║ 450 ║ 3 ║
║ 8 ║ James ║ 150 ║ 4 ║
║ 9 ║ James ║ 100 ║ 5 ║
║ 7 ║ Sarah ║ 600 ║ 1 ║
║ 5 ║ Sarah ║ 500 ║ 2 ║
║ 6 ║ Sarah ║ 150 ║ 3 ║
║ 4 ║ Sarah ║ 100 ║ 4 ║
╚═══════════╩════════╩══════════╩════╝
第二cte:
-
c2
子查询根据ROW_NUMBER
中的排名计算运行总计 -
c3
计算每个用户的总和
c2
subquery calculate running total based on rank fromROW_NUMBER
c3
calculate full sum per user
在最终查询 s
子查询中,发现运行总数最低的
总和超过了80%.
In final query s
subquery finds the lowest running
total that exceeds 80%.
使用 ROW_NUMBER
实际上是多余的.
WITH cte AS
(
SELECT c.Customer, c.[User], c.[Revenue]
,percentile = 1.0 * Revenue / NULLIF(c3.s,0)
,running_percentile = 1.0 * c2.s / NULLIF(c3.s,0)
FROM t c
CROSS APPLY
(SELECT SUM(Revenue) AS s
FROM t c2
WHERE c.[User] = c2.[User]
AND c2.Revenue >= c.Revenue) c2
CROSS APPLY
(SELECT SUM(Revenue) AS s
FROM t c2
WHERE c.[User] = c2.[User]) AS c3
)
SELECT a.*
FROM cte a
CROSS APPLY (SELECT MIN(running_percentile) AS rp
FROM cte c2
WHERE running_percentile >= 0.8
AND c2.[User] = a.[User]) AS s
WHERE a.running_percentile <= s.rp
ORDER BY [User], Revenue DESC;
这篇关于选择超过总值百分比的行的子集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!