Get percentiles of data-set with group by month
Question
I have a SQL table with a whole load of records that look like this:
| Date       | Score |
+------------+-------+
| 01/01/2010 |     4 |
| 02/01/2010 |     6 |
| 03/01/2010 |    10 |
...
| 16/03/2010 |     2 |
I'm plotting this on a chart, so I get a nice line across the graph indicating score-over-time. Lovely.
Now, what I need to do is include the average score on the chart, so we can see how that changes over time, so I can simply add this to the mix:
SELECT
    YEAR(SCOREDATE) 'Year', MONTH(SCOREDATE) 'Month',
    MIN(SCORE) MinScore,
    AVG(SCORE) AverageScore,
    MAX(SCORE) MaxScore
FROM SCORES
GROUP BY YEAR(SCOREDATE), MONTH(SCOREDATE)
ORDER BY YEAR(SCOREDATE), MONTH(SCOREDATE)
No problems so far.
The problem is, how can I easily calculate the percentiles at each time-period? I'm not sure that's the correct phrase. What I need in total is:
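- A line of scores on the chart (easy)
- A line of averages on the chart (easy)
- A line on the chart showing the band that 95% of the scores occupy (stumped)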
It's the third one that I don't get. I need to calculate the 5% percentile figures, which I can do singly:
-- 45th percentile for January 2010: the highest score in the bottom 45%
SELECT MAX(SubQ.SCORE) FROM
    (SELECT TOP 45 PERCENT SCORE
     FROM SCORES
     WHERE YEAR(SCOREDATE) = 2010 AND MONTH(SCOREDATE) = 1
     ORDER BY SCORE ASC) AS SubQ

-- 55th percentile for January 2010: the lowest score in the top 45%
SELECT MIN(SubQ.SCORE) FROM
    (SELECT TOP 45 PERCENT SCORE
     FROM SCORES
     WHERE YEAR(SCOREDATE) = 2010 AND MONTH(SCOREDATE) = 1
     ORDER BY SCORE DESC) AS SubQ
But I can't work out how to get a table of all the months.
| Date       | Average | 45% | 55% |
+------------+---------+-----+-----+
| 01/01/2010 |      13 |  11 |  15 |
| 02/01/2010 |      10 |   8 |  12 |
| 03/01/2010 |       5 |   4 |  10 |
...
| 16/03/2010 |       7 |   7 |   9 |
At the moment I'm going to have to load this lot up into my app and calculate the figures myself, or run a large number of individual queries and collate the results.
Recommended answer
Whew. This was a real brain teaser. First, my table schema for testing was:
Create Table Scores
    (
    Id int not null identity(1,1) primary key clustered
    , [Date] datetime not null
    , Score int not null
    )
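To try the queries below against this schema, a few rows can be loaded first. This is only a smoke-test sketch: the four rows are the ones visible in the question's sample table (read as dd/mm/yyyy), and the `Union All` form is used on the assumption that it should also run on SQL 2000. With so few rows per month the percentile columns will not be meaningful, so a real test would need far more data.

```sql
-- Rows taken from the question's sample table; dates use the
-- unambiguous YYYYMMDD literal format. Id is filled by the identity.
Insert Scores([Date], Score)
Select '20100101', 4 Union All
Select '20100102', 6 Union All
Select '20100103', 10 Union All
Select '20100316', 2
```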
Now, first, I calculated the values using a CTE in SQL 2008 in order to check my answers and then I built a solution that should work in SQL 2000. So, in SQL 2008 we do something like:
;With
SummaryStatistics As
    (
    -- Min, max and average score per calendar month
    Select Year([Date]) As YearNum
        , Month([Date]) As MonthNum
        , Min(Score) As MinScore
        , Max(Score) As MaxScore
        , Avg(Score) As AvgScore
    From Scores
    Group By Month([Date]), Year([Date])
    )
, Percentiles As
    (
    -- Split each month's scores, in ascending order, into 100 buckets
    Select Year([Date]) As YearNum
        , Month([Date]) As MonthNum
        , Score
        , NTile( 100 ) Over ( Partition By Month([Date]), Year([Date]) Order By Score ) As Percentile
    From Scores
    )
, ReportedPercentiles As
    (
    -- Lowest score in bucket 45 and in bucket 55 for each month
    Select YearNum, MonthNum
        , Min(Case When Percentile = 45 Then Score End) As Percentile45
        , Min(Case When Percentile = 55 Then Score End) As Percentile55
    From Percentiles
    Where Percentile In(45,55)
    Group By YearNum, MonthNum
    )
Select SS.YearNum, SS.MonthNum
    , SS.MinScore, SS.MaxScore, SS.AvgScore
    , RP.Percentile45, RP.Percentile55
From SummaryStatistics As SS
    Join ReportedPercentiles As RP
        On RP.YearNum = SS.YearNum
            And RP.MonthNum = SS.MonthNum
Order By SS.YearNum, SS.MonthNum
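One caveat with NTile( 100 ): when a month has fewer than 100 rows, SQL Server only numbers the buckets up to the row count, so buckets 45 and 55 may not exist and the inner join would then drop that month. On SQL Server 2012 or later (an assumption; the answer itself targets 2008 and 2000), a sketch using Percentile_Disc avoids that. It is not the method the answer uses, and its results may differ slightly from the NTile version.

```sql
-- Assumes SQL Server 2012+ and the Scores table defined above.
-- Percentile_Disc is an analytic function, not an aggregate, so it
-- returns one value per input row; Distinct collapses to one row per month.
;With
MonthlyPercentiles As
    (
    Select Distinct
        Year([Date]) As YearNum
        , Month([Date]) As MonthNum
        , Percentile_Disc(0.45) Within Group (Order By Score)
            Over (Partition By Year([Date]), Month([Date])) As Percentile45
        , Percentile_Disc(0.55) Within Group (Order By Score)
            Over (Partition By Year([Date]), Month([Date])) As Percentile55
    From Scores
    )
Select YearNum, MonthNum, Percentile45, Percentile55
From MonthlyPercentiles
Order By YearNum, MonthNum
```

Percentile_Disc always returns a score that actually occurs in the month. Percentile_Cont takes the same arguments and interpolates between neighbouring scores instead.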
Now for a SQL 2000 solution. In essence, the trick is to use a couple of temporary tables to tally the occurrences of the scores.
If object_id('tempdb..#Working') is not null
    DROP TABLE #Working
GO
-- One row per month and distinct score, with the number of times it occurs
Create Table #Working
    (
    YearNum int not null
    , MonthNum int not null
    , Score int not null
    , Occurances int not null
    , Constraint PK_#Working Primary Key Clustered ( MonthNum, YearNum, Score )
    )
GO
Insert #Working(MonthNum, YearNum, Score, Occurances)
Select Month([Date]), Year([Date]), Score, Count(*)
From Scores
Group By Month([Date]), Year([Date]), Score
GO
If object_id('tempdb..#SummaryStatistics') is not null
    DROP TABLE #SummaryStatistics
GO
Create Table #SummaryStatistics
    (
    MonthNum int not null
    , YearNum int not null
    , Score int not null
    , Occurances int not null
    , CumulativeTotal int not null
    , Percentile float null
    , Constraint PK_#SummaryStatistics Primary Key Clustered ( MonthNum, YearNum, Score )
    )
GO
-- CumulativeTotal = number of rows in the same month with a strictly lower score
Insert #SummaryStatistics(YearNum, MonthNum, Score, Occurances, CumulativeTotal)
Select W2.YearNum, W2.MonthNum, W2.Score, W2.Occurances, Sum(W1.Occurances)-W2.Occurances
From #Working As W1
    Join #Working As W2
        On W2.YearNum = W1.YearNum
            And W2.MonthNum = W1.MonthNum
Where W1.Score <= W2.Score
Group By W2.YearNum, W2.MonthNum, W2.Score, W2.Occurances

-- Percentile = rows below this score, as a percentage of the rows
-- below the month's highest score
Update #SummaryStatistics
Set Percentile = SS.CumulativeTotal * 100.0 / MonthTotal.Total
From #SummaryStatistics As SS
    Join (
        Select SS1.YearNum, SS1.MonthNum, Max(SS1.CumulativeTotal) As Total
        From #SummaryStatistics As SS1
        Group By SS1.YearNum, SS1.MonthNum
        ) As MonthTotal
        On MonthTotal.YearNum = SS.YearNum
            And MonthTotal.MonthNum = SS.MonthNum

-- Join the per-month summary to the lowest score whose percentile
-- reaches 45 and 55 respectively
Select GeneralStats.*, Percentiles.Percentile45, Percentiles.Percentile55
From (
    Select Year(S1.[Date]) As YearNum
        , Month(S1.[Date]) As MonthNum
        , Min(S1.Score) As MinScore
        , Max(S1.Score) As MaxScore
        , Avg(S1.Score) As AvgScore
    From Scores As S1
    Group By Month(S1.[Date]), Year(S1.[Date])
    ) As GeneralStats
    Join (
        Select SS1.YearNum, SS1.MonthNum
            , Min(Case When SS1.Percentile >= 45 Then Score End) As Percentile45
            , Min(Case When SS1.Percentile >= 55 Then Score End) As Percentile55
        From #SummaryStatistics As SS1
        Group By SS1.YearNum, SS1.MonthNum
        ) As Percentiles
        On Percentiles.YearNum = GeneralStats.YearNum
            And Percentiles.MonthNum = GeneralStats.MonthNum
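To see what the tally tables hold, here is a small worked sketch of the same arithmetic for a single month. The scores (2, 4, 4, 6, 10) and the table variables @Tally and @Cumulative are hypothetical, introduced only for illustration; the self-join mirrors the #SummaryStatistics insert above.

```sql
-- Hypothetical scores for one month: 2, 4, 4, 6, 10, already tallied.
Declare @Tally Table
    (
    Score int not null primary key
    , Occurances int not null
    )

Insert @Tally(Score, Occurances)
Select 2, 1 Union All
Select 4, 2 Union All
Select 6, 1 Union All
Select 10, 1

-- Rows strictly below each score (same expression as the answer uses).
Declare @Cumulative Table
    (
    Score int not null primary key
    , CumulativeTotal int not null
    )

Insert @Cumulative(Score, CumulativeTotal)
Select T2.Score, Sum(T1.Occurances) - T2.Occurances
From @Tally As T1
    Join @Tally As T2
        On T1.Score <= T2.Score
Group By T2.Score, T2.Occurances

-- Scale by the largest cumulative total, as the Update statement does.
Select C.Score, C.CumulativeTotal
    , C.CumulativeTotal * 100.0 / M.Total As Percentile
From @Cumulative As C
    Cross Join ( Select Max(CumulativeTotal) As Total From @Cumulative ) As M
Order By C.Score
```

The cumulative totals come out as 0, 1, 3 and 4, giving percentiles of 0, 25, 75 and 100 for scores 2, 4, 6 and 10. The lowest score at or above the 45 mark is therefore 6, and the same holds for 55. A month containing only one distinct score would make the divisor zero, so that case may need guarding, for example with NullIf.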