在SQL Server中计算中位数的函数 [英] Function to Calculate Median in SQL Server

查看:866
本文介绍了在SQL Server中计算中位数的函数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

根据 MSDN ,Median不能用作聚合函数在Transact-SQL中.但是,我想找出是否可以创建此功能(使用创建Aggregate 函数,用户定义的函数或其他方法).

According to MSDN, Median is not available as an aggregate function in Transact-SQL. However, I would like to find out whether it is possible to create this functionality (using the Create Aggregate function, user defined function, or some other method).

做到这一点的最佳方法是什么(如果可能的话)-允许在聚合查询中计算中间值(假设数字数据类型)?

What would be the best way (if possible) to do this - allow for the calculation of a median value (assuming a numeric data type) in an aggregate query?

推荐答案

2019更新: 在我撰写此答案的10年中,发现了更多解决方案,可能会产生更好的结果.此外,此后的SQL Server版本(尤其是SQL 2012)引入了新的T-SQL功能,可用于计算中位数. SQL Server版本还改进了其查询优化器,这可能会影响各种中位数解决方案的性能.网络,我最初的2009年帖子仍然可以,但是对于现代SQL Server应用程序可能会有更好的解决方案.请参阅2012年的这篇文章,该文章是非常有用的资源: https://sqlperformance .com/2012/08/t-sql-queries/median

本文发现,至少在他们测试的简单模式上,以下模式比所有其他替代方法快得多,而且速度要快得多.该解决方案比测试的最慢(PERCENTILE_CONT)解决方案快373倍(!!!).请注意,此技巧需要两个单独的查询,这些查询可能并非在所有情况下都可行.它还需要SQL 2012或更高版本.

This article found the following pattern to be much, much faster than all other alternatives, at least on the simple schema they tested. This solution was 373x faster (!!!) than the slowest (PERCENTILE_CONT) solution tested. Note that this trick requires two separate queries which may not be practical in all cases. It also requires SQL 2012 or later.

DECLARE @c BIGINT = (SELECT COUNT(*) FROM dbo.EvenRows);

SELECT AVG(1.0 * val)
FROM (
    SELECT val FROM dbo.EvenRows
     ORDER BY val
     OFFSET (@c - 1) / 2 ROWS
     FETCH NEXT 1 + (1 - @c % 2) ROWS ONLY
) AS x;

当然,仅因为在2012年对一种架构进行的一项测试取得了很好的结果,所以您的工作量可能会有所不同,尤其是在使用SQL Server 2014或更高版本时.如果性能对于中位数计算很重要,我强烈建议尝试并性能测试该文章中建议的几个选项,以确保找到最适合您的模式的选项.

Of course, just because one test on one schema in 2012 yielded great results, your mileage may vary, especially if you're on SQL Server 2014 or later. If perf is important for your median calculation, I'd strongly suggest trying and perf-testing several of the options recommended in that article to make sure that you've found the best one for your schema.

使用(SQL Server 2012中的新增功能)函数其他答案之一推荐的之所以会提出这个问题,是因为上面链接的文章发现此内置功能比最快的解决方案慢373倍.此差异有可能在7年后得到改善,但是我个人不会在大桌子上使用此功能,直到我验证了其性能与其他解决方案的对比.

I'd also be especially careful using the (new in SQL Server 2012) function PERCENTILE_CONT that's recommended in one of the other answers to this question, because the article linked above found this built-in function to be 373x slower than the fastest solution. It's possible that this disparity has been improved in the 7 years since, but personally I wouldn't use this function on a large table until I verified its performance vs. other solutions.

2009年原始帖子如下:

ORIGINAL 2009 POST IS BELOW:

有很多方法可以执行此操作,而性能却大不相同.这是一个特别优化的解决方案,来自 中位数,ROW_NUMBER和效果 .对于执行过程中生成的实际I/O而言,这是一个特别理想的解决方案-看起来比其他解决方案要贵得多,但实际上要快得多.

There are lots of ways to do this, with dramatically varying performance. Here's one particularly well-optimized solution, from Medians, ROW_NUMBERs, and performance. This is a particularly optimal solution when it comes to actual I/Os generated during execution – it looks more costly than other solutions, but it is actually much faster.

该页面还包含其他解决方案和性能测试详细信息的讨论.请注意,如果有多行中位数列的值相同,请使用唯一列作为歧义消除器.

That page also contains a discussion of other solutions and performance testing details. Note the use of a unique column as a disambiguator in case there are multiple rows with the same value of the median column.

与所有数据库性能方案一样,始终尝试使用真实硬件上的真实数据测试解决方案–您永远都不知道何时更改SQL Server优化器或环境的特殊性会使正常快速的解决方案变慢. /p>

As with all database performance scenarios, always try to test a solution out with real data on real hardware – you never know when a change to SQL Server's optimizer or a peculiarity in your environment will make a normally-speedy solution slower.

SELECT
   CustomerId,
   AVG(TotalDue)
FROM
(
   SELECT
      CustomerId,
      TotalDue,
      -- SalesOrderId in the ORDER BY is a disambiguator to break ties
      ROW_NUMBER() OVER (
         PARTITION BY CustomerId
         ORDER BY TotalDue ASC, SalesOrderId ASC) AS RowAsc,
      ROW_NUMBER() OVER (
         PARTITION BY CustomerId
         ORDER BY TotalDue DESC, SalesOrderId DESC) AS RowDesc
   FROM Sales.SalesOrderHeader SOH
) x
WHERE
   RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY CustomerId
ORDER BY CustomerId;

这篇关于在SQL Server中计算中位数的函数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆