SQL: Show average and min/max within standard deviations

Question

I have the following SQL table:

Date       StoreNo       Sales
23/4            34     4323.00
23/4            23      564.00
24/4            34     2345.00
etc

I am running a query that returns the average, maximum and minimum sales for a certain period:

select avg(Sales), max(sales), min(sales)
from tbl_sales
where date between etc

But some of the values coming through in the min and max are really extreme, perhaps because the data entry was bad, perhaps because some anomaly occurred on that date at that store.

What I'd like is a query that returns the average, max and min but somehow excludes the extreme values. I am open to how this is done, but perhaps it would use standard deviations in some way (for example, only using data within x standard deviations of the true average).

Many thanks

Recommended answer

To filter on the standard deviation you first have to compute it, which means reading every row in the period before any row can be excluded, so this cannot be done in a single pass over the data. The lazy way is to just do it in two passes:

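-- STDEV returns float; int variables would truncate the mean and the deviation.
DECLARE
    @Avg float,
    @StDev float

-- Pass 1: mean and sample standard deviation over the period.
SELECT @Avg = AVG(Sales), @StDev = STDEV(Sales)
FROM tbl_sales
WHERE ...

-- Pass 2: aggregate only the rows within 3 standard deviations of the mean.
SELECT AVG(Sales) AS AvgSales, MAX(Sales) AS MaxSales, MIN(Sales) AS MinSales
FROM tbl_sales
WHERE ...
AND Sales >= @Avg - @StDev * 3
AND Sales <= @Avg + @StDev * 3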
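For what it's worth, the two passes can also be folded into one statement by computing the mean and variance in a CTE; the engine still reads the rows twice. The sketch below is an illustration under stated assumptions, not the method above: it runs against SQLite through Python's sqlite3 module rather than SQL Server, SQLite has no STDEV so the sample variance is built from sums and the filter compares squared deviations, and the sample rows, the function name stats_within_k_stdevs and the cutoff k = 2 are all made up. With only nine rows a single outlier can never sit 3 sample standard deviations from the mean, hence the tighter cutoff.

```python
import sqlite3

# Hypothetical sample data: eight ordinary sales and one extreme value.
ROWS = [("23/4", 34, 100.0), ("23/4", 23, 110.0), ("24/4", 34, 90.0),
        ("24/4", 23, 105.0), ("25/4", 34, 95.0), ("25/4", 23, 100.0),
        ("26/4", 34, 102.0), ("26/4", 23, 98.0), ("27/4", 34, 5000.0)]

QUERY = """
WITH stats AS (
    -- Mean and sample variance (n - 1 denominator, like T-SQL STDEV squared).
    SELECT AVG(Sales) AS mean,
           (SUM(Sales * Sales) - SUM(Sales) * SUM(Sales) / COUNT(*))
               / (COUNT(*) - 1) AS variance
    FROM tbl_sales
)
SELECT AVG(s.Sales), MAX(s.Sales), MIN(s.Sales)
FROM tbl_sales AS s
CROSS JOIN stats
-- |Sales - mean| <= k * stdev, written with squares to avoid needing SQRT.
WHERE (s.Sales - stats.mean) * (s.Sales - stats.mean)
      <= :k * :k * stats.variance
"""

def make_connection():
    """Create an in-memory database holding the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tbl_sales (Date TEXT, StoreNo INTEGER, Sales REAL)")
    conn.executemany("INSERT INTO tbl_sales VALUES (?, ?, ?)", ROWS)
    return conn

def stats_within_k_stdevs(conn, k):
    """Return (avg, max, min) of Sales within k standard deviations."""
    return conn.execute(QUERY, {"k": k}).fetchone()

if __name__ == "__main__":
    print(stats_within_k_stdevs(make_connection(), 2))
```

On SQL Server the same shape should work with STDEV inside the CTE and the plain BETWEEN-style comparison from the two-pass version.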

Another simple option that might work (fairly common in the analysis of scientific data) is to drop the x smallest and x largest values, which works if you have a lot of data to process. You can use ROW_NUMBER to do this in one statement:

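WITH OrderedValues AS
(
    SELECT
        Sales,
        ROW_NUMBER() OVER (ORDER BY Sales) AS RowNumAsc,
        ROW_NUMBER() OVER (ORDER BY Sales DESC) AS RowNumDesc
    FROM tbl_sales
    WHERE ...  -- same filter as the outer query
)
SELECT ...
FROM tbl_sales
WHERE ...
-- Keep only values above the largest of the discarded low values...
AND Sales >
(
    SELECT MAX(Sales)
    FROM OrderedValues
    WHERE RowNumAsc <= @ElementsToDiscard
)
-- ...and below the smallest of the discarded high values.
AND Sales <
(
    SELECT MIN(Sales)
    FROM OrderedValues
    WHERE RowNumDesc <= @ElementsToDiscard
)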
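As a runnable illustration of the trimming idea, here is a minimal sketch, again assuming SQLite (3.25 or later, for window functions) driven from Python's sqlite3 module rather than SQL Server. The sample values and the function name trimmed_stats are hypothetical. It is a slight variant of the query above: it filters on the row numbers directly, so exactly n rows are dropped from each end even when values are tied.

```python
import sqlite3

# Hypothetical sample data with one extreme value at each end.
SALES = [5.0, 90.0, 95.0, 100.0, 105.0, 110.0, 5000.0]

QUERY = """
WITH OrderedValues AS (
    SELECT Sales,
           ROW_NUMBER() OVER (ORDER BY Sales)      AS RowNumAsc,
           ROW_NUMBER() OVER (ORDER BY Sales DESC) AS RowNumDesc
    FROM tbl_sales
)
SELECT AVG(Sales), MAX(Sales), MIN(Sales)
FROM OrderedValues
-- Drop the n lowest-ranked and n highest-ranked rows.
WHERE RowNumAsc > :n AND RowNumDesc > :n
"""

def make_connection():
    """Create an in-memory table holding the sample sales."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tbl_sales (Sales REAL)")
    conn.executemany("INSERT INTO tbl_sales VALUES (?)",
                     [(s,) for s in SALES])
    return conn

def trimmed_stats(conn, n):
    """Return (avg, max, min) after discarding n rows from each end."""
    return conn.execute(QUERY, {"n": n}).fetchone()

if __name__ == "__main__":
    print(trimmed_stats(make_connection(), 1))
```

The strict comparisons in the original query behave differently with ties: any row equal to a discarded value is dropped too, which may or may not be what you want.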

Replace ROW_NUMBER with RANK or DENSE_RANK if you want to discard a certain number of unique values. Both give tied values the same number; DENSE_RANK numbers the distinct values consecutively, while RANK leaves gaps after ties.
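To make the difference between the three ranking functions concrete, here is a small sketch on a made-up four-row table with one tie. It assumes SQLite 3.25 or later through Python's sqlite3 module; the values and the function name ranking_rows are illustrative only.

```python
import sqlite3

QUERY = """
SELECT Sales,
       ROW_NUMBER() OVER (ORDER BY Sales) AS rn,
       RANK()       OVER (ORDER BY Sales) AS rnk,
       DENSE_RANK() OVER (ORDER BY Sales) AS drnk
FROM tbl_sales
ORDER BY rn
"""

def ranking_rows(values):
    """Return (Sales, ROW_NUMBER, RANK, DENSE_RANK) for each value."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tbl_sales (Sales INTEGER)")
    conn.executemany("INSERT INTO tbl_sales VALUES (?)",
                     [(v,) for v in values])
    return conn.execute(QUERY).fetchall()

if __name__ == "__main__":
    # The two 10s tie: ROW_NUMBER separates them, RANK and DENSE_RANK do not.
    for row in ranking_rows([10, 10, 20, 30]):
        print(row)
    # (10, 1, 1, 1)
    # (10, 2, 1, 1)
    # (20, 3, 3, 2)
    # (30, 4, 4, 3)
```

Filtering on DENSE_RANK <= 2 here would discard both 10s and the 20, that is, the two smallest distinct values.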

Beyond these simple tricks you start to get into some pretty heavy statistics. I have to deal with similar kinds of validation, and it's far too much material for a Stack Overflow post. There are a hundred different algorithms that you can tweak in a dozen different ways. I would try to keep it simple if possible!
