按季度和 customer_type 在 mysql 中聚合数据(按中位数) [英] aggregate data (by median) in mysql by quarter and by customer_type

查看:39
本文介绍了按季度和 customer_type 在 mysql 中聚合数据(按中位数)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用 MySQL,我的 SQL 表如下所示:

I am using MySQL, and My SQL table looks like this:

sales_year (INT), sales_month (INT), sales_day (INT), price (float), customer_type (TEXT)

我想知道哪个 sql_query 将按季度汇总价格数据(计算每个季度的中位数价格,以及用于计算中位数的观察次数),并按客户类型分组.

I'd like to know which sql_query would aggregate price data by quarter (compute the median price for each quarter, and how many observations where used to compute the median), and grouped by customer type.

我正在为两个主要步骤而苦苦挣扎:mySQL 似乎不支持中值,以及如何按季度聚合数据 - 似乎按客户类型分组,一旦解决了这两个问题就很容易了.

I am struggling with two major steps: Median does not seem to be supported by mySQL, and also how to aggregate data by quarter - seems grouping by customer type it's very easy once those two are solved.

STRUGGLE - 计算中位数....

STRUGGLE - Computing the median....

例如,我只是尝试创建一个四分之一列,它可以工作,但它计算的是 AVG 而不是中位数:

I for example just tried creating a quarter column and it works but it computes the AVG instead of the median:

    select avg(price) as avg_price, floor(sales_month/3.0+1) as
    sales_quarter, count(*) as n_transactions, sales_year, customer_type
    from mydb.mytable
    group by sales_quarter, sales_year, customer_type;

这个命令工作得很好.但理想情况下,我可以通过 MEDIAN 更改平均值,但 mySQL 没有这样的支持,关于如何更改此代码以使其适用于中值目的的任何建议?

This command works perfectly fine. But ideally I could change avg by MEDIAN but mySQL does not have such support, any suggestions on how to change this code to make it work for median purposes?

注意:我还尝试从这个 站点 中的用户定义函数安装我自己的中值函数,但是C 代码无法在我的 mac os X 上编译.

Note: I also tried installing my own median function from user defined functions in this site but the C code did not compile on my mac os X.

所以输出看起来像这样:

So the output would look like this:

sales_quarter (INT)
sales_year (INT)
median_price (FLOAT)
number_users_used_to_compute_median (INT)
customer_type (TEXT)

推荐答案

哦,把平均值叫做中位数.与您交谈的人通常不会知道其中的区别 (;).

Oh, just call the average the median. People you talk to usually won't know the difference (;).

好吧,说真的,你可以在 MySQL 中做到这一点.有一种使用 group_concat()substring_index() 的方法,但这会导致中间字符串值溢出的风险.相反,枚举值并进行简单的算术运算.为此,您需要一个枚举和一个总数.枚举是:

Okay, seriously, you can do this in MySQL. There is a method using group_concat() and substring_index(), but that runs the risk of overflowing the intermediate string values. Instead, enumerate the values and do simple arithmetic. For this, you need an enumeration and a total. The enumeration is:

  select t.*,
         @rn := if(@q = quarter and @y = @year and @ct = customer_type,
                   @rn + 1,
                   if(@q := quarter, if(@y := @year, if(@ct := customer_type, 1, 1), 1), 1)
                  ) as rn
  from mydb.mytable t cross join
       (select @q := '', @y := '', @ct := '', @rn := 0) vars
  order by sales_quarter, sales_year, customer_type, price;

这是精心制定的.order by 列对应于定义的变量.select 中只有一个语句可以为变量赋值.嵌套的 if() 语句确保每个变量都被设置(使用 andor 可能会导致短路).重要的是要记住,MySQL 不保证 select 中表达式的计算顺序,因此只有一个语句集变量对于确保正确性很重要.

This is carefully formulated. The order by columns correspond to the variables defined. There is only one statement that assigns variables in the select. The nested if() statements ensure that each variable gets set (using an and or or could result in short-circuiting). It is important to remember that MySQL does not guarantee the order of evaluation for expressions in the select, so having only one statement set variables is important to ensure correctness.

现在,获得中位数非常容易.您需要总计数、顺序值 (rn) 和一些算术来处理有偶数个值的情况:

Now, getting the median is pretty easy. You need the total count, the sequential value (rn) and some arithmetic to handle the case where there are an even number of values:

select trn.sales_quarter, trn.sales_year, trn.customer_type, avg(price) as median
from (select t.*,
             @rn := if(@q = quarter and @y = @year and @ct = customer_type,
                       @rn + 1,
                       if(@q := quarter, if(@y := @year, if(@ct := customer_type, 1, 1), 1), 1)
                      ) as rn
      from mydb.mytable t cross join
           (select @q := '', @y := '', @ct := '', @rn := 0) vars
      order by sales_quarter, sales_year, customer_type, price
     ) trn join
     (select sales_quarter, sales_year, customer_type, count(*) as numrows
      from mydb.mytable t
      group by sales_quarter, sales_year, customer_type
     ) s
     on trn.sales_quarter = s.sales_quarter and
        trn.sales_year = s.sales_year and
        trn.customer_type = s.customer_type
where 2*rn in (numrows, numrows - 1, numrows + 1)
group by trn.sales_quarter, trn.sales_year, trn.customer_type;

只是强调最后的平均值不是进行平均值计算.它正在计算中位数.正常的定义是,对于偶数个值,中位数是中间两个值的平均值.where 子句处理偶数和奇数情况.

Just to emphasize that the final average is not doing an average calculation. It is calculating the median. The normal definition is that for an even number of values, the median is the average of the two in the middle. The where clause handles both the even and odd cases.

这篇关于按季度和 customer_type 在 mysql 中聚合数据(按中位数)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆