使用Spark Groupby聚合查找最常见的值和对应的计数 [英] Find Most Common Value and Corresponding Count Using Spark Groupby Aggregates

查看:523
本文介绍了使用Spark Groupby聚合查找最常见的值和对应的计数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用Spark(Scala)数据帧对模式和相应的计数进行分组汇总。

I am trying to use Spark (Scala) dataframes to do groupby aggregates for mode and the corresponding count.

例如,

假设我们有以下数据框:

Suppose we have the following dataframe:

Category   Color   Number   Letter      
1        Red         4        A
1        Yellow      Null     B
3        Green       8        C
2        Blue        Null     A
1        Green       9        A
3        Green       8        B
3        Yellow      Null     C
2        Blue        9        B
3        Blue        8        B
1        Blue        Null     Null
1        Red         7        C
2        Green       Null     C
1        Yellow      7        Null
3        Red         Null     B

现在我们要按类别分组,然后按颜色分组,然后找到分组,非空数的计数,数的总大小,数的均值,n的众数编号,以及相应的模式计数。对于字母,我想要非null的计数以及相应的mode和mode计数(无意,因为这是一个字符串)。

Now we want to group by Category, then Color, and then find the size of the grouping, count of number non-nulls, the total size of number, the mean of number, the mode of number, and the corresponding mode count. For letter I'd like the count of non-nulls and the corresponding mode and mode count (no mean since this is a string).

因此,理想情况下,输出应为:

So the output would ideally be:

Category     Color     CountNumber(Non-Nulls)   Size   MeanNumber  ModeNumber ModeCountNumber   CountLetter(Non-Nulls)  ModeLetter   ModeCountLetter
1            Red       2                        2      5.5         4 (or 7) 
1            Yellow    1                        2      7           7     
1            Green     1                        1      9           9       
1            Blue      1                        1      -           -       
2            Blue      1                        2      9           9      etc 
2            Green     -                        1      -           -       
3            Green     2                        2      8           8       
3            Yellow    -                        1      -           -       
3            Blue      1                        1      8           8       
3            Red       -                        1      -           -       

这对计数很容易,但对于其他所有内容来说却比较棘手。

This is easy to do for the count and mean but more tricky for everything else. Any advice would be appreciated.

谢谢。

推荐答案

到目前为止据我所知-没有简单的计算模式的方法-您必须计算每个值的出现次数,然后将结果与该结果的最大值(每个键)结合起来。其余的计算相当简单:

As far as I know - there's no simple way to compute mode - you have to count the occurrences of each value and then join the result with the maximum (per key) of that result. The rest of the computations are rather straight-forward:

// count occurrences of each number in its category and color
val numberCounts = df.groupBy("Category", "Color", "Number").count().cache()

// compute modes for Number - joining counts with the maximum count per category and color:
val modeNumbers = numberCounts.as("base").join(numberCounts.groupBy("Category", "Color").agg(max("count") as "_max").as("max"),
  $"base.Category" === $"max.Category" and
  $"base.Color" === $"max.Color" and
  $"base.count" === $"max._max")
  .select($"base.Category", $"base.Color", $"base.Number", $"_max")
  .groupBy("Category", "Color")
  .agg(first($"Number", ignoreNulls = true) as "ModeNumber", first("_max") as "ModeCountNumber")
  .where($"ModeNumber".isNotNull)

// now compute Size, Count and Mean (simple) and join to add Mode:
val result = df.groupBy("Category", "Color").agg(
  count("Color") as "Size", // counting a key column -> includes nulls
  count("Number") as "CountNumber", // does not include nulls
  mean("Number") as "MeanNumber"
).join(modeNumbers, Seq("Category", "Color"), "left")

result.show()
// +--------+------+----+-----------+----------+----------+---------------+
// |Category| Color|Size|CountNumber|MeanNumber|ModeNumber|ModeCountNumber|
// +--------+------+----+-----------+----------+----------+---------------+
// |       3|Yellow|   1|          0|      null|      null|           null|
// |       1| Green|   1|          1|       9.0|         9|              1|
// |       1|   Red|   2|          2|       5.5|         7|              1|
// |       2| Green|   1|          0|      null|      null|           null|
// |       3|  Blue|   1|          1|       8.0|         8|              1|
// |       1|Yellow|   2|          1|       7.0|         7|              1|
// |       2|  Blue|   2|          1|       9.0|         9|              1|
// |       3| Green|   2|          2|       8.0|         8|              2|
// |       1|  Blue|   1|          0|      null|      null|           null|
// |       3|   Red|   1|          0|      null|      null|           null|
// +--------+------+----+-----------+----------+----------+---------------+

您可以想象-这可能很慢,因为它有4个 groupBy s和两个 join s-所有都需要改组...

As you can imagine - this might be slow, as it has 4 groupBys and two joins - all requiring shuffles...

关于 Letter 列的统计信息-恐怕您'将不得不对该列分别重复此操作并添加另一个联接。

As for the Letter column statistics - I'm afraid you'll have to repeat this for that column separately and add another join.

这篇关于使用Spark Groupby聚合查找最常见的值和对应的计数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆