Spark - Group by Key then Count by Value


Problem description

I have an RDD of Array[String]:

val kvPairs = myRdd.map(line => (line(0), line(1)))

This produces data of the format:

1, A
1, A
1, B
2, C

I would like to group all of the keys by their values and provide the counts for those values, like so:

1, {(A, 2), (B, 1)}
2, {(C, 1)}

I have tried many different approaches, but the closest I can get is something like this:

kvPairs.sortByKey().countByValue()

which gives:

1, (A, 2)
1, (B, 1)
2, (C, 1)

kvPairs.groupByKey().sortByKey()

provides the values, but it still isn't quite there:

1, {(A, A, B)}
2, {(C)}

I tried combining the two:

kvPairs.countByValue().groupByKey().sortByKey()

But this returns an error:

error: value groupByKey is not a member of scala.collection.Map[(String, String),Long]
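The error happens because countByValue is an action, not a transformation: it collects the counts back to the driver as a plain scala.collection.Map, which has no RDD methods such as groupByKey. A minimal sketch with plain Scala collections (sample data is assumed, not from the original post) shows the type you end up with:

```scala
// Local stand-in for the RDD pairs in the question (hypothetical sample data).
val kvPairs = Seq(("1", "A"), ("1", "A"), ("1", "B"), ("2", "C"))

// Spark's countByValue ships results back to the driver as a plain
// Map[(String, String), Long] -- roughly equivalent to:
val counted: Map[(String, String), Long] =
  kvPairs.groupBy(identity).map { case (pair, occs) => (pair, occs.size.toLong) }

// counted is an ordinary Scala Map, no longer an RDD, so calling
// groupByKey on it fails to compile -- exactly what the error says.
```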

Answer

Just count the pairs directly and group afterwards (if you have to):

kvPairs.map((_, 1L))
  .reduceByKey(_ + _)
  .map{ case ((k, v), cnt) => (k, (v, cnt)) }
  .groupByKey
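The same pipeline can be sketched with plain Scala collections to check the expected shape of the result (sample data is assumed; groupBy/size stand in for reduceByKey on a real RDD):

```scala
// Hypothetical sample data matching the question.
val kvPairs = Seq(("1", "A"), ("1", "A"), ("1", "B"), ("2", "C"))

val result: Map[String, Map[String, Long]] =
  kvPairs
    .groupBy(identity).toSeq                                    // like map((_, 1L)).reduceByKey(_ + _)
    .map { case ((k, v), occs) => (k, (v, occs.size.toLong)) }  // ((k, v), cnt) => (k, (v, cnt))
    .groupBy { case (k, _) => k }                               // like groupByKey
    .map { case (k, pairs) => (k, pairs.map(_._2).toMap) }
```

This yields 1 -> {(A, 2), (B, 1)} and 2 -> {(C, 1)}, the grouping the question asks for.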

If you want to groupByKey after reducing, you may want to use a custom partitioner which considers only the first element of the key. You can check RDD split and do aggregation on new RDDs for an example implementation.
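The routing rule such a partitioner would implement can be sketched standalone: hash only the first element of the composite (key, value) key, so after reduceByKey every record for one key already sits in the same partition. (A sketch of the getPartition logic only; in real Spark code this would live in a Partitioner subclass.)

```scala
// Map a composite key to a partition index in [0, numPartitions) using
// only the first element, mimicking a custom Partitioner's getPartition.
def partitionFor(compositeKey: (String, String), numPartitions: Int): Int =
  math.abs(compositeKey._1.hashCode % numPartitions)

// Both records for key "1" land in the same partition,
// so the later groupByKey needs no extra shuffle.
val p1 = partitionFor(("1", "A"), 4)
val p2 = partitionFor(("1", "B"), 4)
```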
