spark group multiple rdd items by key


Problem description

I have RDD items like:

(3922774869,10,1)
(3922774869,11,1)
(3922774869,12,2)
(3922774869,13,2)
(1779744180,10,1)
(1779744180,11,1)
(3922774869,14,3)
(3922774869,15,2)
(1779744180,16,1)
(3922774869,12,1)
(3922774869,13,1)
(1779744180,14,1)
(1779744180,15,1)
(1779744180,16,1)
(3922774869,14,2)
(3922774869,15,1)
(1779744180,16,1)
(1779744180,17,1)
(3922774869,16,4)
...

These represent (id, age, count) triples, and I want to group the lines into a dataset in which each line gives the age distribution for one id, like this ((id, age) is unique):

(1779744180, (10,1), (11,1), (12,2), (13,2) ...)
(3922774869, (10,1), (11,1), (12,3), (13,4) ...)

(id, (age, count), (age, count) ...)

Could someone give me a clue?

Recommended answer

You can first reduce by both fields, then use groupByKey:

rdd
  .map { case (id, age, count) => ((id, age), count) }
  .reduceByKey(_ + _)
  .map { case ((id, age), count) => (id, (age, count)) }
  .groupByKey()

This returns an RDD[(Long, Iterable[(Int, Int)])]; for the input above it would contain these two records:

(1779744180,CompactBuffer((16,3), (15,1), (14,1), (11,1), (10,1), (17,1)))
(3922774869,CompactBuffer((11,1), (12,3), (16,4), (13,3), (15,3), (10,1), (14,5)))
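The same reduce-then-group logic can be checked without a Spark cluster by mirroring it with plain Scala collections. This is a minimal sketch, not Spark itself: `groupBy` plus `sum` plays the role of `reduceByKey(_ + _)`, and the second `groupBy` plays the role of `groupByKey` (the object name `AgeDistribution` and the small sample are illustrative, not from the question):

```scala
// Reduce-then-group over plain Scala collections, mirroring the
// reduceByKey + groupByKey pipeline above (no Spark required).
object AgeDistribution {
  def distribution(items: Seq[(Long, Int, Int)]): Map[Long, Map[Int, Int]] =
    items
      .groupBy { case (id, age, _) => (id, age) }  // key by (id, age)
      .map { case ((id, age), rows) =>             // like reduceByKey(_ + _)
        ((id, age), rows.map(_._3).sum)
      }
      .groupBy { case ((id, _), _) => id }         // like groupByKey on id
      .map { case (id, entries) =>
        id -> entries.map { case ((_, age), count) => age -> count }.toMap
      }

  def main(args: Array[String]): Unit = {
    // A few of the sample triples: duplicate (id, age) keys get summed.
    val sample = Seq(
      (3922774869L, 12, 2), (3922774869L, 12, 1),
      (1779744180L, 16, 1), (1779744180L, 16, 1)
    )
    // Per-id age distribution, e.g. 3922774869 -> Map(12 -> 3)
    println(AgeDistribution.distribution(sample))
  }
}
```

Note that on a real RDD of any size, `reduceByKey` before `groupByKey` (as in the answer) matters: it combines counts map-side so only one record per (id, age) is shuffled; for very large groups, `aggregateByKey` building the per-id map directly can avoid the `groupByKey` shuffle of whole groups entirely.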
