火花流按自定义功能分组 [英] Spark streaming group by custom function

查看:48
本文介绍了火花流按自定义功能分组的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有类似下面的输入行

t1, file1, 1, 1, 1
t1, file1, 1, 2, 3
t1, file2, 2, 2, 2, 2
t2, file1, 5, 5, 5
t2, file2, 1, 1, 2, 2

我想实现类似以下行的输出,即对应数字的垂直加法.

and i want to achieve the output like below rows which is a vertical addition of the corresponding numbers.

file1 : [ 1+1+5, 1+2+5, 1+3+5 ]
file2 : [ 2+1, 2+1, 2+2, 2+2 ]

我处于Spark Streaming上下文中,并且在尝试找出按文件名进行聚合的方式时遇到了困难.

I am in a spark streaming context and i am having a hard time trying to figure out the way to aggregate by file name.

似乎我将需要使用以下类似内容,我不确定如何获取正确的语法.任何输入都会有帮助.

It seems like i will need to use something like below, i am not sure how to get to the correct syntax. Any inputs will be helpful.

myDStream.foreachRDD(rdd => rdd.groupBy()) myDStream.foreachRDD(rdd => rdd.aggregate())

我知道如何对给定数字的数组进行垂直求和,但是我不确定如何将该函数提供给聚合器.

I know how to do the vertical sum of array of given numbers, but i am not sure how to feed that function to the aggregator.

def compute_counters(counts : ArrayBuffer[List[Int]]) = {
  counts.toList.transpose.map(_.sum)
}

推荐答案

首先,您需要从逗号分隔的字符串中提取相关的键和值,对其进行解析,然后创建一个包含该键和其列表的元组.使用 的整数InputDStream.map .然后,使用 PairRDDFunctions.reduceByKey 来应用每个密钥的总和:

First, you need to extract the relevant key and values from the comma separated string, parse them, and create a tuple which contains the key, and the list of integers using InputDStream.map. Then, use PairRDDFunctions.reduceByKey to apply the sum per key:

dStream
.map(line => {
  val splitLines = line.split(", ")
  (splitLines(1), splitLines.slice(2, splitLines.length).map(_.toInt))
})
.reduceByKey((first, second) => (first._1, Array(first._2.sum + second._2.sum))
.foreachRDD((key, sum) => println(s"Key: $key, sum: ${sum.head}")

reduce将产生一个(String,Array [Int])的元组,其中字符串包含id(是"test1"还是"test2"),以及一个包含单个数组的数组值,其中包含每个键的总和.

The reduce will yield a tuple of (String, Array[Int]), where the string contains the id (be it "test1" or "test2"), and an array with a single value, containing the sum per key.

这篇关于火花流按自定义功能分组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆