Spark streaming group by custom function
Question
I have input rows like the following:
t1, file1, 1, 1, 1
t1, file1, 1, 2, 3
t1, file2, 2, 2, 2, 2
t2, file1, 5, 5, 5
t2, file2, 1, 1, 2, 2
and I want to achieve output like the rows below, i.e. a vertical (element-wise) addition of the corresponding numbers.
file1 : [ 1+1+5, 1+2+5, 1+3+5 ]
file2 : [ 2+1, 2+1, 2+2, 2+2 ]
I am in a Spark Streaming context and I am having a hard time figuring out how to aggregate by file name.
It seems like I will need to use something like the following, but I am not sure how to get the syntax right. Any input would be helpful.
myDStream.foreachRDD(rdd => rdd.groupBy())
or myDStream.foreachRDD(rdd => rdd.aggregate())
I know how to do the vertical sum of an array of given numbers, but I am not sure how to feed that function to the aggregator.
def compute_counters(counts : ArrayBuffer[List[Int]]) = {
counts.toList.transpose.map(_.sum)
}
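For reference, the transpose-then-sum helper above can be exercised without Spark at all. This is a minimal sketch (object and method names are my own, not from the question) applied to the three file1 rows from the sample input:

```scala
object VerticalSum {
  // Element-wise (vertical) sum of equal-length rows,
  // same logic as compute_counters: transpose, then sum each column
  def verticalSum(rows: List[List[Int]]): List[Int] =
    rows.transpose.map(_.sum)

  def main(args: Array[String]): Unit = {
    // The three file1 rows: t1 contributes (1,1,1) and (1,2,3), t2 contributes (5,5,5)
    val rows = List(List(1, 1, 1), List(1, 2, 3), List(5, 5, 5))
    println(verticalSum(rows)) // List(7, 8, 9)
  }
}
```

This reproduces the expected file1 output [1+1+5, 1+2+5, 1+3+5].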
Answer
First, you need to extract the relevant key and values from the comma-separated string, parse them, and create a tuple containing the key and the list of integers using InputDStream.map. Then, use PairDStreamFunctions.reduceByKey to apply the sum per key:
dStream
  .map(line => {
    val splitLines = line.split(", ")
    // key = file name, value = the remaining fields parsed as Ints
    (splitLines(1), splitLines.drop(2).map(_.toInt))
  })
  .reduceByKey((first, second) => first.zip(second).map { case (a, b) => a + b })
  .foreachRDD(rdd => rdd.foreach {
    case (key, sums) => println(s"Key: $key, sums: ${sums.mkString("[", ", ", "]")}")
  })
The reduce will yield tuples of (String, Array[Int]), where the string contains the id (be it "file1" or "file2"), and the array holds the element-wise sums for that key.
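The per-key logic can be checked without a Spark cluster. The sketch below (plain Scala; the object name and helper names are my own) mimics what map plus reduceByKey do here, using groupBy and reduce over the sample lines from the question:

```scala
object ReduceByKeySketch {
  // Parse a line like "t1, file1, 1, 1, 1" into (fileName, numbers)
  def parse(line: String): (String, Array[Int]) = {
    val parts = line.split(", ")
    (parts(1), parts.drop(2).map(_.toInt))
  }

  // Element-wise addition: the combine function given to reduceByKey
  def combine(a: Array[Int], b: Array[Int]): Array[Int] =
    a.zip(b).map { case (x, y) => x + y }

  // Group parsed lines by file name and reduce each group element-wise
  def sumsByFile(lines: Seq[String]): Map[String, Array[Int]] =
    lines.map(parse)
      .groupBy { case (file, _) => file }
      .map { case (file, pairs) => file -> pairs.map(_._2).reduce(combine) }

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "t1, file1, 1, 1, 1",
      "t1, file1, 1, 2, 3",
      "t1, file2, 2, 2, 2, 2",
      "t2, file1, 5, 5, 5",
      "t2, file2, 1, 1, 2, 2")
    sumsByFile(lines).foreach {
      case (file, sums) => println(s"$file: ${sums.mkString("[", ", ", "]")}")
    }
  }
}
```

On the sample input this produces file1: [7, 8, 9] and file2: [3, 3, 4, 4], matching the expected output in the question. Note that zip-based addition assumes all rows for a given file have the same number of fields, which holds for the sample data.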