How to perform Standard Deviation and Mean operations on a Java Spark RDD?
Question
I have a JavaRDD that looks like this:
[
[A,8]
[B,3]
[C,5]
[A,2]
[B,8]
...
...
]
I want the result to be the mean for each key:
[
[A,5]
[B,5.5]
[C,5]
]
How do I do this using Java RDDs only?
P.S.: I want to avoid the groupBy operation, so I am not using DataFrames.
Recommended answer
Here you go:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.util.StatCounter;
import scala.Tuple2;
import scala.Tuple3;

import java.util.Arrays;
import java.util.List;

public class AggregateByKeyStatCounter {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("AggregateByKeyStatCounter").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        List<Tuple2<String, Integer>> myList = Arrays.asList(
                new Tuple2<>("A", 8), new Tuple2<>("B", 3), new Tuple2<>("C", 5),
                new Tuple2<>("A", 2), new Tuple2<>("B", 8));

        JavaRDD<Tuple2<String, Integer>> data = sc.parallelize(myList);
        JavaPairRDD<String, Integer> pairs = JavaPairRDD.fromJavaRDD(data);

        /* aggregateByKey folds each key's values into a StatCounter, so no groupBy
           is needed. Besides mean and stdev, the StatCounter also exposes other
           statistics such as count, sum, min, max, and variance. */
        JavaRDD<Tuple3<String, Double, Double>> output = pairs
                .aggregateByKey(
                        new StatCounter(),    // zero value: an empty accumulator per key
                        StatCounter::merge,   // fold one value into the accumulator
                        StatCounter::merge)   // combine accumulators across partitions
                // Each result is (key, standard deviation, mean).
                .map(x -> new Tuple3<String, Double, Double>(x._1(), x._2().stdev(), x._2().mean()));

        output.collect().forEach(System.out::println);

        sc.close();  // release the local Spark context
    }
}
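To make it clearer why aggregateByKey can produce the mean and standard deviation without grouping all values per key first, here is a minimal plain-Java sketch of a mergeable running-statistics accumulator. It needs no Spark dependency. The names KeyedStatsSketch, Stats, and aggregate are hypothetical and exist only for illustration; this is an assumption about the general technique (a running count, mean, and sum of squared deviations), not a copy of Spark's StatCounter source.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class KeyedStatsSketch {

    /** Hypothetical stand-in for an accumulator in the spirit of StatCounter. */
    static final class Stats {
        long n = 0;         // number of values seen
        double mean = 0.0;  // running mean
        double m2 = 0.0;    // running sum of squared deviations from the mean

        /** Folds a single value in (the role played by the first merge in aggregateByKey). */
        Stats add(double x) {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
            return this;
        }

        /** Combines two partial results (the role played by the second merge). */
        Stats merge(Stats other) {
            if (other.n == 0) {
                return this;
            }
            if (n == 0) {
                n = other.n;
                mean = other.mean;
                m2 = other.m2;
                return this;
            }
            long total = n + other.n;
            double delta = other.mean - mean;
            m2 += other.m2 + delta * delta * n * other.n / total;
            mean += delta * other.n / total;
            n = total;
            return this;
        }

        double mean() {
            return mean;
        }

        /** Population standard deviation (divides by n, not n - 1). */
        double stdev() {
            return n == 0 ? Double.NaN : Math.sqrt(m2 / n);
        }
    }

    /** Keeps one accumulator per key; keys[i] is paired with values[i]. */
    static Map<String, Stats> aggregate(String[] keys, int[] values) {
        Map<String, Stats> result = new LinkedHashMap<>();
        for (int i = 0; i < keys.length; i++) {
            result.computeIfAbsent(keys[i], k -> new Stats()).add(values[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] keys = {"A", "B", "C", "A", "B"};
        int[] values = {8, 3, 5, 2, 8};
        aggregate(keys, values).forEach((key, stats) ->
                System.out.println(key + " mean=" + stats.mean() + " stdev=" + stats.stdev()));
    }
}
```

For the sample data in the question, this prints a mean of 5.0 for A, 5.5 for B, and 5.0 for C, matching the expected result. Because each accumulator is only a few numbers and two accumulators can be merged, each partition can summarize its own values before anything is shuffled. Note that StatCounter's stdev() is the population standard deviation; use sampleStdev() if the sample version is wanted.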