How to find out items in each partition after repartition in Java Spark
Problem description
I have a Java ArrayList with a few Integer values, from which I created a Dataset. I used System.out.println(DF.javaRDD().getNumPartitions()); and it reported 1 partition. I want to divide the data into 3 partitions, so I used repartition(). After repartitioning, I want to find out the number of items in each partition.
In Scala it is straightforward:
DF.repartition(3).mapPartitions((it) => Iterator(it.length));
But the same syntax does not work in Java, since java.util.Iterator has no length method.
How should the mapPartitions function be used here?
mapPartitions(FlatMapFunction<java.util.Iterator<T>,U> f)
What arguments does the inner function take, and what is its return type?
import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.*;

SparkSession sessn = SparkSession.builder().appName("RDD to DF").master("local").getOrCreate();
List<Integer> lst = Arrays.asList(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20);
Dataset<Integer> DF = sessn.createDataset(lst, Encoders.INT());
System.out.println(DF.javaRDD().getNumPartitions());
Recommended answer
Try this:
import java.util.Arrays;
import java.util.List;
import com.google.common.collect.ImmutableList;
import scala.collection.JavaConverters;
import org.apache.spark.api.java.function.MapPartitionsFunction;
import org.apache.spark.sql.*;

List<Integer> lst = Arrays.asList(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20);
Dataset<Integer> DF = spark.createDataset(lst, Encoders.INT());
System.out.println(DF.javaRDD().getNumPartitions());

// Convert the java.util.Iterator to a Scala Iterator, which does have length()
MapPartitionsFunction<Integer, Integer> f =
    it -> ImmutableList.of(JavaConverters.asScalaIteratorConverter(it).asScala().length()).iterator();

DF.repartition(3).mapPartitions(f, Encoders.INT()).show(false);
/**
 * 2      <- output of getNumPartitions()
 * +-----+
 * |value|
 * +-----+
 * |6    |
 * |8    |
 * |6    |
 * +-----+
 */
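The Scala-converter detour is not strictly required: a plain while loop over the java.util.Iterator produces the same per-partition count without the Guava or JavaConverters dependencies. Below is a minimal, Spark-free sketch of that counting function (the class and method names are illustrative); it has the same shape as a MapPartitionsFunction<Integer, Integer> body, consuming the partition's iterator and emitting a single-element iterator holding the count.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;

public class CountPerPartition {
    // Mirrors a MapPartitionsFunction<Integer, Integer> body: consume the
    // partition's iterator, then emit a one-element iterator with the count.
    static Iterator<Integer> count(Iterator<Integer> it) {
        int n = 0;
        while (it.hasNext()) {
            it.next();
            n++;
        }
        return Collections.singletonList(n).iterator();
    }

    public static void main(String[] args) {
        // Simulate one partition holding four rows.
        System.out.println(count(Arrays.asList(1, 2, 3, 4).iterator()).next()); // prints 4
    }
}
```

Inside Spark the same logic can be passed directly as the lambda, e.g. `DF.repartition(3).mapPartitions(it -> { int n = 0; while (it.hasNext()) { it.next(); n++; } return java.util.Collections.singletonList(n).iterator(); }, Encoders.INT()).show(false);`.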