How to access Spark RDD Array of elements based on index
Problem description
I have an RDD holding an array of elements like the one below. Each element can be treated as a tuple. I want to access only the 4th element of the first two tuples, and then loop through the RDD in the same way.

    Array[(Int, String, String, Int)] = Array(
      (1,Tom,AAA,2000), (2,Tim,AAA,3000), (3,Mark,BBB,6000),
      (4,Jim,BBB,6000), (5,James,CCC,4000))

I want to first take the 4th element of tuple 1 (2000) and the 4th element of tuple 2 (3000), run some condition on them, and then do the same for tuple 2 and tuple 3, and so on. Basically, I want to loop through the RDD.
I can write a for loop and an if statement in Scala, but I don't understand how to do it on top of an RDD, since an RDD doesn't allow parameters.
Thanks, I appreciate any help. I am new to Spark, so I'm still learning.
Solution
"How to access Spark RDD Array of elements based on index"
The answer is simply: don't try. RDDs are not indexed, and depending on the context the order of values can be nondeterministic.
As far as I understand, what you want is simply a map and a sliding window:

    import org.apache.spark.mllib.rdd.RDDFunctions._

    // A dummy function: returns the smaller of the two values in a window
    def doSomething(xs: Array[Int]) = xs match {
      case Array(x1, x2) => if (x1 <= x2) x1 else x2
    }

    val rdd = sc.parallelize(Array(
      (1, "Tom", "AAA", 2000),
      (2, "Tim", "AAA", 3000),
      (3, "Mark", "BBB", 6000),
      (4, "Jim", "BBB", 6000),
      (5, "James", "CCC", 4000)))

    // Keep the 4th field, pair up neighbouring values, apply the function to each pair
    rdd.map(_._4).sliding(2).map(doSomething)
The above of course assumes that the order of values is defined, or in other words that the ancestor lineage doesn't include shuffled RDDs.
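If the order is not already guaranteed, one option is to impose it explicitly before sliding. The following is only a minimal sketch under several assumptions: the first tuple field is taken to define the intended order, spark-mllib is on the classpath, and Spark runs in local mode. The object name SlidingPairs and the helper pairwiseMin are hypothetical names introduced for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.rdd.RDDFunctions._

object SlidingPairs {
  type Row = (Int, String, String, Int)

  // Hypothetical helper: for each pair of consecutive rows, keep the smaller
  // 4th field. Rows are first sorted by their 1st field, which is an
  // assumption about what "consecutive" should mean here.
  def pairwiseMin(rows: RDD[Row]): RDD[Int] =
    rows
      .sortBy(_._1)   // make the order explicit instead of relying on lineage
      .map(_._4)      // keep only the 4th field
      .sliding(2)     // windows of two neighbouring values
      .map { case Array(x1, x2) => math.min(x1, x2) }

  def main(args: Array[String]): Unit = {
    // Local mode is assumed only so the sketch can run without a cluster
    val conf = new SparkConf().setAppName("sliding-pairs").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq(
        (1, "Tom", "AAA", 2000),
        (2, "Tim", "AAA", 3000),
        (3, "Mark", "BBB", 6000),
        (4, "Jim", "BBB", 6000),
        (5, "James", "CCC", 4000)))
      println(pairwiseMin(rdd).collect().mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

With the sample data, the windows are (2000, 3000), (3000, 6000), (6000, 6000) and (6000, 4000), so the program should print 2000, 3000, 6000, 4000. Sorting adds a shuffle, so it is only worth doing when the input order cannot otherwise be relied on.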