How to access Spark RDD Array of elements based on index

Question

I have an RDD containing an array of elements like the one below. Each element can be treated as a tuple. I want to access only the 4th element of the first two tuples, and then loop through the rest of the RDD in the same way.

Array[(Int, String, String, Int)] = Array(
    (1,Tom,AAA,2000), (2,Tim,AAA,3000),
    (3,Mark,BBB,6000), (4,Jim,BBB,6000), (5,James,CCC,4000))

I want to first take the 4th element of tuple 1 (2000) and the 4th element of tuple 2 (3000), run some condition on them, and then do the same for tuple 2 and tuple 3, and so on. Basically, I want to loop through the RDD pair by pair.

I can write a for loop and an if statement in Scala, but I don't understand how to do it on top of an RDD, since an RDD doesn't allow parameters.

Thanks, I appreciate any help. I am new to Spark, so I am still learning.

Solution

How to access Spark RDD Array of elements based on index

The answer is simple: don't try. RDDs are not indexed, and depending on the context, the order of values can be nondeterministic.

As far as I understand, what you want is simply a map and a sliding window:

import org.apache.spark.mllib.rdd.RDDFunctions._

// A dummy function: returns the smaller of the two values in each window
def doSomething(xs: Array[Int]) = xs match {
  case Array(x1, x2) => if (x1 <= x2) x1 else x2
}

val rdd = sc.parallelize(Array(
    (1, "Tom", "AAA", 2000),
    (2, "Tim", "AAA", 3000),
    (3, "Mark", "BBB", 6000),
    (4, "Jim", "BBB", 6000),
    (5, "James", "CCC", 4000)))

rdd.map(_._4).sliding(2).map(doSomething)

The above of course assumes that the order of values is defined, or in other words, that the ancestor lineage doesn't include shuffled RDDs.
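
To see what the sliding window produces without starting Spark, the same steps can be tried on a local Scala collection. This is only a sketch: the `SlidingDemo` object and the `smallerOfPair` helper are hypothetical names introduced here, and the sort on the first field is an assumption about how one might make the order explicit. It uses the standard collection `sliding`, not the MLlib one.

```scala
object SlidingDemo {
  // Local stand-in for the RDD contents; no Spark required.
  val rows = Seq(
    (1, "Tom", "AAA", 2000),
    (2, "Tim", "AAA", 3000),
    (3, "Mark", "BBB", 6000),
    (4, "Jim", "BBB", 6000),
    (5, "James", "CCC", 4000))

  // Same dummy logic as doSomething above, written for Seq instead of Array.
  def smallerOfPair(xs: Seq[Int]): Int = xs match {
    case Seq(x1, x2) => if (x1 <= x2) x1 else x2
    case other =>
      throw new IllegalArgumentException(s"expected a pair, got $other")
  }

  // Make the order explicit by sorting on the id, then keep the 4th field.
  val salaries: Seq[Int] = rows.sortBy(_._1).map(_._4)

  // sliding(2) yields each pair of neighbours:
  // (2000, 3000), (3000, 6000), (6000, 6000), (6000, 4000)
  val results: List[Int] = salaries.sliding(2).map(smallerOfPair).toList
  // results == List(2000, 3000, 6000, 4000)
}
```

Each window holds two adjacent values, so five input rows give four results. That is the tuple 1 and tuple 2, then tuple 2 and tuple 3 pattern described in the question.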
