Apache Spark Method returning an RDD (with Tail Recursion)


Question

An RDD has a lineage and therefore does not exist until an action is performed on it. So, if I have a method which performs numerous transformations on an RDD and returns a transformed RDD, what am I actually returning? Am I returning nothing until that RDD is required for an action? If I cache an RDD in the method, does it persist in the cache? I think the answer is that the transformations only run when an action is called on the returned RDD, but I could be wrong.

An extension to this question: suppose I have a tail-recursive method that takes an RDD as a parameter and returns an RDD, and I am caching RDDs within the method:

def method(myRDD : RDD) : RDD = {
   ...
   anRDD.cache
   if(true) return someRDD
   method(someRDD) // tailrec
}

Then, when a tail recursion happens, does it overwrite the previously cached RDD anRDD, or do both persist? I'd imagine both persist. I am seeing data spilled to disk even though the dataset I'm using is just 63 MB, and I think it could have something to do with the tail-recursive method.

Recommended answer

The RDD lineage is built as a graph of RDD object instances linked together, where every node in the lineage holds a reference to its dependencies. In its simplest form, a chain, you could see it as a linked list:

hadoopRDD(location) <-depends- filteredRDD(f:A->Boolean) <-depends- mappedRDD(f:A->B)

You can see this in the base RDD constructor:

/** Construct an RDD with just a one-to-one dependency on one parent */
  def this(@transient oneParent: RDD[_]) =
    this(oneParent.context , List(new OneToOneDependency(oneParent))) 
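Those dependency references can also be followed from user code. The following is a minimal sketch, assuming a Spark version in which `RDD.dependencies` and `Dependency.rdd` are public (the object name `LineageWalk` and the helper `chain` are hypothetical, introduced only for illustration). It follows only the first dependency of each node, which is enough for a simple one-parent chain like the one above:

```scala
import org.apache.spark.rdd.RDD

object LineageWalk {
  // Hypothetical helper: walks a simple chain from the given RDD back to
  // its source by following the first dependency of every node.
  // RDDs with several parents (joins, unions) would need a graph walk.
  def chain(rdd: RDD[_]): List[RDD[_]] =
    rdd.dependencies.headOption match {
      case Some(dep) => rdd :: chain(dep.rdd) // dep.rdd is the parent RDD
      case None      => List(rdd)             // no dependencies: a source RDD
    }
}
```

Calling `LineageWalk.chain(someRdd).foreach(println)` would print one line per node, from the newest RDD back to the source, without triggering any computation.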

To come to the point: In the same way we can recursively build a linked list, we can also build an RDD lineage. The result of the recursive function that acts on RDDs will be a well-defined RDD.

An action will be required to schedule that lineage for execution, and will materialize the computation represented by it, much like one could "walk" a linked list once it has been created.
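To make the linked-list analogy concrete, here is a toy model in plain Scala. It is not Spark's implementation: the names `Lineage`, `Source`, `Mapped`, `Filtered` and `build` are all hypothetical, and the data is a local `List` rather than a distributed dataset. It only sketches the idea that transformations add nodes to a chain while nothing is computed until an action walks it:

```scala
import scala.annotation.tailrec

object LineageDemo {
  // Toy stand-in for an RDD: each node only remembers its parent and a function.
  sealed trait Lineage[A] {
    def collect(): List[A] // "action": walks the chain and computes
    def depth: Int         // number of nodes in the chain
    // "Transformations": wrap this node in a new one, computing nothing.
    def map[B](f: A => B): Lineage[B] = Mapped(this, f)
    def filter(p: A => Boolean): Lineage[A] = Filtered(this, p)
  }
  final case class Source[A](data: () => List[A]) extends Lineage[A] {
    def collect(): List[A] = data()
    def depth: Int = 1
  }
  final case class Mapped[A, B](parent: Lineage[A], f: A => B) extends Lineage[B] {
    def collect(): List[B] = parent.collect().map(f)
    def depth: Int = parent.depth + 1
  }
  final case class Filtered[A](parent: Lineage[A], p: A => Boolean) extends Lineage[A] {
    def collect(): List[A] = parent.collect().filter(p)
    def depth: Int = parent.depth + 1
  }

  // Tail-recursive builder: every call appends one node and returns the new head.
  @tailrec
  def build(l: Lineage[Int], i: Int): Lineage[Int] =
    if (i <= 0) l
    else if (i % 2 == 0) build(l.filter(_ != i), i - 1)
    else build(l.map(_ + i), i - 1)

  def main(args: Array[String]): Unit = {
    val source = Source(() => { println("reading source"); (1 to 10).toList })
    val result = build(source, 4)  // prints nothing: only the chain is built
    println(s"nodes in chain: ${result.depth}")
    println(result.collect())      // "reading source" is printed only here
  }
}
```

What the recursive method returns is therefore a fully defined description of the computation (the head of the chain), not "nothing" and not the computed data.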

Consider this (rather contrived, I must admit) example:

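import org.apache.spark.rdd.RDD

// Trial division; values below 2 are not prime.
def isPrime(n: Int): Boolean =
  n == 2 || (n > 2 && n % 2 != 0 && !(3 to math.sqrt(n).ceil.toInt).exists(x => n % x == 0))

// Walks i down to 2, appending a filter when i is prime and a map otherwise.
def recPrimeFilter(rdd: RDD[Int], i: Int): RDD[Int] =
  if (i <= 1) rdd
  else if (isPrime(i)) recPrimeFilter(rdd.filter(x => x != i), i - 1)
  else recPrimeFilter(rdd.map(x => x + i), i - 1)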

When applied to an RDD of ints, we can observe the resulting lineage, with filters and maps interleaved according to where the prime numbers fall:

val rdd = sc.parallelize(1 to 100)
val res = recPrimeFilter(rdd,15)
scala> res.toDebugString
res3: String = 
(8) FilteredRDD[54] at filter at <console>:18 []
 |  FilteredRDD[53] at filter at <console>:18 []
 |  MappedRDD[52] at map at <console>:18 []
 |  FilteredRDD[51] at filter at <console>:18 []
 |  MappedRDD[50] at map at <console>:18 []
 |  FilteredRDD[49] at filter at <console>:18 []
 |  MappedRDD[48] at map at <console>:18 []
 |  MappedRDD[47] at map at <console>:18 []
 |  MappedRDD[46] at map at <console>:18 []
 |  FilteredRDD[45] at filter at <console>:18 []
 |  MappedRDD[44] at map at <console>:18 []
 |  FilteredRDD[43] at filter at <console>:18 []
 |  MappedRDD[42] at map at <console>:18 []
 |  MappedRDD[41] at map at <console>:18 []
 |  ParallelCollectionRDD[33] at parallelize at <console>:13 []

'cache' marks a point in the lineage: the RDD at that point "remembers" its contents the first time it is computed, so that all dependent RDDs further along the lineage can reuse the cached data instead of recomputing it. The lineage itself is kept, so lost partitions can still be recomputed. In the basic case of a linear RDD lineage evaluated by a single action, caching has no effect at all, because each node is visited only once.

Caching, in this case, could make sense if the recursive RDD construction process creates a graph or tree-like structure where actions are called at many different 'leaf' nodes.
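A small sketch of such a branching case follows. It assumes Spark 2.x or later on the classpath (for `SparkContext.longAccumulator`) and a local master; the object name `CacheBranchDemo` and the method `parentEvaluations` are hypothetical. An accumulator counts how often the shared parent's map function runs when two actions are called on two different leaves. Accumulator updates made inside transformations are not guaranteed to be exactly-once if tasks are retried, so the count is illustrative only:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheBranchDemo {
  // Returns how many times the shared parent's map function was invoked
  // while running one action on each of two leaves.
  def parentEvaluations(sc: SparkContext, useCache: Boolean): Long = {
    val evaluations = sc.longAccumulator("parent evaluations")
    val parent = sc.parallelize(1 to 100).map { x => evaluations.add(1); x * 2 }
    if (useCache) parent.cache() // only marks the RDD; nothing runs yet

    // Two leaves that both depend on the same parent.
    val multiplesOfFour = parent.filter(_ % 4 == 0)
    val others = parent.filter(_ % 4 != 0)

    multiplesOfFour.count() // first action: computes the parent
    others.count()          // second action: recomputes it unless cached
    evaluations.value
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("cache-branch-demo")
    val sc = new SparkContext(conf)
    try {
      println(s"without cache: ${parentEvaluations(sc, useCache = false)}")
      println(s"with cache:    ${parentEvaluations(sc, useCache = true)}")
    } finally sc.stop()
  }
}
```

Without the cache, each action walks the lineage back to the source, so the parent is evaluated once per element per action. With the cache, the second action is expected to read the stored partitions instead.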
