Spark + Scala transformations, immutability & memory consumption overheads


Problem description

I have gone through some videos on YouTube regarding the Spark architecture.

Even though lazy evaluation, resilience of data in case of failures, and good functional programming concepts are reasons for the success of Resilient Distributed Datasets, one worrying factor is the memory overhead that multiple transformations incur because of data immutability.

If I understand the concept correctly, every transformation creates a new dataset, so the memory requirement grows by that many times. If I use 10 transformations in my code, 10 datasets will be created and my memory consumption will increase 10-fold.

For example:

val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

The example above has three transformations: flatMap, map and reduceByKey. Does this imply that I need 3X the memory for data of size X?

Is my understanding correct? Is caching the RDD the only solution to address this issue?

Once I start caching, the cache may spill to disk because of its size, and performance will suffer from the disk I/O operations. In that case, are Hadoop and Spark comparable in performance?

Edit:

From the answers and comments, I have understood lazy initialization and the pipelining process. My assumption of 3X memory, where X is the initial RDD size, is not accurate.

But is it possible to cache 1X the RDD in memory and update it over the pipeline? How does cache() work?

Answer

First off, lazy execution means that functional composition can occur:

scala> val rdd = sc.makeRDD(List("This is a test", "This is another test", 
                                 "And yet another test"), 1)
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[70] at makeRDD at <console>:27

scala> val counts = rdd.flatMap(line => {println(line);line.split(" ")}).
     | map(word => {println(word);(word,1)}).
     | reduceByKey((x,y) => {println(s"$x+$y");x+y}).
     | collect
This is a test
This
is
a
test
This is another test
This
1+1
is
1+1
another
test
1+1
And yet another test
And
yet
another
1+1
test
2+1
counts: Array[(String, Int)] = Array((And,1), (is,2), (another,2), (a,1), (This,2), (yet,1), (test,3))

First, note that I forced the parallelism down to 1 so that we can see how this looks on a single worker. Then I added a println to each of the transformations so that we can see how the workflow moves. You can see that it processes a line, then processes that line's output, followed by the reduction. So there are no separate states stored for each transformation, as you suggested. Instead, each piece of data is looped through the entire chain of transformations until a shuffle is needed, as can be seen in the DAG visualization from the UI:
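The per-element pipelining described above can be mimicked with plain Scala Iterators, which are also lazy. This is only an analogy (no Spark involved; `PipelineDemo` is a name made up for this sketch), but it shows why chaining narrow transformations does not materialize one intermediate collection per stage:

```scala
// A sketch of pipelined execution within a single partition: Scala Iterators
// are lazy, so each element flows through ALL chained stages before the next
// element is even read -- one pass, no per-transformation intermediate copy.
object PipelineDemo {
  def run(): (List[String], Map[String, Int]) = {
    val trace = scala.collection.mutable.ListBuffer[String]()
    val lines = Iterator("This is a test", "This is another test")
    val pairs = lines
      .flatMap { line => trace += s"line:$line"; line.split(" ").iterator }
      .map { word => trace += s"word:$word"; (word, 1) }
    // Nothing has executed yet. Forcing the iterator pulls each line through
    // flatMap and map together, as the interleaved trace will show.
    val counts = pairs.toList.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }
    (trace.toList, counts)
  }
}
```

Forcing the result produces the same interleaved trace as the Spark run above: `line:This is a test` is immediately followed by `word:This`, `word:is`, and so on, before the second line is touched.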

That is the win from laziness. As for Spark vs. Hadoop, there is already a lot out there (just Google it), but the gist is that Spark tends to utilize network bandwidth out of the box, giving it a boost right there. Then there are a number of performance improvements gained from laziness, especially if a schema is known and you can make use of the DataFrames API.

So, overall, Spark beats MR hands down in just about every regard.
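On the follow-up question about cache(): calling it does not copy or update anything by itself. It only marks the RDD so that, the first time an action computes it, the resulting partitions are kept for reuse by later actions. A minimal sketch, assuming an existing SparkContext `sc` and the word-count pipeline from the question (this is a fragment, not a standalone program):

```scala
import org.apache.spark.storage.StorageLevel

val counts = sc.textFile("hdfs://...")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .persist(StorageLevel.MEMORY_AND_DISK)  // cache() is shorthand for persist(MEMORY_ONLY)

counts.count()     // first action materializes and stores the partitions
counts.take(5)     // later actions reuse the cached partitions
counts.unpersist() // release the storage when done
```

With MEMORY_AND_DISK, partitions that do not fit in memory spill to disk instead of being recomputed; with the default MEMORY_ONLY, evicted partitions are simply recomputed from the lineage. Note that an RDD is immutable, so a cached RDD cannot be updated in place over the pipeline; each transformation defines a new RDD.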

