Performance vs readability in Spark -- Intermediate calculations


Question

The open bounty about performance vs readability made me wonder: is there any difference between:

// chained
val df: DataFrame = ...
val result = df.filter(...).select(...)

// with an intermediate val (note: `final` is a reserved word in Scala, so the
// intermediate result is named `result` here)
val df: DataFrame = ...
val filtered = df.filter(...)
val result = filtered.select(...)

in terms of performance? The same goes for RDDs -- if I call RDD.filter, assign the result to a val, and then call RDD.map, does that intermediate assignment have a cost?

I'm not sure why, but I have a vague belief that these intermediate assignments cause unnecessary overhead.

Urban myth, or confirmed?

Answer

The only overhead I can think of is creating a val on the driver that points to an RDD. Not sure how many milliseconds Scala needs for that ;-)

First of all, nothing would happen in your example, because you are not making Spark submit any job. A job is submitted only by an action (collect(), count(), etc.); the assignments you show merely declare transformations and do not trigger anything.
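A minimal sketch of this laziness, assuming a local SparkSession (the DataFrame and column names here are made up for illustration, not taken from the question):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("lazy-demo")
  .getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "name")

// Transformations: nothing is executed here. Spark only records the plan,
// whether or not the intermediate step is bound to a val.
val filtered = df.filter($"id" > 1)
val selected = filtered.select($"name")

// Only this action submits a job and runs the whole pipeline.
selected.collect()
```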

Because of this, all RDDs can stay lazy, so nothing needs to be computed at assignment time.

When you submit a job, Spark builds a directed acyclic graph (DAG) of the transformations (filter, map, reduce, groupByKey etc.) that have to be executed to get the job done :-) There are two types of dependencies between them: narrow and wide. Wide dependencies require a data shuffle, and whenever a shuffle is needed Spark splits the work into stages: the first stage holds all tasks before the shuffle, the second stage all tasks after it.

Thanks to this, before running anything Spark can squash a chain of invocations like the one you posted above into a single stage that runs on the same partition (the whole filter -> select pipeline).
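One way to see the stage split is toDebugString, which prints an RDD's lineage with indentation marking shuffle boundaries. This is only a sketch, assuming a live SparkSession named `spark`; the data is invented:

```scala
val rdd = spark.sparkContext.parallelize(1 to 10)

// Narrow transformations: pipelined into one stage, run per partition,
// regardless of how many vals the chain is split across.
val narrow = rdd.filter(_ % 2 == 0).map(_ * 10)

// groupByKey introduces a wide (shuffle) dependency, so Spark cuts the
// lineage into two stages here.
val wide = narrow.map(x => (x % 3, x)).groupByKey()

// Indentation in the output marks the stage (shuffle) boundary.
println(wide.toDebugString)
```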

val df: DataFrame = ...
val filtered = df.filter(...)
val result = filtered.select(...)
result.collect()

Internally, Spark will create a lineage for the final RDD and figure out that filter() and select() can be run together on the same partition (it will also check for cached data along the way). The lineage, and therefore the work done by that collect(), should be the same in both cases.
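You can check this directly with explain(), which prints the physical plan. A sketch, assuming a SparkSession `spark` with its implicits in scope and an invented DataFrame; both versions should print the same plan, because only the lineage matters, not how many vals it passes through:

```scala
import spark.implicits._

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "name")

// Chained version.
val chained = df.filter($"id" > 1).select($"name")

// Intermediate-val version.
val filtered = df.filter($"id" > 1)
val stepwise = filtered.select($"name")

chained.explain()
stepwise.explain()  // should show the same physical plan as above
```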

There are some cool talks about Spark internals, for instance this or this one from Spark Summit 2014.

