Transformation vs Action in the context of Laziness


Question

As mentioned in the book "Learning Spark: Lightning-Fast Big Data Analysis":

Transformations and actions are different because of the way Spark computes RDDs.

From the explanations of laziness I have found, both transformations and actions seem to work lazily. So what does the quoted sentence mean?

Recommended answer

Contrasting the laziness of RDD actions with that of transformations is not necessarily valid.

A more accurate statement is that RDDs themselves are lazily evaluated, viewing an RDD as a collection of data: there is not necessarily any "data" in memory when the RDD instance is created.

This raises the question: when does the RDD's data get loaded into memory? In other words, when does the RDD get evaluated? This is where the distinction between actions and transformations comes in.

Consider the following sequence of code:

Line 1:

rdd = sc.textFile("text-file-path")

Does the RDD exist? Yes.
Is the data loaded in memory? No.
--> RDD evaluation is lazy.

Line 2:

rdd2 = rdd.map(lambda line: line.split())

Does the RDD exist? Yes. In fact, there are now 2 RDDs.
Is the data loaded in memory? No.
--> Still lazy. All Spark does is record how to load the data and how to transform it, remembering the lineage (how each RDD is derived from another).
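The idea of remembering the lineage can be sketched in plain Python. This is only a toy model: `ToyRDD`, its `parent` field, and its `lineage()` method are hypothetical names invented for illustration and are not part of PySpark. The sketch assumes each step simply stores a description of itself and a reference to the RDD it was derived from, without touching any data.

```python
# Toy model only: ToyRDD is a hypothetical class, not part of PySpark.
class ToyRDD:
    def __init__(self, description, parent=None):
        self.description = description  # what this step would do
        self.parent = parent            # the ToyRDD this one derives from

    def map(self, func_name):
        # Record the step and return a new ToyRDD; no data is read here.
        return ToyRDD(f"map({func_name})", parent=self)

    def lineage(self):
        # Walk back through the parents and list the recorded steps in order.
        steps = []
        node = self
        while node is not None:
            steps.append(node.description)
            node = node.parent
        return list(reversed(steps))


rdd = ToyRDD('textFile("text-file-path")')
rdd2 = rdd.map("split")

lineage = rdd2.lineage()
print(lineage)
```

Both objects exist after these two lines, yet the file path has never been opened: the only thing held in memory is the recipe for producing the data.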

Line 3:

print(rdd2.collect())

Does the RDD exist? Yes (still 2 RDDs).
Is the data loaded in memory? Yes.

What's the difference? collect() forces Spark to return the result of the transformations. Spark now carries out everything it recorded in steps #1 and #2 in order to produce the result requested in step #3.
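A rough analogy, using only standard Python rather than Spark, may help here. Python's built-in `map` is itself lazy, so it can stand in for a transformation, while `list()` plays the role of `collect()`. The `calls` list and the `split_line` helper are assumptions added purely to observe when the work happens; Spark's actual execution (partitions, executors, scheduling) is far more involved than this sketch.

```python
calls = []  # records every time the mapping function actually runs


def split_line(line):
    calls.append(line)
    return line.split()


lines = ["to be or", "not to be"]  # stand-in for the contents of the text file

mapped = map(split_line, lines)    # like a transformation: nothing runs yet
calls_before = len(calls)

result = list(mapped)              # like collect(): forces the computation
calls_after = len(calls)

print(calls_before, calls_after, result)
```

The function is never invoked until the result is demanded, which mirrors how the recorded plan sits idle until an action asks for output.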

In Spark's terminology, #2 is a transformation, and #1, which creates the initial RDD, is equally lazy. Transformations return another RDD instance, and that is a hint for recognizing the lazy part.

#3 is an action: an operation that causes the plan recorded by the transformations to be carried out, either to return a result or to perform a final step such as saving the results (that is, saving the actual collection of data once it has been computed).

So, in short, I'd say that RDDs are lazily evaluated, but in my opinion it is incorrect to label the operations themselves (actions or transformations) as lazy or not lazy.
