Difference between RDD.foreach() and RDD.map()
Question
I am learning Spark in Python and was wondering whether anyone can explain the difference between the action foreach() and the transformation map().
rdd.map() returns a new RDD, like the built-in map function in Python. However, I would like to see an example of rdd.foreach() and understand how the two differ. Thanks!
Recommended answer
A very simple example is rdd.foreach(print), which prints the value of each row in the RDD but does not modify the RDD in any way.
By contrast, this map call produces an RDD containing the numbers 1 to 10:
>>> rdd = sc.parallelize(range(0, 10)).map(lambda x: x + 1)
>>> rdd.take(10)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
The map call computed a new value for each row and returned it, so I get a new RDD. If I used foreach instead, that would be useless, because foreach doesn't modify the RDD in any way and itself returns None:
>>> rdd = sc.parallelize(range(0, 10)).foreach(lambda x: x + 1)
>>> type(rdd)
<class 'NoneType'>
Conversely, calling map with a function that returns None, such as print, isn't very useful:
>>> rdd = sc.parallelize(range(0, 10)).map(print)
>>> rdd.take(10)
0
1
2
3
4
5
6
7
8
9
[None, None, None, None, None, None, None, None, None, None]
The print call returns None, so mapping it just gives you a bunch of None values. You didn't want those values and you didn't want to save them, so returning them is a waste. (Note that the lines with 0, 1, 2, etc. are the print calls being executed; they don't show up until you call take, because the RDD is evaluated lazily. The contents of the RDD, however, are just a bunch of None values.)
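The lazy behaviour can be tried without a cluster, because Python 3's built-in map is lazy as well. The snippet below is only a rough local analogy, not Spark code: the helper noisy and the list printed are made-up names, and appending to a list stands in for print so the side effect can be inspected.

```python
printed = []

def noisy(x):
    # Side effect standing in for print(); like print, it returns None
    printed.append(x)

lazy = map(noisy, range(3))   # builds an iterator; noisy() has not run yet
before = list(printed)        # snapshot taken before forcing: still empty
values = list(lazy)           # forcing the iterator runs noisy() on each element
```

After the last line, printed holds the three inputs while values holds only None entries, which mirrors what take(10) showed above.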
More simply, call map if you care about the return value of the function. Call foreach if you don't.
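To make that rule concrete, here is a minimal pure-Python sketch of the two call shapes. ToyRDD is a hypothetical stand-in written for this answer; it is not part of PySpark and says nothing about how Spark is implemented. It only mimics the behaviour described above: map hands back a new object without running anything, while foreach runs immediately and returns None.

```python
from itertools import islice


class ToyRDD:
    """Hypothetical stand-in for an RDD (illustration only, not PySpark)."""

    def __init__(self, compute):
        # compute: zero-argument callable that yields the elements on demand
        self._compute = compute

    @classmethod
    def parallelize(cls, data):
        items = list(data)
        return cls(lambda: iter(items))

    def map(self, f):
        # Transformation: nothing runs here; a new ToyRDD describes the work
        return ToyRDD(lambda: (f(x) for x in self._compute()))

    def foreach(self, f):
        # Action: run f on every element now and discard what f returns
        for x in self._compute():
            f(x)

    def take(self, n):
        # Action: materialize at most n elements as a list
        return list(islice(self._compute(), n))


seen = []
base = ToyRDD.parallelize(range(10))
first = base.map(lambda x: x + 1).take(10)   # results are kept
nothing = base.foreach(seen.append)          # results are thrown away
```

Here first is a list of the mapped values, nothing is None, and the only trace of the foreach call is the side effect recorded in seen.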