Difference between RDD.foreach() and RDD.map()


Question

I am learning Spark in Python. Can anyone explain the difference between the action foreach() and the transformation map()?

rdd.map() returns a new RDD, like the built-in map function in Python. However, I would like to see an example of rdd.foreach() and understand how the two differ. Thanks!

Recommended answer

A very simple example is rdd.foreach(print), which prints the value of each row in the RDD but does not modify the RDD in any way. (When running locally the output appears in your console; on a cluster it goes to the executors' stdout instead.)

For example, this produces an RDD containing the numbers 1 through 10:

>>> rdd = sc.parallelize(range(0, 10)).map(lambda x: x + 1)
>>> rdd.take(10)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

The map call computed a new value for each row and returned it, so the result is a new RDD. Using foreach here would be useless, because foreach discards the function's return values, does not modify the RDD, and itself returns None:

>>> rdd = sc.parallelize(range(0, 10)).foreach(lambda x: x + 1)
>>> type(rdd)
<class 'NoneType'>

Conversely, calling map with a function that returns None, such as print, is not very useful:

>>> rdd = sc.parallelize(range(0, 10)).map(print)
>>> rdd.take(10)
0
1
2
3
4
5
6
7
8
9
[None, None, None, None, None, None, None, None, None, None]

The print call returns None, so mapping it just gives you an RDD full of None values. You did not want those values and have no reason to keep them, so returning them is wasteful. (Note that the lines with 0, 1, 2, and so on are the output of print being executed; they do not appear until you call take, because the RDD is evaluated lazily. The contents of the RDD itself are just a series of None values.)
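The lazy behaviour described above can be illustrated without a Spark cluster. The following is only a plain-Python analogy, not Spark code: Python 3's built-in map is also lazy, so the function does not run until the result is consumed. The names record and calls are hypothetical, introduced just for this sketch, with record standing in for print.

```python
# Plain-Python analogy (no Spark involved): the built-in map is lazy as well.
calls = []  # records every time the function actually runs


def record(x):
    """Stand-in for print: has a side effect and returns None."""
    calls.append(x)
    return None


lazy = map(record, range(3))   # builds a lazy iterator; record has not run yet
calls_before = list(calls)     # snapshot taken before anything is consumed

results = list(lazy)           # consuming the iterator triggers the calls
calls_after = list(calls)      # snapshot taken after consumption
```

Here calls_before is empty, while calls_after holds every input and results holds only None values, which mirrors why the printed lines in the Spark session appear only once take is called.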

More simply, call map if you care about the return value of the function. Call foreach if you don't.
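As a rough sketch of that rule of thumb, here is the same contrast in plain Python rather than Spark. The foreach helper below is hypothetical, written only to mimic the shape of rdd.foreach(); it is not part of PySpark or the standard library.

```python
def foreach(func, items):
    """Hypothetical helper: call func on every item for its side effects only."""
    for item in items:
        func(item)
    # No return statement, so the caller gets None, much like rdd.foreach().


numbers = list(range(10))

# The return values matter: use map and keep the results.
incremented = list(map(lambda x: x + 1, numbers))

# Only the side effect matters: use foreach and ignore its (None) result.
seen = []
outcome = foreach(seen.append, numbers)
```

The map call yields a new collection of computed values, while foreach leaves nothing behind except whatever side effects func performed, here the items appended to seen.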
