How can I force Spark to execute code?


Problem description

How can I force Spark to execute a call to map, even if it thinks it does not need to be executed due to its lazy evaluation?

I have tried to put cache() with the map call, but that still doesn't do the trick. My map method actually uploads results to HDFS. So, it's not useless, but Spark thinks it is.
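The situation can be sketched roughly like this (a minimal sketch assuming a spark-shell session where `sc` is predefined; `uploadToHdfs` is a hypothetical helper standing in for the upload logic, not a Spark API):

```scala
// Hypothetical side-effecting helper: writes one record to HDFS.
def uploadToHdfs(record: String): String = {
  // ... upload logic omitted ...
  record
}

val rdd = sc.textFile("hdfs:///input/data.txt")

// map is a transformation: Spark only records it, so nothing is
// uploaded yet, even though the function has a side effect.
val uploaded = rdd.map(record => uploadToHdfs(record))
```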

Recommended answer

Short answer:

To force Spark to execute a transformation, you'll need to require a result. Sometimes a simple count action is sufficient.

TL;DR:

Ok, let's review the RDD operations.

RDDs support two types of operations:

  • transformations - which create a new dataset from an existing one.
  • actions - which return a value to the driver program after running a computation on the dataset.

For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
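A small illustration of the difference (again in spark-shell, where `sc` is available):

```scala
val nums = sc.parallelize(1 to 10)

// Transformation: returns a new RDD; nothing is computed yet.
val squares = nums.map(n => n * n)

// Action: triggers the computation and returns a value to the driver.
val total = squares.reduce(_ + _)   // 385
```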

All transformations in Spark are lazy, in that they do not compute their results right away.

Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
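One way to see this laziness (a sketch, assuming local mode so that executor output shows up in the console):

```scala
val words = sc.parallelize(Seq("a", "b", "c"))

val tagged = words.map { w =>
  println(s"processing $w")   // not printed yet: map is only recorded
  w.toUpperCase
}

tagged.collect()   // only now do the "processing ..." lines appear
```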

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
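For example (a sketch; the non-default storage level is shown commented out, since an RDD's storage level can only be assigned once):

```scala
import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///input/logs.txt")

logs.cache()    // shorthand for persist(StorageLevel.MEMORY_ONLY)
logs.count()    // first action: computes the RDD and caches its partitions
logs.count()    // second action: reads from the cache instead of recomputing

// Disk-backed and replicated variants also exist, e.g.:
// logs.persist(StorageLevel.MEMORY_AND_DISK_2)   // memory + disk, 2 replicas
```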

To force Spark to execute a call to map, you'll need to require a result. Sometimes a count action is sufficient.
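Tying this back to the question's map-that-uploads (reusing the hypothetical `uploadToHdfs` and `rdd` from the sketch above):

```scala
val uploaded = rdd.map(record => uploadToHdfs(record))

// count is an action: it forces every partition to be computed,
// so uploadToHdfs actually runs for each record.
uploaded.count()
```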
