Is there a reason not to use SparkContext.getOrCreate when writing a spark job?

Question

I'm writing Spark Jobs that talk to Cassandra in Datastax.

Sometimes when working through a sequence of steps in a Spark job, it is easier to just get a new RDD rather than join to the old one.

You can do this by calling the SparkContext getOrCreate method.
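For reference, here is roughly what such a call looks like when issued on the driver; the app name below is a placeholder, not something from the original job:

import org.apache.spark.{SparkConf, SparkContext}

// Reuse the SparkContext already active in this JVM, or create one from the
// supplied configuration if none exists yet.
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("my-job"))

// The no-argument variant simply returns the active context
// (or creates one with default settings).
val same = SparkContext.getOrCreate()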

Now sometimes there are concerns inside a Spark job that referring to the SparkContext can pull in a large object (the SparkContext), which is not serializable, and try to distribute it over the network.
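To illustrate that concern, a minimal sketch (toy data and a local master, both assumptions of mine) of what happens when the driver-side context itself is captured in a closure:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[*]"))
val rdd = sc.parallelize(1 to 10)

// The closure captures sc, so Spark attempts to serialize it and fails with
// "org.apache.spark.SparkException: Task not serializable".
rdd.map(x => sc.appName + ":" + x)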

In this case - you're registering a singleton for that JVM, and so it gets around the problem of serialization.

One day my tech lead came to me and said

Don't use SparkContext getOrCreate; you can and should use joins instead.

But he didn't give a reason.

My question is: is there a reason not to use SparkContext.getOrCreate when writing a Spark job?

Answer

TL;DR There are many legitimate applications of the getOrCreate methods, but attempting to find a loophole to perform map-side joins is not one of them.
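For what it's worth, the supported way to get map-side-join behaviour is a broadcast variable created on the driver, not an executor-side context; a rough sketch with invented lookup data:

import org.apache.spark.SparkContext

val sc = SparkContext.getOrCreate()

// Small lookup table, broadcast once from the driver; each task reads it locally.
val lookup = sc.broadcast(Map(1L -> "alice", 2L -> "bob"))

val orders = sc.parallelize(Seq((1L, 9.99), (2L, 42.0), (1L, 5.0)))
val enriched = orders.map { case (userId, amount) =>
  (userId, lookup.value.getOrElse(userId, "unknown"), amount)
}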

In general there is nothing deeply wrong with SparkContext.getOrCreate. The method has its applications, although there are some caveats, most notably:

  • In its simplest form it doesn't allow you to set job-specific properties, and the second variant ((SparkConf) => SparkContext) requires passing SparkConf around, which is hardly an improvement over keeping the SparkContext / SparkSession in scope (see the sketch after this list).
  • It can lead to opaque code with "magic" dependencies. It affects testing strategies and overall code readability.
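A small sketch of the first caveat (the helper names and path are hypothetical): the SparkConf-taking variant still forces you to thread a value through your code, much like passing the context itself would:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Style 1: rely on getOrCreate, but every call site still needs the SparkConf.
def loadEventsViaConf(conf: SparkConf): RDD[String] =
  SparkContext.getOrCreate(conf).textFile("/data/events")

// Style 2: pass the context explicitly - the dependency is visible and easy
// to substitute in tests.
def loadEvents(sc: SparkContext): RDD[String] =
  sc.textFile("/data/events")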

However your question, specifically:

Now sometimes there are concerns inside a Spark job that referring to the SparkContext can pull in a large object (the SparkContext), which is not serializable, and try to distribute it over the network

Don't use SparkContext getOrCreate; you can and should use joins instead

suggests that you're actually using the method in a way it was never intended to be used: by using SparkContext on an executor node.

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

val rdd: RDD[_] = ???

rdd.map(_ => {
  // Anti-pattern: this closure runs on an executor, not on the driver.
  val sc = SparkContext.getOrCreate()
  ...
})

This is definitely something that you shouldn't do.
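By contrast, a sketch of the join-based approach the tech lead was presumably pointing at: derive both RDDs from the single driver-side context and join them, instead of creating a context inside a task (the keys and values here are invented):

import org.apache.spark.SparkContext

val sc = SparkContext.getOrCreate()   // one context, created on the driver

val users  = sc.parallelize(Seq((1L, "alice"), (2L, "bob")))        // (userId, name)
val orders = sc.parallelize(Seq((1L, 9.99), (1L, 5.0), (2L, 42.0))) // (userId, amount)

// Both RDDs belong to the same context, so an ordinary join works and no
// executor-side SparkContext is ever needed.
val userOrders = users.join(orders)   // RDD[(Long, (String, Double))]
userOrders.collect().foreach(println)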

Each Spark application should have one, and only one, SparkContext initialized on the driver, and the Apache Spark developers have gone to great lengths to prevent users from any attempt to use SparkContext outside the driver. It is not because SparkContext is large, or impossible to serialize, but because it is a fundamental feature of Spark's computing model.

As you probably know, computation in Spark is described by a directed acyclic graph of dependencies, which:

  • Describes the processing pipeline in a way that can be translated into actual tasks.
  • Enables graceful recovery in case of task failures.
  • Allows proper resource allocation and ensures lack of cyclic dependencies.

Let's focus on the last part. Since each executor JVM gets its own instance of SparkContext, cyclic dependencies are not an issue - RDDs and Datasets exist only in the scope of their parent context, so you won't be able to access objects belonging to the application driver.

Proper resource allocation is a different thing. Since each SparkContext creates its own Spark application, your "main" process won't be able to account for resources used by the contexts initialized in the tasks. At the same time the cluster manager won't have any indication that the applications are somehow interconnected. This is likely to cause deadlock-like conditions.

It is technically possible to work around this, with careful resource allocation and use of manager-level scheduling pools, or even a separate cluster manager with its own set of resources, but it is not something that Spark is designed for, it is not supported, and overall it would lead to a brittle and convoluted design, where correctness depends on configuration details, the specific cluster manager choice, and overall cluster utilization.
