Is there a reason not to use SparkContext.getOrCreate when writing a Spark job?


Problem Description

I'm writing Spark jobs that talk to Cassandra in Datastax.

Sometimes, when working through a sequence of steps in a Spark job, it is easier to just get a new RDD rather than join to the old one.

You can do this by calling the SparkContext [getOrCreate][1] method.
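
Concretely, the call in its simplest form looks like this (a minimal sketch, not taken from my actual job):

import org.apache.spark.SparkContext

// Returns the SparkContext already registered in this JVM, or creates a new
// one with default configuration if none exists yet.
val sc = SparkContext.getOrCreate()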

Now, sometimes there are concerns inside a Spark job that referring to the SparkContext will capture a large object (the SparkContext itself) which is not serializable, and try to distribute it over the network.

In this case you're registering a singleton for that JVM, so it gets around the problem of serialization.

One day my tech lead came to me and said:

Don't use SparkContext getOrCreate; you can and should use joins instead.

But he didn't give a reason.

My question is: is there a reason not to use SparkContext.getOrCreate when writing a Spark job?

Recommended Answer

TL;DR There are many legitimate applications of the getOrCreate methods, but attempting to find a loophole to perform map-side joins is not one of them.

In general, there is nothing deeply wrong with SparkContext.getOrCreate. The method has its applications, although there are some caveats, most notably:

  • In its simplest form it doesn't allow you to set job-specific properties, and the second variant ((SparkConf) => SparkContext) requires passing a SparkConf around, which is hardly an improvement over keeping the SparkContext / SparkSession in scope (see the sketch after this list).
  • It can lead to opaque code with a "magic" dependency. It affects testing strategies and overall code readability.
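
To make the first caveat concrete, a rough sketch of the two variants (the configuration values are invented for the example):

import org.apache.spark.{SparkConf, SparkContext}

// Variant 1: no arguments. It reuses whatever context already exists, so any
// job-specific properties you would like to set are silently ignored.
val sc1 = SparkContext.getOrCreate()

// Variant 2: pass a SparkConf. The conf is only applied if no context exists
// yet, and you still have to thread the SparkConf through your code, which is
// hardly simpler than passing the SparkContext itself.
val jobConf = new SparkConf().setAppName("example-job")  // illustrative value
val sc2 = SparkContext.getOrCreate(jobConf)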

However, your question, specifically:

Now, sometimes there are concerns inside a Spark job that referring to the SparkContext will capture a large object (the SparkContext itself) which is not serializable, and try to distribute it over the network

Don't use SparkContext getOrCreate; you can and should use joins instead

suggests that you're actually using the method in a way it was never intended to be used, i.e. by using SparkContext on an executor node:

val rdd: RDD[_] = ???

rdd.map(_ => {
  // Anti-pattern: this closure runs on an executor, not on the driver.
  val sc = SparkContext.getOrCreate()
  ...
})

This is definitely something that you shouldn't do.
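
What the advice to "use joins instead" amounts to, sketched under the assumption that the per-record lookup can be expressed as a keyed dataset (the RDD names and key types below are invented for illustration):

import org.apache.spark.rdd.RDD

// Hypothetical keyed datasets standing in for "the old RDD" and the data you
// were tempted to fetch from inside the task.
val left: RDD[(String, Int)]      = ???
val lookup: RDD[(String, String)] = ???

// Instead of reaching for SparkContext inside a transformation, express the
// lookup as a join; Spark plans the shuffle and runs it entirely on executors.
val joined: RDD[(String, (Int, String))] = left.join(lookup)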

Each Spark application should have one, and only one, SparkContext initialized on the driver, and the Apache Spark developers have done a lot to prevent users from any attempt to use a SparkContext outside the driver. It is not because the SparkContext is large, or impossible to serialize, but because it is a fundamental feature of Spark's computing model.
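
For contrast, a minimal sketch of the intended pattern, with one context created on the driver and only serializable data and functions shipped to executors (the names and values are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object DriverOnlyContext {
  def main(args: Array[String]): Unit = {
    // Exactly one SparkContext, created on the driver.
    val sc = SparkContext.getOrCreate(new SparkConf().setAppName("driver-only"))

    val numbers = sc.parallelize(1 to 100)

    // The closure shipped to executors contains only serializable values;
    // it never touches the SparkContext.
    val threshold = 42
    val kept = numbers.filter(_ > threshold).count()

    println(s"kept $kept elements")
    sc.stop()
  }
}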

As you probably know, computation in Spark is described by a directed acyclic graph of dependencies, which:

  • Describes the processing pipeline in a way that can be translated into actual tasks.
  • Enables graceful recovery in case of task failures.
  • Allows proper resource allocation and ensures the absence of cyclic dependencies.

Let's focus on the last part. Since each executor JVM gets its own instance of SparkContext, cyclic dependencies are not an issue: RDDs and Datasets exist only in the scope of their parent context, so you won't be able to refer to objects belonging to the application driver.

Proper resource allocation is a different matter. Since each SparkContext creates its own Spark application, your "main" process won't be able to account for the resources used by the contexts initialized inside tasks. At the same time, the cluster manager won't have any indication that the applications are somehow interconnected. This is likely to cause deadlock-like conditions.

It is technically possible to work around this, with careful resource allocation and the use of manager-level scheduling pools, or even a separate cluster manager with its own set of resources, but it is not something Spark is designed for, it is not supported, and overall it would lead to a brittle and convoluted design in which correctness depends on configuration details, the specific cluster manager chosen, and overall cluster utilization.

