How many SparkSessions can a single application have?

Problem description

I have found that as Spark runs and tables grow in size (through joins), the Spark executors eventually run out of memory and the entire system crashes. Even if I try to write temporary results to Hive tables (on HDFS), the system still doesn't free much memory, and my entire system crashes after about 130 joins.

However, through experimentation, I realized that if I break the problem into smaller pieces, write temporary results to Hive tables, and stop/start the Spark session (and Spark context), then the system's resources are freed. I was able to join over 1,000 columns using this approach.
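
To make the approach concrete, here is a minimal sketch of that stop/start pattern, assuming Hive support; the names runStage, tmp_stage_N and input_N are hypothetical placeholders, not the actual code:

import org.apache.spark.sql.SparkSession

// Hypothetical sketch: each "stage" performs one slice of the join chain and
// persists its intermediate result to a Hive table before the session is stopped.
def runStage(spark: SparkSession, stage: Int): Unit = {
  val previous = spark.table(s"tmp_stage_${stage - 1}")            // result of the previous stage
  val joined   = previous.join(spark.table(s"input_$stage"), "id")
  joined.write.mode("overwrite").saveAsTable(s"tmp_stage_$stage")
}

for (stage <- 1 to 10) {
  val spark = SparkSession.builder().enableHiveSupport().getOrCreate() // fresh session (and context) per stage
  runStage(spark, stage)
  spark.stop()  // releases executors and driver-side state before the next stage
}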

But I can't find any documentation to understand if this is considered a good practice or not (I know you should not acquire multiple sessions at once). Most systems acquire the session in the beginning and close it in the end. I could also break the application into smaller ones, and use a driver like Oozie to schedule these smaller applications on Yarn. But this approach would start and stop the JVM at each stage, which seems a bit heavy-weight.

So my question: is it bad practice to continually start/stop the Spark session to free system resources during the run of a single Spark application?

But can you elaborate on what you mean by a single SparkContext on a single JVM? I was able to call sparkSession.sparkContext().stop(), and also stop the SparkSession. I then created a new SparkSession and used a new SparkContext. No error was thrown.

I was also able to do this in the JavaSparkPi example without any problems.

I have tested this in yarn-client mode and in a local Spark install.

What exactly does stopping the spark context do, and why can you not create a new one once you've stopped one?

Recommended answer

TL;DR You can have as many SparkSessions as needed.

You can have one and only one SparkContext on a single JVM, but the number of SparkSessions is pretty much unbounded.
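
For example, in spark-shell (where spark is the default session), every session created with newSession() shares the JVM's single SparkContext:

// All SparkSessions in a JVM wrap the same SparkContext
scala> val s1 = spark.newSession()
scala> val s2 = spark.newSession()

scala> s1.sparkContext eq spark.sparkContext
res0: Boolean = true

scala> s2.sparkContext eq s1.sparkContext
res1: Boolean = true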

But can you elaborate on what you mean by a single SparkContext on a single JVM?

It means that at any given time in the lifecycle of a Spark application there can be one and only one driver, which in turn means that there is one and only one SparkContext available on that JVM.

The driver of a Spark application is where the SparkContext lives (or, the other way around, the SparkContext defines the driver -- the distinction is pretty blurry).

You can only have one SparkContext at a time. Although you can start and stop it on demand as many times as you want, I remember an issue that said you should not close the SparkContext unless you're done with Spark (which usually happens at the very end of your Spark application).

In other words, have a single SparkContext for the entire lifetime of your Spark application.

There was a similar question What's the difference between SparkSession.sql vs Dataset.sqlContext.sql? about multiple SparkSessions that can shed more light on why you'd want to have two or more sessions.

I was able to call sparkSession.sparkContext().stop(), and also stop the SparkSession.

So?! How does this contradict what I said?! You stopped the only SparkContext available on the JVM. Not a big deal. You could, but that's just one part of "you can have one and only one SparkContext on a single JVM", isn't it?

SparkSession is a mere wrapper around SparkContext to offer Spark SQL's structured/SQL features on top of Spark Core's RDDs.

From the point of view of a Spark SQL developer, the purpose of a SparkSession is to be a namespace for the query entities your queries use, such as tables, views, or functions (whether through DataFrames, Datasets, or SQL), and for Spark SQL properties (which can have different values per SparkSession).

If you'd like to have the same (temporary) table name used for different Datasets, creating two SparkSessions would be what I'd consider the recommended way.
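
Here's a small sketch of that idea, again assuming spark-shell with spark as the default session (the view name nums is just an example):

// Temporary views are scoped to the SparkSession that created them,
// so the same name can refer to different Datasets in different sessions
val s1 = spark.newSession()
val s2 = spark.newSession()

s1.range(5).createOrReplaceTempView("nums")          // ids 0..4
s2.range(100, 105).createOrReplaceTempView("nums")   // ids 100..104

s1.sql("select max(id) from nums").show()   // max is 4
s2.sql("select max(id) from nums").show()   // max is 104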

I've just worked on an example to showcase how whole-stage codegen works in Spark SQL, and created the following code that simply turns the feature off in a new session.

// both where and select operators support whole-stage codegen
// the plan tree (with the operators and expressions) meets the requirements
// That's why the plan has WholeStageCodegenExec inserted
// You can see stars (*) in the output of explain
val q = Seq((1,2,3)).toDF("id", "c0", "c1").where('id === 0).select('c0)
scala> q.explain
== Physical Plan ==
*Project [_2#89 AS c0#93]
+- *Filter (_1#88 = 0)
   +- LocalTableScan [_1#88, _2#89, _3#90]

// Let's break the requirement of having at most spark.sql.codegen.maxFields fields
// I'm creating a brand new SparkSession with one property changed
val newSpark = spark.newSession()
import org.apache.spark.sql.internal.SQLConf.WHOLESTAGE_MAX_NUM_FIELDS
newSpark.sessionState.conf.setConf(WHOLESTAGE_MAX_NUM_FIELDS, 2)

scala> println(newSpark.sessionState.conf.wholeStageMaxNumFields)
2

// Let's see what the initial value is
// Note that I use the spark value (not newSpark)
scala> println(spark.sessionState.conf.wholeStageMaxNumFields)
100

import newSpark.implicits._
// the same query as above but created in SparkSession with WHOLESTAGE_MAX_NUM_FIELDS as 2
val q = Seq((1,2,3)).toDF("id", "c0", "c1").where('id === 0).select('c0)

// Note that there are no stars in the output of explain
// No WholeStageCodegenExec operator in the plan => whole-stage codegen disabled
scala> q.explain
== Physical Plan ==
Project [_2#122 AS c0#126]
+- Filter (_1#121 = 0)
   +- LocalTableScan [_1#121, _2#122, _3#123]

I then created a new SparkSession and used a new SparkContext. No error was thrown.

Again, how does this contradict what I said about a single SparkContext being available? I'm curious.

What exactly does stopping the spark context do, and why can you not create a new one once you've stopped one?

You can no longer use it to run Spark jobs (to process large and distributed datasets), which is pretty much exactly the reason why you use Spark in the first place, isn't it?

Try the following:

  1. Stop the SparkContext
  2. Run any processing using Spark Core's RDD API or Spark SQL's Dataset API

An exception? Right! Remember that you close the "doors" to Spark so how could you have expected to be inside?! :)
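
A minimal illustration of what happens (the exact exception message varies across Spark versions):

val sc = spark.sparkContext
spark.stop()   // stops the one and only SparkContext of this JVM

// Any attempt to run a job now fails, for example:
sc.parallelize(1 to 10).count()
// java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.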
