Use existing SparkSession in POST/batches request

Question
I'm trying to use Livy to remotely submit several Spark jobs. Let's say I want to perform the following spark-submit task remotely (with all the options as such):
spark-submit \
--class com.company.drivers.JumboBatchPipelineDriver \
--conf spark.driver.cores=1 \
--conf spark.driver.memory=1g \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.serializer='org.apache.spark.serializer.KryoSerializer' \
--conf "spark.executor.extraJavaOptions= -XX:+UseG1GC" \
--master yarn \
--deploy-mode cluster \
/home/hadoop/y2k-shubham/jars/jumbo-batch.jar \
\
--start=2012-12-21 \
--end=2012-12-21 \
--pipeline=db-importer \
--run-spiders
NOTE: The options after the JAR (--start, --end, etc.) are specific to my Spark application; I'm using scopt to parse them.
I'm aware that I can supply all the various options in the above spark-submit command using a Livy POST/batches request. But since I have to make over 250 spark-submits remotely, I'd like to exploit Livy's session-management capabilities; i.e., I want Livy to create a SparkSession once and then use it for all my spark-submit requests.

The POST/sessions request allows me to specify quite a few options for instantiating a SparkSession remotely. However, I see no session argument in the POST/batches request.
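For reference, here is roughly what that spark-submit looks like as a single Livy POST/batches call; a minimal sketch, assuming Livy is reachable at http://livy-host:8998 (a hypothetical host) and that --master / --deploy-mode come from Livy's own server configuration rather than the request:

# Sketch: a one-off batch submission via Livy's POST /batches endpoint.
# The payload fields (file, className, args, conf) mirror the spark-submit
# options above; http://livy-host:8998 is a hypothetical host.
import requests

payload = {
    "file": "/home/hadoop/y2k-shubham/jars/jumbo-batch.jar",
    "className": "com.company.drivers.JumboBatchPipelineDriver",
    "conf": {
        "spark.driver.cores": "1",
        "spark.driver.memory": "1g",
        "spark.dynamicAllocation.enabled": "true",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.executor.extraJavaOptions": "-XX:+UseG1GC",
    },
    # application-specific arguments, passed after the JAR
    "args": ["--start=2012-12-21", "--end=2012-12-21",
             "--pipeline=db-importer", "--run-spiders"],
}
resp = requests.post("http://livy-host:8998/batches", json=payload)
print(resp.json())  # contains the batch id and its state

The crux of my problem is that each such request spins up (and tears down) its own SparkSession.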
How can I make use of the SparkSession that I created using a POST/sessions request for submitting my Spark job using a POST/batches request?
I've referred to the following examples, but they only demonstrate supplying (python) code for the Spark job within Livy's POST request.
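(For context, the pattern those examples demonstrate boils down to posting code snippets at an already-running session; a minimal sketch, with a hypothetical host and session id:)

# Sketch: submitting inline python code to an existing Livy session
# via POST /sessions/{sessionId}/statements.
import requests

session_id = 0  # hypothetical: the id returned by an earlier POST /sessions
code = "print(spark.range(10).count())"  # spark/sc are predefined in a pyspark session
resp = requests.post(
    f"http://livy-host:8998/sessions/{session_id}/statements",
    json={"code": code},
)
print(resp.json())  # poll GET /sessions/{id}/statements/{stmt_id} for the output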
Answer

How can I make use of the SparkSession that I created using a POST/sessions request for submitting my Spark job using a POST/batches request?
- At this stage, I'm all but certain that this is not possible right now.
- @Luqman Ghani's comment gives a fairly good hint that batch-mode is intended for a different use-case than session-mode / LivyClient.
The reason I've identified for why this isn't possible (please correct me if I'm wrong / incomplete) is as follows:
POST/batches request accepts a JAR
- This inhibits the SparkSession (or spark-shell) from being re-used (without restarting the SparkSession), because:
  - How would you remove the JAR from the previous POST/batches request?
  - How would you add the JAR from the current POST/batches request?
And here's a more complete picture:
- Actually POST/sessions allows you to pass a JAR,
- but then further interactions with that session (obviously) cannot take JARs,
- they (further interactions) can only be simple scripts (like PySpark: simple python files) that can be loaded into the session (and not JARs).
Possible workaround
- All those who have their Spark application written in Scala / Java, which must be bundled in a JAR, will face this difficulty; Python (PySpark) users are lucky here.
- As a possible workaround, you can try this (I see no reason why it wouldn't work):
  - launch a session with your JAR via a POST/sessions request,
  - then invoke the entrypoint class from your JAR via python (submit POST /sessions/{sessionId}/statements) as many times as you want (with possibly different parameters). While this wouldn't be straight-forward, it sounds very much possible; see the sketch after this list.
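Here is a minimal sketch of that workaround, assuming Livy at http://livy-host:8998 (hypothetical host); it leans on Spark's internal py4j handles (sc._jvm, sc._gateway), which are not public API, so treat it as an outline rather than a definitive implementation:

# Sketch of the workaround: one session carries the JAR; each "job" is a
# statement that calls the JAR's entrypoint class through py4j.
import time
import requests

LIVY = "http://livy-host:8998"  # hypothetical host

# 1. Launch a pyspark session that ships the JAR (POST /sessions).
session = requests.post(f"{LIVY}/sessions", json={
    "kind": "pyspark",
    "jars": ["/home/hadoop/y2k-shubham/jars/jumbo-batch.jar"],
}).json()
session_id = session["id"]

# 2. Wait for the session to become idle before submitting statements.
while requests.get(f"{LIVY}/sessions/{session_id}").json()["state"] != "idle":
    time.sleep(5)

# 3. Each "spark-submit" becomes a statement that invokes the driver's
#    main() via Spark's py4j gateway (sc._jvm / sc._gateway are internal).
code = """
args = ["--start=2012-12-21", "--end=2012-12-21",
        "--pipeline=db-importer", "--run-spiders"]
java_args = sc._gateway.new_array(sc._jvm.java.lang.String, len(args))
for i, a in enumerate(args):
    java_args[i] = a
sc._jvm.com.company.drivers.JumboBatchPipelineDriver.main(java_args)
"""
requests.post(f"{LIVY}/sessions/{session_id}/statements", json={"code": code})

One caveat: main() runs inside the session's driver JVM, so a System.exit() anywhere in the driver class would take down the whole session.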
Finally, I found some more alternatives to Livy for remote spark-submit; see this.