Use existing SparkSession in POST/batches request

Question
I'm trying to use Livy to remotely submit several Spark jobs. Let's say I want to perform the following spark-submit task remotely (with all the options as such):
spark-submit \
--class com.company.drivers.JumboBatchPipelineDriver \
--conf spark.driver.cores=1 \
--conf spark.driver.memory=1g \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.serializer='org.apache.spark.serializer.KryoSerializer' \
--conf "spark.executor.extraJavaOptions= -XX:+UseG1GC" \
--master yarn \
--deploy-mode cluster \
/home/hadoop/y2k-shubham/jars/jumbo-batch.jar \
\
--start=2012-12-21 \
--end=2012-12-21 \
--pipeline=db-importer \
--run-spiders
NOTE: The options after the JAR (--start, --end, etc.) are specific to my Spark application; I'm using scopt to parse them.
I'm aware that I can supply all the various options in the above spark-submit command using a Livy POST/batches request. But since I have to make over 250 spark-submits remotely, I'd like to exploit Livy's session-management capabilities; i.e., I want Livy to create a SparkSession once and then use it for all my spark-submit requests.

The POST/sessions request allows me to specify quite a few options for instantiating a SparkSession remotely. However, I see no session argument in the POST/batches request.
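For reference, here is roughly what that spark-submit looks like as a single Livy POST/batches call; a minimal sketch, assuming Livy is reachable at http://livy-host:8998 (a hypothetical host) and that --master / --deploy-mode come from Livy's own server configuration rather than the request:

# Sketch: a one-off batch submission via Livy's POST /batches endpoint.
# The payload fields (file, className, args, conf) mirror the spark-submit
# options above; http://livy-host:8998 is a hypothetical host.
import requests

payload = {
    "file": "/home/hadoop/y2k-shubham/jars/jumbo-batch.jar",
    "className": "com.company.drivers.JumboBatchPipelineDriver",
    "conf": {
        "spark.driver.cores": "1",
        "spark.driver.memory": "1g",
        "spark.dynamicAllocation.enabled": "true",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.executor.extraJavaOptions": "-XX:+UseG1GC",
    },
    # application-specific arguments, passed after the JAR
    "args": ["--start=2012-12-21", "--end=2012-12-21",
             "--pipeline=db-importer", "--run-spiders"],
}
resp = requests.post("http://livy-host:8998/batches", json=payload)
print(resp.json())  # contains the batch id and its state

The crux of my problem is that each such request spins up (and tears down) its own SparkSession.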
How can I make use of the SparkSession that I created using a POST/sessions request for submitting my Spark job using a POST/batches request?
I've referred to the following examples, but they only demonstrate supplying (python) code for the Spark job within Livy's POST request.
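(For context, the pattern those examples demonstrate boils down to posting code snippets at an already-running session; a minimal sketch, with a hypothetical host and session id:)

# Sketch: submitting inline python code to an existing Livy session
# via POST /sessions/{sessionId}/statements.
import requests

session_id = 0  # hypothetical: the id returned by an earlier POST /sessions
code = "print(spark.range(10).count())"  # spark/sc are predefined in a pyspark session
resp = requests.post(
    f"http://livy-host:8998/sessions/{session_id}/statements",
    json={"code": code},
)
print(resp.json())  # poll GET /sessions/{id}/statements/{stmt_id} for the output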
Answer

How can I make use of the SparkSession that I created using a POST/sessions request for submitting my Spark job using a POST/batches request?
- At this stage, I'm all but certain that this is not possible right now.
- @Luqman Ghani's comment gives a fairly good hint that batch-mode is intended for a different use-case than session-mode / LivyClient.
The reason I've identified for why this isn't possible (please correct me if I'm wrong / incomplete) is as follows:
POST/batches request accepts a JAR
- This inhibits the SparkSession (or spark-shell) from being re-used (without restarting the SparkSession), because:
  - How would you remove the JAR from the previous POST/batches request?
  - How would you add the JAR from the current POST/batches request?
And here's a more complete picture:
- Actually POST/sessions allows you to pass a JAR,
- but then further interactions with that session (obviously) cannot take JARs,
- they (further interactions) can only be simple scripts (like PySpark: simple python files) that can be loaded into the session (and not JARs).
Possible workaround
- All those who have their Spark application written in Scala / Java, which must be bundled in a JAR, will face this difficulty; Python (PySpark) users are lucky here.
- As a possible workaround, you can try this (I see no reason why it wouldn't work):
  - launch a session with your JAR via a POST/sessions request,
  - then invoke the entrypoint class from your JAR via python (submit POST /sessions/{sessionId}/statements) as many times as you want (with possibly different parameters). While this wouldn't be straight-forward, it sounds very much possible; see the sketch after this list.
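Here is a minimal sketch of that workaround, assuming Livy at http://livy-host:8998 (hypothetical host); it leans on Spark's internal py4j handles (sc._jvm, sc._gateway), which are not public API, so treat it as an outline rather than a definitive implementation:

# Sketch of the workaround: one session carries the JAR; each "job" is a
# statement that calls the JAR's entrypoint class through py4j.
import time
import requests

LIVY = "http://livy-host:8998"  # hypothetical host

# 1. Launch a pyspark session that ships the JAR (POST /sessions).
session = requests.post(f"{LIVY}/sessions", json={
    "kind": "pyspark",
    "jars": ["/home/hadoop/y2k-shubham/jars/jumbo-batch.jar"],
}).json()
session_id = session["id"]

# 2. Wait for the session to become idle before submitting statements.
while requests.get(f"{LIVY}/sessions/{session_id}").json()["state"] != "idle":
    time.sleep(5)

# 3. Each "spark-submit" becomes a statement that invokes the driver's
#    main() via Spark's py4j gateway (sc._jvm / sc._gateway are internal).
code = """
args = ["--start=2012-12-21", "--end=2012-12-21",
        "--pipeline=db-importer", "--run-spiders"]
java_args = sc._gateway.new_array(sc._jvm.java.lang.String, len(args))
for i, a in enumerate(args):
    java_args[i] = a
sc._jvm.com.company.drivers.JumboBatchPipelineDriver.main(java_args)
"""
requests.post(f"{LIVY}/sessions/{session_id}/statements", json={"code": code})

One caveat: main() runs inside the session's driver JVM, so a System.exit() anywhere in the driver class would take down the whole session.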
Finally, I found some more alternatives to Livy for remote spark-submit; see this.