Pyspark on yarn-cluster mode


Question

Is there any way to run PySpark scripts in yarn-cluster mode without using the spark-submit script? I need it this way because I will integrate this code into a Django web app.

When I try to run any script in yarn-cluster mode, I get the following error:

org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.

I'm creating the SparkContext in the following way:

        from pyspark import SparkConf, SparkContext

        # Configure the application with yarn-cluster as the master URL
        conf = (SparkConf()
            .setMaster("yarn-cluster")
            .setAppName("DataFrameTest"))

        # Creating the SparkContext here raises the SparkException shown above
        sc = SparkContext(conf=conf)

        # DataFrame code ....

Thanks

Recommended answer

The reason yarn-cluster mode isn't supported is that yarn-cluster means bootstrapping the driver program itself (i.e. the program calling the SparkContext) onto a YARN container. Guessing from your statement about submitting from a Django web app, it sounds like you want the Python code that contains the SparkContext to be embedded in the web app itself, rather than shipping the driver code onto a YARN container which then handles a separate Spark job.

This means your case most closely fits yarn-client mode rather than yarn-cluster; in yarn-client mode, you can run your SparkContext code anywhere (such as inside your web app), while it talks to YARN for the actual mechanics of running jobs.
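
As a rough sketch of that first option (assuming Spark 1.x, where "yarn-client" is a valid master URL; the app name and the helper function below are hypothetical, not taken from the question), the driver can live inside the web-app process like this:

        from pyspark import SparkConf, SparkContext

        # yarn-client keeps the driver in this process (the web app), while
        # YARN only hosts the executors.
        conf = (SparkConf()
            .setMaster("yarn-client")
            .setAppName("DjangoEmbeddedDriver"))  # hypothetical app name

        sc = SparkContext(conf=conf)

        def count_lines(path):
            # Hypothetical helper that a Django view could call directly,
            # since the SparkContext lives in the same process.
            return sc.textFile(path).count()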

Fundamentally, if you're sharing any in-memory state between your web app and your Spark code, that means you won't be able to split off the Spark portion to run inside a YARN container, which is what yarn-cluster tries to do. If you're not sharing state, then you can simply invoke a subprocess which actually calls spark-submit to bundle an independent PySpark job to run in yarn-cluster mode.
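
As a minimal sketch of that second option (the script name below is a placeholder, not a value from the question), the web app can shell out to spark-submit:

        import subprocess

        # Launch an independent PySpark job in yarn-cluster mode; the driver
        # then runs inside a YARN container, not in the web-app process.
        # (Newer Spark versions spell this "--master yarn --deploy-mode cluster".)
        subprocess.check_call([
            "spark-submit",
            "--master", "yarn-cluster",
            "standalone_job.py",  # placeholder: the independent PySpark script
        ])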

To summarize:

  1. If you want to embed your Spark code directly in your web app, you need to use yarn-client mode instead: SparkConf().setMaster("yarn-client"), as in the first sketch above.
  2. If the Spark code is loosely coupled enough that yarn-cluster is actually viable, you can issue a Python subprocess that actually invokes spark-submit in yarn-cluster mode, as in the second sketch above.
