Using spark-submit with python main

Problem Description

Reading this and this makes me think it is possible to have a python file executed by spark-submit, however I couldn't get it to work.

My setup is a bit complicated. I require several different jars to be submitted together with my python files in order for everything to function. My pyspark command, which works, is the following:

# Shell: launch the interactive pyspark shell with the required jars
IPYTHON=1 ./pyspark --jars jar1.jar,/home/local/ANT/bogoyche/dev/rhine_workspace/env/Scala210-1.0/runtime/Scala2.10/scala-library.jar,jar2.jar --driver-class-path jar1.jar:jar2.jar

# Then, inside the pyspark shell (my_module, myfn and myargs are placeholders):
from sys import path
path.append('my_module')
from my_module import myfn
myfn(myargs)

I have packaged my python files inside an egg; the egg contains the main file, which makes the egg executable by calling python myegg.egg.
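
For reference, a minimal setup.py sketch that would build such an egg (the package name my_module and the project layout are assumptions, not from the original post; the egg is directly runnable with python myegg.egg only if it contains a top-level __main__.py):

# setup.py -- minimal sketch; my_module is a placeholder package name
from setuptools import setup, find_packages

setup(
    name='my_module',
    version='0.1',
    packages=find_packages(),  # picks up the my_module/ package directory
)

Running python setup.py bdist_egg then produces the egg file under dist/.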

I am now trying to form my spark-submit command and can't seem to get it right. Here's where I am:

./spark-submit --jars jar1.jar,jar2.jar --py-files path/to/my/egg.egg arg1 arg
Error: Cannot load main class from JAR file:/path/to/pyspark/directory/arg1
Run with --help for usage help or --verbose for debug output

Instead of executing my .egg file, it takes the first argument after the egg, treats it as a jar file, and tries to load a class from it? What am I doing wrong?

Solution

One way is to have the main driver program for your Spark application be a python file (.py) that gets passed to spark-submit. spark-submit treats its first positional argument as the application to run, which is why arg1 above was mistaken for a jar when only an egg was supplied via --py-files. This primary script has the main method that helps the driver identify the entry point, and it is the file that customizes configuration properties and initializes the SparkContext.

What is bundled in the egg are the dependencies, which are shipped to the executor nodes and imported inside the driver program.

You can write a small file to act as the main driver and execute:

./spark-submit --jars jar1.jar,jar2.jar --py-files path/to/my/egg.egg driver.py arg1 arg

The driver program would look something like:

from pyspark import SparkContext, SparkConf
from my_module import myfn  # my_module is the dependency packaged in the egg

if __name__ == '__main__':
    # Build the configuration and create the SparkContext in the driver
    conf = SparkConf().setAppName("app")
    sc = SparkContext(conf=conf)
    myfn(myargs, sc)  # myargs is a placeholder for whatever arguments myfn needs

Pass the SparkContext object as an argument wherever it is needed.
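
For completeness, a hypothetical sketch of what the packaged dependency might look like (my_module, myfn and myargs are the asker's placeholder names; the body below is illustrative only, not from the original post):

# my_module/__init__.py -- hypothetical example
def myfn(args, sc):
    # Reuse the SparkContext created by the driver instead of constructing a new one
    rdd = sc.parallelize(args)
    print(rdd.count())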
