Using spark-submit with python main
Question
Reading this and this makes me think it is possible to have a python file executed by spark-submit; however, I couldn't get it to work.
My setup is a bit complicated. I require several different jars to be submitted together with my python files in order for everything to function. My pyspark command, which works, is the following:
IPYTHON=1 ./pyspark --jars jar1.jar,/home/local/ANT/bogoyche/dev/rhine_workspace/env/Scala210-1.0/runtime/Scala2.10/scala-library.jar,jar2.jar --driver-class-path jar1.jar:jar2.jar
from sys import path
path.append('my-module')
from my_module import myfn  # a hyphen is not valid in a Python module name
myfn(myargs)
I have packaged my python files inside an egg, and the egg contains the main file, which makes the egg executable by calling python myegg.egg.
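An egg is a zip archive under the hood, and Python can execute any zip archive that contains a top-level `__main__.py`, which is what makes `python myegg.egg` work. A minimal sketch of that mechanism (the archive name and message here are illustrative):

```python
import os
import subprocess
import sys
import tempfile
import zipfile

# Build a tiny zip archive with a top-level __main__.py,
# mimicking an executable egg.
tmpdir = tempfile.mkdtemp()
archive = os.path.join(tmpdir, "myegg.egg")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("__main__.py", "print('hello from the egg')")

# Running `python myegg.egg` executes the archive's __main__.py.
result = subprocess.run([sys.executable, archive],
                        capture_output=True, text=True)
print(result.stdout.strip())  # hello from the egg
```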
I am now trying to form my spark-submit command and I can't seem to get it right. Here's where I am:
./spark-submit --jars jar1.jar,jar2.jar --py-files path/to/my/egg.egg arg1 arg
Error: Cannot load main class from JAR file:/path/to/pyspark/directory/arg1
Run with --help for usage help or --verbose for debug output
Instead of executing my .egg file, it takes the first argument after the egg (arg1), considers it a jar file, and tries to load a main class from it. What am I doing wrong?
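For context, spark-submit expects the application file itself (a .py file or a jar) as the first positional argument; everything after it is treated as arguments to that application, and --py-files only lists extra dependencies. Roughly:

```shell
# Usage (from spark-submit --help, roughly):
#   spark-submit [options] <app jar | python file> [app arguments]
# In the failing command above, arg1 falls into the <app jar | python file>
# slot, which is why spark-submit tries to load a main class from it.
```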
Answer
One way is to have the main driver program for your Spark application as a python file (.py) that gets passed to spark-submit. This primary script has the main method that helps the driver identify the entry point. This file customizes the configuration properties as well as initializes the SparkContext.
The files bundled inside the egg are dependencies that are shipped to the executor nodes and imported inside the driver program.
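This works because spark-submit adds each file listed in --py-files to `sys.path` on the driver and executors, and Python can import modules directly from a zip archive. A small sketch of that mechanism (`deps.egg` and `my_module` are made-up names for illustration):

```python
import os
import sys
import tempfile
import zipfile

# Build an egg-like zip archive containing a single module.
tmpdir = tempfile.mkdtemp()
egg = os.path.join(tmpdir, "deps.egg")
with zipfile.ZipFile(egg, "w") as zf:
    zf.writestr("my_module.py", "def myfn():\n    return 'ok'\n")

# spark-submit --py-files does the equivalent of this on every node:
sys.path.insert(0, egg)

import my_module  # resolved from inside the zip archive
print(my_module.myfn())  # ok
```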
You can script a small file as the main driver and execute it:
./spark-submit --jars jar1.jar,jar2.jar --py-files path/to/my/egg.egg driver.py arg1 arg
The driver program will look something like this (swapping the invalid `my-module` name for `my_module`, and forwarding the command-line arguments explicitly):

import sys

from pyspark import SparkContext, SparkConf
from my_module import myfn  # a hyphen is not valid in a Python module name

if __name__ == '__main__':
    conf = SparkConf().setAppName("app")
    sc = SparkContext(conf=conf)
    myfn(sys.argv[1:], sc)  # forwards arg1, arg2, ... to your function
Pass the SparkContext object as an argument wherever necessary.