Using spark-submit with python main


Problem Description

Reading this and this makes me think it is possible to have a python file executed by spark-submit; however, I couldn't get it to work.

My setup is a bit complicated. I require several different jars to be submitted together with my python files in order for everything to function. My pyspark command, which works, is the following:

IPYTHON=1 ./pyspark --jars jar1.jar,/home/local/ANT/bogoyche/dev/rhine_workspace/env/Scala210-1.0/runtime/Scala2.10/scala-library.jar,jar2.jar --driver-class-path jar1.jar:jar2.jar

Then, inside the IPython shell:

from sys import path
path.append('my_module')  # add the directory containing the module to the import path
from my_module import myfn  # 'my_module' is a placeholder module name
myfn(myargs)

I have packaged my python files inside an egg, and the egg contains the main file, which makes the egg executable by calling python myegg.egg.
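For reference, an egg like this is typically built with setuptools; below is a minimal sketch, where the package name and version are hypothetical:

# setup.py -- minimal sketch; name and version are hypothetical
from setuptools import setup, find_packages

setup(
    name='my_module',
    version='0.1',
    packages=find_packages(),
)

Running python setup.py bdist_egg then places the egg under dist/.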

I am now trying to form my spark-submit command, and I can't seem to get it right. Here's where I am:

./spark-submit --jars jar1.jar,jar2.jar --py-files path/to/my/egg.egg arg1 arg
Error: Cannot load main class from JAR file:/path/to/pyspark/directory/arg1
Run with --help for usage help or --verbose for debug output

Instead of executing my .egg file, it takes the first argument after the egg, treats it as a jar file, and tries to load a class from it? What am I doing wrong?

Recommended Answer

One way is to have a main driver program for your Spark application as a python file (.py) that gets passed to spark-submit. spark-submit expects this application file (a .py or .jar) as its first positional argument, which is why the argument following your egg was being treated as a jar. This primary script has the main method that helps the driver identify the entry point, customizes configuration properties, and initializes the SparkContext.

The files bundled in the egg are dependencies that are shipped to the executor nodes and imported inside the driver program.
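For example, here is a sketch of how a helper shipped inside the egg gets used; my_module and parse_record are hypothetical names standing in for your egg's contents, and the input path is made up:

from pyspark import SparkContext, SparkConf
from my_module import parse_record  # resolvable because --py-files puts egg.egg on the path

conf = SparkConf().setAppName("app")
sc = SparkContext(conf=conf)

rdd = sc.textFile("hdfs:///path/to/input")  # hypothetical input path
# parse_record is also importable on each executor when the closure
# below is deserialized there, since the egg is shipped along with the job
parsed = rdd.map(lambda line: parse_record(line))
print(parsed.count())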

You can write a small file to act as the main driver and execute:

./spark-submit --jars jar1.jar,jar2.jar --py-files path/to/my/egg.egg driver.py arg1 arg

The driver program would look something like this:

import sys

from pyspark import SparkContext, SparkConf
from my_module import myfn  # 'my_module' is a placeholder module name

if __name__ == '__main__':
    myargs = sys.argv[1:]  # arguments passed after driver.py on the spark-submit line
    conf = SparkConf().setAppName("app")
    sc = SparkContext(conf=conf)
    myfn(myargs, sc)
    sc.stop()

Pass the SparkContext object as an argument wherever necessary.
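If myfn itself needs to run Spark jobs, here is a minimal sketch (the function body is hypothetical) of a function that reuses the passed-in context rather than creating its own:

# inside the egg, e.g. my_module.py -- hypothetical contents
def myfn(args, sc):
    # reuse the SparkContext created by the driver; constructing
    # a second SparkContext in the same process would fail
    return sc.parallelize(args).count()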
