Pyspark: run a script from inside the archive

Question

I have an archive (basically a bundled conda environment + my application) which I can easily use with pyspark in yarn master mode:

PYSPARK_PYTHON=./pkg/venv/bin/python3 \
spark-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./pkg/venv/bin/python3 \
--master yarn \
--deploy-mode cluster \
--archives hdfs:///package.tgz#pkg \
app/MyScript.py

This works as expected, no surprise here.

Now how could I run this if MyScript.py is inside package.tgz, not on my local filesystem?

I would like to replace the last line of my command with e.g. ./pkg/app/MyScript.py, but then Spark complains: java.io.FileNotFoundException: File file:/home/blah/pkg/app/MyScript.py does not exist.

I could of course extract it first and put it separately on HDFS... There are workarounds, but since I have everything in one nice place, I would love to use it.
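For reference, the extraction workaround mentioned above could look like the sketch below. The archive layout (app/MyScript.py inside package.tgz) is an assumption inferred from the ./pkg/app/MyScript.py path in the question, not something the question confirms.

```python
# extract_script.py - hypothetical sketch of the extraction workaround.
# Assumes the archive contains app/MyScript.py (inferred from the
# ./pkg/app/MyScript.py path above).
import tarfile


def extract_script(archive: str, member: str, dest: str = ".") -> None:
    # Pull only the driver script out of the bundled environment so it
    # can be passed to spark-submit directly
    with tarfile.open(archive, "r:gz") as tar:
        tar.extract(member, path=dest)
```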

If it's relevant, this is Spark 2.4.0 with Python 3.7, on CDH.

Answer

As I understand it, you cannot: you must supply a Python script to spark-submit.

But you can have a very short script and use --py-files to distribute a ZIP or EGG of the rest of your code:

# go.py

from my.app import run

run()

# my/app.py

def run():
    print("hello")

You can create a ZIP file containing the my directory and submit that with the short entry point script: spark-submit --py-files my.zip go.py
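Building that ZIP can be done with any archiver; a minimal Python sketch (file and directory names follow the example above) might look like this:

```python
# make_zip.py - minimal sketch that packages the `my` directory into
# my.zip for use with --py-files (names follow the example above)
import os
import zipfile


def build_zip(src_dir: str, zip_path: str) -> None:
    # Walk the package directory, storing entries with paths relative
    # to the package's parent so the archive contains my/app.py etc.
    base = os.path.dirname(os.path.abspath(src_dir))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(root, name)
                zf.write(full, os.path.relpath(os.path.abspath(full), base))


if __name__ == "__main__" and os.path.isdir("my"):
    build_zip("my", "my.zip")
```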

If you like, you can make a generic go.py that accepts arguments telling it which module and method to import and run.
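Such a generic entry point might look like the sketch below; the command-line argument handling is an assumption for illustration, not part of the original answer.

```python
# go.py - hypothetical generic entry point: the module and function to
# run are taken from the command line rather than hard-coded, e.g.
#   spark-submit --py-files my.zip go.py my.app run
import importlib
import sys


def dispatch(module_name: str, func_name: str):
    # Import the requested module (resolvable because --py-files puts
    # the ZIP on the Python path) and call the named zero-argument
    # function, returning its result
    module = importlib.import_module(module_name)
    return getattr(module, func_name)()


if __name__ == "__main__" and len(sys.argv) >= 3:
    dispatch(sys.argv[1], sys.argv[2])
```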
