spark-submit with specific python libraries
Problem description
I have a PySpark job that depends on third-party libraries. I want to execute this code on my cluster, which runs under Mesos.
I have a zipped version of my Python environment on an HTTP server reachable by my cluster.
我在指定我的 spark-submit 查询以使用此环境时遇到了一些麻烦.我使用 --archives
加载 zip 文件和 --conf 'spark.pyspark.driver.python=path/to/my/env/bin/python'
加上 --conf 'spark.pyspark.python=path/to/my/env/bin/python'
来指定事物.
I have some trouble to specify to my spark-submit query to use this environment.
I use both --archives
to load the zip file and --conf 'spark.pyspark.driver.python=path/to/my/env/bin/python'
plus --conf 'spark.pyspark.python=path/to/my/env/bin/python'
to specify the thing.
This does not seem to work... Am I doing something wrong? Do you have any idea how to do this?
Cheers, Alex
Recommended answer
This may be helpful to some people who have dependencies.
I found a solution for how to properly load a virtual environment onto the master and all the slave workers:
virtualenv venv --relocatable
cd venv
zip -qr ../venv.zip *
PYSPARK_PYTHON=./SP/bin/python spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./SP/bin/python \
  --driver-memory 4G \
  --archives venv.zip#SP \
  filename.py
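The `venv.zip#SP` fragment tells YARN to unpack the archive into each container's working directory under the alias `SP`, which is why the relative path `./SP/bin/python` resolves. Since the original question targets Mesos rather than YARN, one possible adaptation is to let the Mesos fetcher download and unpack the archive into each task's sandbox via `spark.mesos.uris`. This is a sketch, not a tested recipe: the master address and archive URL are placeholders, and it assumes the zip unpacks to a top-level `venv/` directory in the sandbox.

```shell
# Hedged Mesos adaptation: the Mesos fetcher downloads (and auto-extracts)
# the archive into the sandbox, so relative interpreter paths can work.
spark-submit \
  --master mesos://master:5050 \
  --conf spark.mesos.uris=http://my-server/venv.zip \
  --conf spark.pyspark.driver.python=./venv/bin/python \
  --conf spark.pyspark.python=./venv/bin/python \
  my_job.py
```

If the zip was created from inside the environment directory (as in the `zip -qr ../venv.zip *` step above), the contents land directly in the sandbox and the interpreter path would instead be `./bin/python`; adjust to match how the archive was built.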