Easiest way to install Python dependencies on Spark executor nodes?


Problem description


I understand that you can send individual files as dependencies with Python Spark programs. But what about full-fledged libraries (e.g. numpy)?
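
To be concrete, shipping one extra file typically looks something like the sketch below, where helpers.py and my_job.py are placeholder names:

# attach one extra Python file so every executor can import it
spark-submit --py-files helpers.py my_job.py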

Does Spark have a way to use a provided package manager (e.g. pip) to install library dependencies? Or does this have to be done manually before Spark programs are executed?

If the answer is manual, then what are the "best practice" approaches for synchronizing libraries (installation path, version, etc.) over a large number of distributed nodes?

Solution

Having actually tried it, I think the link I posted as a comment doesn't do exactly what you want with dependencies. What you are quite reasonably asking for is a way to have Spark play nicely with setuptools and pip when it comes to installing dependencies. It blows my mind that this isn't supported better in Spark. The third-party dependency problem is largely solved in general-purpose Python, but under Spark the assumption seems to be that you'll go back to manual dependency management or something.

I have been using an imperfect but functional pipeline based on virtualenv. The basic idea is:

  1. Create a virtualenv purely for your Spark nodes
  2. Each time you run a Spark job, run a fresh pip install of all your own in-house Python libraries. If you have set these up with setuptools, this will install their dependencies
  3. Zip up the site-packages dir of the virtualenv. This will include your library and its dependencies, which the worker nodes will need, but not the standard Python library, which they already have
  4. Pass the single .zip file, containing your libraries and their dependencies, as an argument to --py-files

Of course you would want to code up some helper scripts to manage this process. Here is a helper script adapted from one I have been using, which could doubtless be improved a lot:

#!/usr/bin/env bash
# helper script to fulfil Spark's python packaging requirements.
# Installs everything in a designated virtualenv, then zips up the virtualenv for use as
# the value supplied to the --py-files argument of `pyspark` or `spark-submit`
# First argument should be the top-level virtualenv
# Second argument is the zipfile which will be created, and
#   which you can subsequently supply as the --py-files argument to 
#   spark-submit
# Subsequent arguments are all the private packages you wish to install
# If these are set up with setuptools, their dependencies will be installed

VENV=$1; shift
ZIPFILE=$1; shift
PACKAGES=$*

. $VENV/bin/activate
for pkg in $PACKAGES; do
  pip install --upgrade $pkg
done
TMPZIP="$TMPDIR/$RANDOM.zip" # abs path. Use random number to avoid clashes with other processes
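# NOTE: the path below assumes a Python 2.7 virtualenv; adjust "python2.7" to match the interpreter version in $VENV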
( cd "$VENV/lib/python2.7/site-packages" && zip -q -r $TMPZIP . )
mv $TMPZIP $ZIPFILE

I have a collection of other simple wrapper scripts that I run to submit my Spark jobs. I simply call this script first as part of that process and make sure that the second argument (the name of a zip file) is then passed as the --py-files argument when I run spark-submit (as documented in the script's comments). I always run these scripts, so I never end up accidentally running old code. Compared to the Spark overhead, the packaging overhead is minimal for my small-scale project.
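
To make that concrete, such a wrapper might look roughly like the sketch below; package_deps.sh (the helper script above), spark_venv, deps.zip, my_pkg and my_job.py are placeholder names, not the actual files I use:

#!/usr/bin/env bash
# hypothetical wrapper: rebuild the dependency zip, then hand it to spark-submit
set -e

./package_deps.sh "$HOME/spark_venv" /tmp/deps.zip ./my_pkg
spark-submit --py-files /tmp/deps.zip my_job.py "$@"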

There are loads of improvements that could be made, e.g. being smart about when to create a new zip file, or splitting it into two zip files, one containing often-changing private packages and one containing rarely changing dependencies, which don't need to be rebuilt so often. You could be smarter about checking for file changes before rebuilding the zip. Checking the validity of the arguments would also be a good idea. However, for now this suffices for my purposes.
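
For instance, a crude staleness check along the lines of the sketch below (which assumes GNU find for -quit and reuses the $ZIPFILE and $PACKAGES variables from the helper script) could skip the rebuild when nothing has changed:

# skip the rebuild if the zip already exists and no package file is newer than it
if [ -f "$ZIPFILE" ] && [ -z "$(find $PACKAGES -newer "$ZIPFILE" -print -quit)" ]; then
    echo "$ZIPFILE is up to date; skipping rebuild"
    exit 0
fi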

The solution I have come up with is not designed for large-scale dependencies like NumPy specifically (although it may work for them). Also, it won't work if you are building C-based extensions, and your driver node has a different architecture to your cluster nodes.

I have seen recommendations elsewhere to just run a Python distribution like Anaconda on all your nodes since it already includes NumPy (and many other packages), and that might be the better way to get NumPy as well as other C-based extensions going. Regardless, we can't always expect Anaconda to have the PyPI package we want in the right version, and in addition you might not be able to control your Spark environment to be able to put Anaconda on it, so I think this virtualenv-based approach is still helpful.
