Easiest way to install Python dependencies on Spark executor nodes?

Question

I understand that you can send individual files as dependencies with Python Spark programs. But what about full-fledged libraries (e.g. numpy)?
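
For example, something like the following minimal sketch works for a single file (the module my_helpers.py and its transform function are just placeholders for illustration):

# Minimal sketch: shipping one Python file to the executors with addPyFile.
# "my_helpers.py" and my_helpers.transform are hypothetical placeholders.
from pyspark import SparkContext

sc = SparkContext(appName="single-file-dependency")  # master comes from spark-submit / config
sc.addPyFile("my_helpers.py")  # copied to every executor's working directory

def apply_helper(x):
    import my_helpers  # import inside the task so it resolves on the executor
    return my_helpers.transform(x)

print(sc.parallelize(range(10)).map(apply_helper).collect())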

Does Spark have a way to use a provided package manager (e.g. pip) to install library dependencies? Or does this have to be done manually before Spark programs are executed?

If the answer is manual, then what are the "best practice" approaches for synchronizing libraries (installation path, version, etc.) over a large number of distributed nodes?

Answer

Actually having actually tried it, I think the link I posted as a comment doesn't do exactly what you want with dependencies. What you are quite reasonably asking for is a way to have Spark play nicely with setuptools and pip regarding installing dependencies. It blows my mind that this isn't supported better in Spark. The third-party dependency problem is largely solved in general purpose Python, but under Spark, it seems the assumption is you'll go back to manual dependency management or something.

I have been using an imperfect but functional pipeline based on virtualenv. The basic idea is


  1. Create a virtualenv purely for your Spark nodes
  2. Each time you run a Spark job, run a fresh pip install of all your own in-house Python libraries. If you have set these up with setuptools, this will install their dependencies
  3. Zip up the site-packages dir of the virtualenv. This will include your library and its dependencies, which the worker nodes will need, but not the standard Python library, which they already have
  4. Pass the single .zip file, containing your libraries and their dependencies, as an argument to --py-files (see the sketch after this list)
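
For reference, here is a minimal sketch of step 4 done from the Python side, which is roughly equivalent to passing --py-files on the spark-submit command line. The archive name deps.zip and the bundled module mylib are hypothetical placeholders:

# Minimal sketch: attaching the zipped site-packages to a job from Python code.
# "deps.zip" is the archive built in step 3; "mylib" stands for one of the
# bundled in-house packages. Both names are hypothetical.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("zipped-deps-example")
sc = SparkContext(conf=conf, pyFiles=["deps.zip"])  # same effect as --py-files deps.zip

def process(record):
    import mylib  # resolved from deps.zip on the executors
    return mylib.transform(record)

result = sc.parallelize(["a", "b", "c"]).map(process).collect()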

Of course you would want to code up some helper scripts to manage this process. Here is a helper script adapted from one I have been using, which could doubtless be improved a lot:

#!/usr/bin/env bash
# helper script to fulfil Spark's python packaging requirements.
# Installs everything in a designated virtualenv, then zips up the virtualenv for use as the value
# supplied to the --py-files argument of `pyspark` or `spark-submit`
# First argument should be the top-level virtualenv
# Second argument is the zipfile which will be created, and
#   which you can subsequently supply as the --py-files argument to 
#   spark-submit
# Subsequent arguments are all the private packages you wish to install
# If these are set up with setuptools, their dependencies will be installed

VENV=$1; shift
ZIPFILE=$1; shift
PACKAGES=$*

. $VENV/bin/activate
for pkg in $PACKAGES; do
  pip install --upgrade $pkg
done
TMPZIP="$TMPDIR/$RANDOM.zip" # abs path. Use random number to avoid clashes with other processes
( cd "$VENV/lib/python2.7/site-packages" && zip -q -r $TMPZIP . )
mv $TMPZIP $ZIPFILE

I have a collection of other simple wrapper scripts I run to submit my Spark jobs. I simply call this script first as part of that process and make sure that the second argument (the name of a zip file) is then passed as the --py-files argument when I run spark-submit (as documented in the comments). I always run these scripts, so I never end up accidentally running old code. Compared to the Spark overhead, the packaging overhead is minimal for my small-scale project.
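
For illustration, a sketch of the kind of wrapper described above, written as a small Python driver script. The names package_venv.sh (the helper script above), deps.zip, job.py and my_inhouse_lib are hypothetical:

#!/usr/bin/env python
# Hypothetical wrapper: rebuild the dependency archive, then submit the job
# with the archive attached via --py-files.
import subprocess

VENV = "spark_venv"                # virtualenv used only for Spark jobs
ZIPFILE = "deps.zip"               # archive produced by the helper script
PACKAGES = ["./my_inhouse_lib"]    # in-house packages set up with setuptools

# Step 1: repackage the virtualenv (the bash helper script shown earlier).
subprocess.check_call(["./package_venv.sh", VENV, ZIPFILE] + PACKAGES)

# Step 2: submit the job, shipping the archive to the executors.
subprocess.check_call(["spark-submit", "--py-files", ZIPFILE, "job.py"])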

There are loads of improvements that could be made, e.g. being smart about when to create a new zip file, or splitting it into two zip files, one containing often-changing private packages and one containing rarely changing dependencies, which don't need to be rebuilt so often. You could be smarter about checking for file changes before rebuilding the zip (see the sketch below). Also, checking the validity of arguments would be a good idea. However, for now this suffices for my purposes.
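
As a sketch of that last idea, checking modification times before rebuilding could look roughly like this (the paths are placeholders):

# Hypothetical sketch: skip rebuilding the zip when nothing has changed.
import os

def needs_rebuild(zip_path, source_dirs):
    """Return True if any file under source_dirs is newer than zip_path."""
    if not os.path.exists(zip_path):
        return True
    zip_mtime = os.path.getmtime(zip_path)
    for src in source_dirs:
        for root, _, files in os.walk(src):
            for name in files:
                if os.path.getmtime(os.path.join(root, name)) > zip_mtime:
                    return True
    return False

if needs_rebuild("deps.zip", ["./my_inhouse_lib"]):
    print("dependencies changed; rebuilding deps.zip")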

The solution I have come up with is not designed for large-scale dependencies like NumPy specifically (although it may work for them). Also, it won't work if you are building C-based extensions, and your driver node has a different architecture to your cluster nodes.

I have seen recommendations elsewhere to just run a Python distribution like Anaconda on all your nodes since it already includes NumPy (and many other packages), and that might be the better way to get NumPy as well as other C-based extensions going. Regardless, we can't always expect Anaconda to have the PyPI package we want in the right version, and in addition you might not be able to control your Spark environment to be able to put Anaconda on it, so I think this virtualenv-based approach is still helpful.
