Easiest way to install Python dependencies on Spark executor nodes?


Question

I understand that you can send individual files as dependencies with Python Spark programs. But what about full-fledged libraries (e.g. numpy)?

Does Spark have a way to use a provided package manager (e.g. pip) to install library dependencies? Or does this have to be done manually before Spark programs are executed?

If the answer is manual, then what are the "best practice" approaches for synchronizing libraries (installation path, version, etc.) over a large number of distributed nodes?

Answer

Having actually tried it, I think the link I posted as a comment doesn't do exactly what you want with dependencies. What you are quite reasonably asking for is a way to have Spark play nicely with setuptools and pip when installing dependencies. It blows my mind that this isn't supported better in Spark. The third-party dependency problem is largely solved in general-purpose Python, but under Spark the assumption seems to be that you'll fall back to manual dependency management or something.

I have been using an imperfect but functional pipeline based on virtualenv. The basic idea is:

  1. Create a virtualenv purely for your Spark nodes
  2. Each time you run a Spark job, run a fresh pip install of all your own in-house Python libraries. If you have set these up with setuptools, this will install their dependencies
  3. Zip up the site-packages dir of the virtualenv. This will include your libraries and their dependencies, which the worker nodes will need, but not the standard Python library, which they already have
  4. Pass the single .zip file, containing your libraries and their dependencies, as an argument to --py-files

Of course you would want to code up some helper scripts to manage this process. Here is a helper script adapted from one I have been using, which could doubtless be improved a lot:

#!/usr/bin/env bash
# Helper script to fulfil Spark's Python packaging requirements.
# Installs everything in a designated virtualenv, then zips up the virtualenv
# for use as the value of the --py-files argument to `pyspark` or `spark-submit`
# First argument should be the top-level virtualenv
# Second argument is the zipfile which will be created, and
#   which you can subsequently supply as the --py-files argument to 
#   spark-submit
# Subsequent arguments are all the private packages you wish to install
# If these are set up with setuptools, their dependencies will be installed

VENV=$1; shift
ZIPFILE=$1; shift
PACKAGES=$*

. "$VENV/bin/activate"
for pkg in $PACKAGES; do
  pip install --upgrade "$pkg"
done
TMPZIP="${TMPDIR:-/tmp}/$RANDOM.zip" # abs path; random name avoids clashes with other processes
( cd "$VENV/lib/python2.7/site-packages" && zip -q -r "$TMPZIP" . )
mv "$TMPZIP" "$ZIPFILE"
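Assuming the script above is saved as make_pyfiles_zip.sh (a hypothetical name, as are the paths and package names below), a run might look like this:

```shell
# Hypothetical names: adjust the virtualenv path, zip name, and package
# directories to your own project layout.
./make_pyfiles_zip.sh ~/venvs/spark_env deps.zip ./src/mylib ./src/myutils

# Then hand the zip to spark-submit so the executors can import the packages:
spark-submit --py-files deps.zip my_job.py
```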

I have a collection of other simple wrapper scripts I run to submit my Spark jobs. I simply call this script first as part of that process and make sure that the second argument (the name of the zip file) is then passed as the --py-files argument when I run spark-submit (as documented in the comments). I always run these scripts, so I never end up accidentally running old code. Compared to the Spark overhead, the packaging overhead is minimal for my small-scale project.

There are loads of improvements that could be made – e.g. being smart about when to create a new zip file, or splitting it up into two zip files, one containing often-changing private packages and one containing rarely-changing dependencies, which don't need to be rebuilt as often. You could be smarter about checking for file changes before rebuilding the zip. Checking the validity of the arguments would also be a good idea. For now, though, this suffices for my purposes.
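One of those improvements, skipping the rebuild when nothing has changed, can be sketched with a simple mtime check (the function name and the idea of passing source paths explicitly are illustrative, not part of the original script):

```shell
#!/usr/bin/env bash
# Sketch: decide whether the zip needs rebuilding by comparing mtimes.
# needs_rebuild returns 0 (true) when the zip is missing or any source
# file is newer than it, and 1 (false) otherwise.
needs_rebuild() {
  local zipfile=$1; shift
  [ -f "$zipfile" ] || return 0            # no zip yet: rebuild
  local src
  for src in "$@"; do
    [ "$src" -nt "$zipfile" ] && return 0  # a source changed: rebuild
  done
  return 1                                 # zip is up to date
}
```

For the two-zip split, note that --py-files accepts a comma-separated list, so both archives can be supplied at once, e.g. `--py-files deps.zip,private.zip`.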

The solution I have come up with is not designed for large-scale dependencies like NumPy specifically (although it may work for them). Also, it won't work if you are building C-based extensions, and your driver node has a different architecture to your cluster nodes.

I have seen recommendations elsewhere to just run a Python distribution like Anaconda on all your nodes since it already includes NumPy (and many other packages), and that might be the better way to get NumPy as well as other C-based extensions going. Regardless, we can't always expect Anaconda to have the PyPI package we want in the right version, and in addition you might not be able to control your Spark environment to be able to put Anaconda on it, so I think this virtualenv-based approach is still helpful.
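If you can at least choose which interpreter the executors run, the standard knob for that is the PYSPARK_PYTHON environment variable. A minimal sketch, assuming Anaconda is installed at the same (hypothetical) path on every node:

```shell
# Hypothetical install location; must exist on the driver and every worker.
export PYSPARK_PYTHON=/opt/anaconda/bin/python
spark-submit my_job.py
```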

