Easiest way to install Python dependencies on Spark executor nodes?


Question

I understand that you can send individual files as dependencies with Python Spark programs. But what about full-fledged libraries (e.g. numpy)?

Does Spark have a way to use a provided package manager (e.g. pip) to install library dependencies? Or does this have to be done manually before Spark programs are executed?

If the answer is manual, then what are the "best practice" approaches for synchronizing libraries (installation path, version, etc.) over a large number of distributed nodes?

Answer

Having actually tried it, I think the link I posted as a comment doesn't do exactly what you want with dependencies. What you are quite reasonably asking for is a way to have Spark play nicely with setuptools and pip regarding installing dependencies. It blows my mind that this isn't supported better in Spark. The third-party dependency problem is largely solved in general-purpose Python, but under Spark, it seems the assumption is that you'll go back to manual dependency management or something.

I have been using an imperfect but functional pipeline based on virtualenv. The basic idea is

  1. Create a virtualenv purely for your Spark nodes (a minimal sketch of this step follows the list)
  2. Each time you run a Spark job, run a fresh pip install of all your own in-house Python libraries. If you have set these up with setuptools, this will install their dependencies
  3. Zip up the site-packages dir of the virtualenv. This will include your library and its dependencies, which the worker nodes will need, but not the standard Python library, which they already have
  4. Pass the single .zip file, containing your libraries and their dependencies, as an argument to --py-files
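
Step 1 is a one-off setup. A minimal sketch, assuming a hypothetical location of /opt/spark-venv and the Python 2.7 layout the helper script below expects:

# One-off: create a dedicated virtualenv for packaging Spark dependencies
# (the /opt/spark-venv path is purely illustrative; use whatever suits your layout).
virtualenv --python=python2.7 /opt/spark-venv
# This is the directory the packaging script will later zip up:
ls /opt/spark-venv/lib/python2.7/site-packages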

Of course you would want to code up some helper scripts to manage this process. Here is a helper script adapted from one I have been using, which could doubtless be improved a lot:

#!/usr/bin/env bash
# Helper script to fulfil Spark's Python packaging requirements.
# Installs everything into a designated virtualenv, then zips up that virtualenv
#   for use as the value of the --py-files argument of `pyspark` or `spark-submit`.
# First argument should be the top-level virtualenv.
# Second argument is the zipfile which will be created, and
#   which you can subsequently supply as the --py-files argument to
#   spark-submit.
# Subsequent arguments are all the private packages you wish to install.
# If these are set up with setuptools, their dependencies will be installed.

VENV=$1; shift
ZIPFILE=$1; shift
PACKAGES=$*

. "$VENV/bin/activate"
for pkg in $PACKAGES; do
  pip install --upgrade "$pkg"
done
# Build the zip under an absolute temp path, using a random name to avoid
# clashes with other processes; fall back to /tmp if TMPDIR is unset.
TMPZIP="${TMPDIR:-/tmp}/$RANDOM.zip"
( cd "$VENV/lib/python2.7/site-packages" && zip -q -r "$TMPZIP" . )
mv "$TMPZIP" "$ZIPFILE"

I have a collection of other simple wrapper scripts I run to submit my spark jobs. I simply call this script first as part of that process and make sure that the second argument (name of a zip file) is then passed as the --py-files argument when I run spark-submit (as documented in the comments). I always run these scripts, so I never end up accidentally running old code. Compared to the Spark overhead, the packaging overhead is minimal for my small scale project.
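
As a rough illustration of how those wrapper scripts can fit together; every name and path here (make_spark_pyfiles.sh, /opt/spark-venv, my_job.py, the package directories) is a placeholder I have made up, while spark-submit --py-files itself is the real mechanism described above:

#!/usr/bin/env bash
# Illustrative wrapper sketch: rebuild the dependency zip, then hand it to
# spark-submit via --py-files. All names and paths below are placeholders.
VENV=/opt/spark-venv
DEPS_ZIP=/tmp/spark-deps.zip

./make_spark_pyfiles.sh "$VENV" "$DEPS_ZIP" ./my_inhouse_lib ./another_inhouse_lib
spark-submit --py-files "$DEPS_ZIP" my_job.py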

There are loads of improvements that could be made: being smart about when to create a new zip file; splitting it into two zip files, one containing your often-changing private packages and one containing your rarely changing dependencies, which don't need to be rebuilt so often; and being smarter about checking for file changes before rebuilding the zip (a rough sketch of that check follows below). Checking the validity of the arguments would also be a good idea. However, for now this suffices for my purposes.
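
One way to approach the file-change check is sketched below; it could sit near the top of the helper script, after the arguments are parsed. It assumes that find with -newer is available and that skipping the fresh pip install entirely is acceptable when nothing has changed, which is a deviation from step 2 above:

# Sketch only: skip the rebuild when nothing under the package directories is
# newer than the zip built last time. ZIPFILE and PACKAGES are the same variables
# as in the helper script; PACKAGES is deliberately unquoted so it splits into paths.
if [ -f "$ZIPFILE" ] && [ -z "$(find $PACKAGES -newer "$ZIPFILE" 2>/dev/null | head -n 1)" ]; then
  echo "Dependency zip is up to date; skipping rebuild."
  exit 0
fi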

The solution I have come up with is not designed for large-scale dependencies like NumPy specifically (although it may work for them). Also, it won't work if you are building C-based extensions, and your driver node has a different architecture to your cluster nodes.

I have seen recommendations elsewhere to just run a Python distribution like Anaconda on all your nodes since it already includes NumPy (and many other packages), and that might be the better way to get NumPy as well as other C-based extensions going. Regardless, we can't always expect Anaconda to have the PyPI package we want in the right version, and in addition you might not be able to control your Spark environment to be able to put Anaconda on it, so I think this virtualenv-based approach is still helpful.
