How to turn pip / pypi installed python packages into zip files to be used in AWS Glue


Question

I am working with AWS Glue and PySpark ETL scripts, and want to use auxiliary libraries such as google_cloud_bigquery as a part of my PySpark scripts.

The documentation states that this should be possible. This previous Stack Overflow discussion, especially one comment in one of the answers, seems to provide additional proof. However, how to do it is unclear to me.

So the goal is to turn the pip-installed packages into one or more zip files, in order to be able to just host the packages on S3 and point to them like so:

s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip

How that should be done is not clearly stated anywhere I've looked.

i.e. how do I pip install a package and then turn it into a zip file that I can upload to S3 so PySpark can use it with such an S3 URL?
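For reference, comma-separated S3 zip paths like the ones above are what AWS Glue's --extra-py-files special parameter takes. A hypothetical job definition (job name, role, bucket names, and script location are all placeholders, not values from this question) might look like this:

```shell
aws glue create-job \
  --name my-etl-job \
  --role MyGlueServiceRole \
  --command Name=glueetl,ScriptLocation=s3://bucket/scripts/job.py \
  --default-arguments '{"--extra-py-files":"s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip"}'
```

Glue unpacks the archives listed there onto the Python path of the job, so anything importable from inside the zips becomes importable in the script.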

By using the command pip download I have been able to fetch the libs, but by default they are not .zip files; instead they are either .whl or .tar.gz files

..so I am not sure how to turn them into zip files that AWS Glue can digest. Maybe with .tar.gz I could just tar -xf them and then zip them back up, but what about the .whl files?

Answer

So, after going through the materials I sourced in the comments over the past 48 hours, here's how I solved the issue.

Note: I use Python 2.7 because that's what AWS Glue seems to ship with.

By following the instructions in E. Kampf's blog post "Best Practices Writing Production-Grade PySpark Jobs" and this Stack Overflow answer, and after some tweaking due to random errors along the way, I did the following:

  1. Create a new project folder named ziplib and cd into it:

mkdir ziplib && cd ziplib

  2. Create a requirements.txt file with the name of one package on each row.

  3. Create a folder in it called deps:

mkdir deps

  4. Create a new virtualenv environment with Python 2.7 in the current folder:

virtualenv -p python2.7 .

  5. Install the requirements into the deps folder, using an absolute path (it will not work otherwise):

bin/pip2.7 install -r requirements.txt --install-option --install-lib="/absolute/path/to/.../ziplib/deps"

  6. cd into the deps folder, zip its contents into an archive deps.zip in the parent folder, and then cd back out of the deps folder:

cd deps && zip -r ../deps.zip . && cd ..

..and so now I have a zip file which, if I put it onto AWS S3 and point to it from PySpark on AWS Glue, seems to work.
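The archive can also be sanity-checked locally before uploading, since plain Python can import straight from a zip on sys.path, which is essentially what Glue does with --extra-py-files. A minimal sketch with a hypothetical module name (python3 is used only for the local check; the Glue job itself runs Python 2.7):

```shell
set -e
# Build a tiny stand-in archive containing one module (hypothetical
# name; a real deps.zip holds the pip-installed packages)
mkdir -p mylib && echo "VALUE = 42" > mylib/__init__.py
zip -qr deps_check.zip mylib && rm -rf mylib
# Put the zip on sys.path and import from it
python3 -c "import sys; sys.path.insert(0, 'deps_check.zip'); import mylib; print(mylib.VALUE)"
```

If the import succeeds locally, the archive layout (package directories at the zip root) is at least structurally what Glue expects.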

HOWEVER... what I haven't been able to solve is that some packages, such as the Google Cloud Python client libs, use what are known as Implicit Namespace Packages (PEP-420): they don't have the __init__.py files usually present in packages, and thus the import statements don't work. I'm at a loss here.
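One workaround often suggested for PEP-420 namespace packages under Python 2.7 is to drop an empty __init__.py into every package directory before zipping, so Python treats them as regular packages. This is an assumption, not something verified against the Google Cloud libs (which may depend on real namespace-package machinery), and the directory tree below is a hypothetical stand-in:

```shell
set -e
# Stand-in for the unpacked dependency tree (hypothetical layout)
mkdir -p deps/google/cloud
# Add an empty __init__.py to every directory so that Python 2.7
# treats them as regular, importable packages
find deps -type d -exec sh -c 'touch "$0/__init__.py"' {} \;
```

After this, re-zipping deps as in step 6 would produce an archive whose packages are at least visible to Python 2.7's importer.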
