How to turn pip / pypi installed python packages into zip files to be used in AWS Glue
Question
I am working with AWS Glue and PySpark ETL scripts, and want to use auxiliary libraries such as google_cloud_bigquery as a part of my PySpark scripts.
The documentation states this should be possible. This previous Stack Overflow discussion, especially one comment in one of the answers, seems to provide additional proof. However, how to do it is unclear to me.
So the goal is to turn the pip installed packages into one or more zip files, in order to be able to just host the packages on S3 and point to them like so:

s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip
How that should be done is not clearly stated anywhere I've looked.
i.e. how do I pip install a package and then turn it into a zip file that I can upload to S3, so PySpark can use it with such an S3 URL?
By using the command pip download I have been able to fetch the libs, but they are not .zip files by default; instead they are either .whl or .tar.gz files, so I'm not sure how to turn them into zip files that AWS Glue can digest. Maybe with .tar.gz I could just tar -xf them and then zip them back up, but how about whl files?
Answer
So, after going through the materials I sourced in the comments over the past 48 hours, here's how I solved the issue.
Note: I use Python 2.7 because that's what AWS Glue seems to ship with.
By following the instructions in E. Kampf's blog post "Best Practices Writing Production-Grade PySpark Jobs" and this Stack Overflow answer, plus some tweaking due to random errors along the way, I did the following:
- Create a new project folder called ziplib and cd into it:
mkdir ziplib && cd ziplib
- Create a requirements.txt file with the names of packages, one per row.
- Create a folder in it called deps:
mkdir deps
- Create a new virtualenv environment with Python 2.7 in the current folder:
virtualenv -p python2.7 .
- Install the requirements into the deps folder, using an absolute path (it won't work otherwise):
bin/pip2.7 install -r requirements.txt --install-option --install-lib="/absolute/path/to/.../ziplib/deps"
- cd into the deps folder, zip its contents into the archive deps.zip in the parent folder, then cd back out of deps:
cd deps && zip -r ../deps.zip . && cd ..
...and so now I have a zip file which, if I put it onto AWS S3 and point to it from PySpark on AWS Glue, seems to work.
HOWEVER... what I haven't been able to solve is that since some packages, such as the Google Cloud Python client libs, use what is known as Implicit Namespace Packages (PEP 420), they don't have the __init__.py files usually present in modules, and thus the import statements don't work. I'm at a loss here.