Read Parquet file stored in S3 with AWS Lambda (Python 3)
Problem description
I am trying to load, process, and write Parquet files in S3 with AWS Lambda. My testing/deployment process is:
- https://github.com/lambci/docker-lambda as a container to mock the Amazon environment, because of the native libraries that need to be installed (numpy, among others).
- This procedure to generate a zip file: http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example-deployment-pkg.html#with-s3-example-deployment-pkg-python
- Add a test Python function to the zip, send it to S3, update the Lambda function, and test it.
There seem to be two possible approaches, both of which work locally in the docker container:
- fastparquet with s3fs: Unfortunately, the unzipped size of the package is bigger than 256MB, so I can't update the Lambda code with it.
- pyarrow with s3fs: I followed https://github.com/apache/arrow/pull/916, and when executed with the Lambda function I get one of the following:
- If I prefix the URI with S3 or S3N (as in the code example): in the Lambda environment I get
OSError: Passed non-file path: s3://mybucket/path/to/myfile
in pyarrow/parquet.py, line 848. Locally I get
IndexError: list index out of range
in pyarrow/parquet.py, line 714.
- If I don't prefix the URI with S3 or S3N: it works locally (I can read the Parquet data). In the Lambda environment I get the same
OSError: Passed non-file path: s3://mybucket/path/to/myfile
in pyarrow/parquet.py, line 848.
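For reference, the pyarrow-with-s3fs read path described above looks roughly like the sketch below. It assumes `pyarrow`, `s3fs`, and `pandas` are installed and that credentials come from the environment. The helper `strip_s3_scheme` is hypothetical, introduced only to make the two URI forms (with and without a scheme prefix) explicit; it is not claimed to resolve the errors shown.

```python
def strip_s3_scheme(uri):
    """Return 'bucket/key' for 's3://bucket/key'; leave scheme-less paths alone."""
    for scheme in ("s3://", "s3n://", "s3a://"):
        if uri.lower().startswith(scheme):
            return uri[len(scheme):]
    return uri


def read_parquet_from_s3(uri):
    """Read a Parquet dataset from S3 into a pandas DataFrame."""
    # Imported lazily so strip_s3_scheme can be used without these packages.
    import pyarrow.parquet as pq
    import s3fs

    fs = s3fs.S3FileSystem()  # credentials taken from the environment / IAM role
    # Pass the filesystem explicitly so pyarrow does not treat the path as local.
    dataset = pq.ParquetDataset(strip_s3_scheme(uri), filesystem=fs)
    return dataset.read().to_pandas()


if __name__ == "__main__":
    print(strip_s3_scheme("s3://mybucket/path/to/myfile"))
```

Logging the installed `pyarrow` and `s3fs` versions in both environments is a cheap way to check whether the docker container and Lambda are actually running the same code.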
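My questions are:
- Why do I get different results in my docker container than in the Lambda environment?
- What is the proper way to give the URI?
- Is there an accepted way to read Parquet files in S3 through AWS Lambda?
Thanks!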
Recommended answer
I was able to write Parquet files into S3 using fastparquet. It's a little tricky, but my breakthrough came when I realized that to put together all the dependencies, I had to use the exact same Linux that Lambda is using.
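A minimal sketch of what such a write could look like, assuming `fastparquet`, `s3fs`, and `pandas` are bundled in the deployment package. The event fields `bucket` and `key`, the helper `s3_path`, and the sample DataFrame are all hypothetical and only there for illustration; this is not the answer's actual function.

```python
def s3_path(bucket, key):
    """Join bucket and key into the 'bucket/key' form that s3fs accepts."""
    return "{}/{}".format(bucket.strip("/"), key.lstrip("/"))


def lambda_handler(event, context):
    # Imported inside the handler so a packaging problem shows up in the logs
    # as an ImportError for this invocation.
    import pandas as pd
    import s3fs
    from fastparquet import write

    fs = s3fs.S3FileSystem()  # uses the Lambda execution role's credentials
    df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})  # placeholder data

    path = s3_path(event["bucket"], event["key"])
    # open_with tells fastparquet to open the target through s3fs
    # instead of the local filesystem.
    write(path, df, open_with=fs.open)
    return {"written": path}
```

The function's execution role needs permission to put objects into the target bucket for the write to succeed.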
Here's how I did it:
1. Launched an EC2 instance from the same Amazon Linux image that Lambda uses.
Source: https://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html
Linux image: https://console.aws.amazon.com/ec2/v2/home#Images:visibility=public-images;search=amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2
Note: you might need to install many packages and change the Python version to 3.6, as this Linux image is not meant for development. Here's how I looked for packages:
sudo yum list | grep python3
I installed:
python36.x86_64
python36-devel.x86_64
python36-libs.x86_64
python36-pip.noarch
python36-setuptools.noarch
python36-tools.x86_64
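Since the point of this step is to build on the same platform and Python version that Lambda runs, a small check like the following could be run on the build machine and inside the function to compare the two. This is only a sketch using the standard library; the names `runtime_info` and `matches_python` are hypothetical.

```python
import platform
import sys


def runtime_info():
    """Collect the details that matter when building native dependencies."""
    return {
        "python": "{}.{}".format(*sys.version_info[:2]),
        "platform": platform.platform(),
        "machine": platform.machine(),
    }


def matches_python(expected="3.6"):
    """True if the running interpreter has the expected major.minor version."""
    return runtime_info()["python"] == expected


if __name__ == "__main__":
    print(runtime_info())
    print("matches expected Python version:", matches_python())
```

Returning `runtime_info()` from a throwaway Lambda handler gives the values to compare against the build machine.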
2. Built a zip file containing my script and all of its dependencies by installing everything into one folder and zipping it, with these commands:
mkdir parquet
cd parquet
pip install -t . fastparquet
pip install -t . (any other dependencies)
# copy my Python file into this folder
# zip the folder contents and upload the zip to Lambda
Note: there are some constraints I had to work around: Lambda doesn't let you upload a zip larger than 50 MB, and the unzipped package can't exceed roughly 250 MB. If anyone knows a better way to get dependencies into Lambda, please do share.
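The zip step and the size constraints can be checked before uploading with a sketch like the one below. It uses only the standard library; the limit constants reflect the documented 50 MB zipped (direct upload) and 250 MB unzipped quotas and should be treated as assumptions to verify against the current Lambda documentation. The function names are hypothetical.

```python
import os
import zipfile

MAX_ZIP_BYTES = 50 * 1024 * 1024        # assumed direct-upload limit for the zip
MAX_UNZIPPED_BYTES = 250 * 1024 * 1024  # assumed limit for the unpacked package


def build_package(folder, zip_path):
    """Zip the contents of `folder`; return (zipped_bytes, unzipped_bytes)."""
    unzipped = 0
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(folder):
            for name in files:
                full = os.path.join(root, name)
                # Skip the archive itself if it was created inside the folder.
                if os.path.abspath(full) == os.path.abspath(zip_path):
                    continue
                unzipped += os.path.getsize(full)
                # Store paths relative to the folder so modules sit at the zip root.
                zf.write(full, os.path.relpath(full, folder))
    return os.path.getsize(zip_path), unzipped


def fits_lambda_limits(zipped, unzipped):
    """True if both sizes are within the assumed Lambda limits."""
    return zipped <= MAX_ZIP_BYTES and unzipped <= MAX_UNZIPPED_BYTES
```

For example, `fits_lambda_limits(*build_package("parquet", "parquet.zip"))` reports whether the folder built above is small enough before any upload is attempted.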
Source: Write parquet from AWS Kinesis firehose to AWS S3