Write-streaming to Google Cloud Storage in Python


Question

I am trying to migrate an AWS Lambda function written in Python to Cloud Functions (CF) that:

  1. unzips on the fly and reads line by line
  2. performs some light transformations on each line
  3. writes the output (a line at a time, or in chunks) uncompressed to GCS (a rough sketch of steps 1-2 follows this list)
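
A minimal sketch of steps 1-2, assuming the compressed input can be opened as a file (a hypothetical local file name is used as a stand-in for the real source); where the GCS write would go is exactly the open question below:

```python
import gzip

def transform(line: str) -> str:
    """Placeholder for the light per-line transformation."""
    return line.upper()

# "input.csv.gz" is a hypothetical stand-in for the real compressed source.
with gzip.open("input.csv.gz", mode="rt", encoding="utf-8") as src:  # decompress on the fly
    for line in src:  # read line by line, never the whole file
        out_line = transform(line.rstrip("\n"))
        # step 3: write out_line, uncompressed, to GCS - this is the open question
```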

The output is > 2GB - but slightly less than 3GB so it fits in Lambda, just.

Well, it seems impossible or way more involved in GCP:

  • the uncompressed output cannot fit in memory or in /tmp - limited to 2048MB as of this writing - so the Python client library's upload_from_file (or upload_from_filename) cannot be used
  • there is this official document, but to my surprise it refers to boto, a library initially designed for AWS S3 and quite outdated now that boto3 has been out for some time. There is no genuine GCP method to stream-write or stream-read
  • Node.js has a simple createWriteStream() - nice article here btw - but there is no equivalent one-liner in Python
  • Resumable media upload sounds like it, but it is a lot of code for something handled much more easily in Node
  • App Engine had cloudstorage, but it is not available outside of App Engine - and it is obsolete
  • there are little to no examples out there of a working wrapper for writing text/plain data line by line as if GCS were a local filesystem. This is not limited to Cloud Functions - it is a missing feature of the Python client library - but it is more acute in CF due to the resource constraints. Btw, I was part of a discussion to add a writeable IOBase implementation, but it gained no traction.
  • obviously, using a VM or Dataflow is out of the question for the task at hand.

In my mind, stream (or stream-like) reading/writing from cloud-based storage should even be included in the Python standard library.

As recommended back then, one can still use GCSFS, which behind the scenes commits the upload in chunks for you while you write to a file object. The same team wrote s3fs. I don't know about Azure.
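
For illustration, a minimal sketch of that pattern, assuming a hypothetical project id, bucket, and object name; GCSFS buffers what you write and commits it to GCS in chunks behind the scenes, so the full output never has to fit in memory:

```python
import gcsfs

fs = gcsfs.GCSFileSystem(project="my-project")  # hypothetical project id

# Lines are generated lazily (stand-in for the unzip + transform pipeline above).
lines = (f"row {i}" for i in range(1_000_000))

# fs.open() returns a writable file-like object; writes are buffered and
# uploaded to the bucket in chunks as the buffer fills up.
with fs.open("my-bucket/output.txt", "w") as gcs_file:  # hypothetical bucket/object
    for line in lines:
        gcs_file.write(line + "\n")
```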

AFAIC, I will stick to AWS Lambda as the output can fit in memory - for now - but multipart upload is the way to go to support any output size with a minimum of memory.

Thoughts or alternatives?

Solution

I got confused with multipart vs. resumable upload. The latter is what you need for "streaming" - it's actually more like uploading chunks of a buffered stream.

Multipart upload is for uploading the data and custom metadata at once, in the same API call.
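
To make the distinction concrete, here is a hedged sketch of such a multipart call against the GCS JSON API (metadata and data in one request) through an authorized requests session; the bucket and object names are placeholders and error handling is omitted:

```python
import json

import google.auth
from google.auth.transport.requests import AuthorizedSession

credentials, _ = google.auth.default()
session = AuthorizedSession(credentials)

boundary = "===============demo=="
metadata = {"name": "output.txt", "metadata": {"source": "demo"}}  # custom metadata
data = b"hello world\n"

# multipart/related body: one part for the JSON metadata, one for the object data.
body = (
    f"--{boundary}\r\n"
    "Content-Type: application/json; charset=UTF-8\r\n\r\n"
    f"{json.dumps(metadata)}\r\n"
    f"--{boundary}\r\n"
    "Content-Type: text/plain\r\n\r\n"
).encode("utf-8") + data + f"\r\n--{boundary}--".encode("utf-8")

resp = session.post(
    "https://storage.googleapis.com/upload/storage/v1/b/my-bucket/o",  # placeholder bucket
    params={"uploadType": "multipart"},
    headers={"Content-Type": f"multipart/related; boundary={boundary}"},
    data=body,
)
resp.raise_for_status()
```

Note that the entire payload goes into a single request body, which is exactly why multipart does not help with the > 2GB streaming case.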

While I like GCSFS very much - Martin, its main contributor, is very responsive - I recently found an alternative that uses the google-resumable-media library.

GCSFS is built upon the core HTTP API, whereas Seth's solution uses a low-level library maintained by Google that is more in sync with API changes and includes exponential backoff. The latter is really a must for large/long streams, as the connection may drop even within GCP - we faced the issue with GCF.
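
For reference, a minimal sketch of the resumable, chunk-by-chunk pattern with google-resumable-media, assuming default application credentials and placeholder bucket/object names; real code would feed the transformed lines into the stream instead of a pre-filled BytesIO:

```python
import io

import google.auth
from google.auth.transport.requests import AuthorizedSession
from google.resumable_media.requests import ResumableUpload

CHUNK_SIZE = 256 * 1024  # chunk size must be a multiple of 256 KiB

credentials, _ = google.auth.default()
transport = AuthorizedSession(credentials)

# Stand-in for the real stream of transformed output lines.
stream = io.BytesIO(b"some transformed line\n" * 100_000)

upload = ResumableUpload(
    "https://storage.googleapis.com/upload/storage/v1/"
    "b/my-bucket/o?uploadType=resumable",  # placeholder bucket
    CHUNK_SIZE,
)
upload.initiate(transport, stream, {"name": "output.txt"}, "text/plain")

# Transmit the buffered stream one chunk at a time until the object is complete.
while not upload.finished:
    upload.transmit_next_chunk(transport)
```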

On a closing note, I still believe that the Google Cloud Library is the right place to add stream-like functionality, with basic write and read. It has the core code already.

If you too are interested in that feature in the core lib, thumbs up the issue here - assuming priority is based thereon.
