How to read blob (pickle) files from GCS in a Google Cloud DataFlow job?

Question

I am trying to run a DataFlow pipeline remotely that will use a pickle file. Locally, I can use the code below to load the file.

import pickle

with open(known_args.file_path, 'rb') as fp:
    file = pickle.load(fp)

However, it does not work when the path points to Cloud Storage (gs://...):

IOError: [Errno 2] No such file or directory: 'gs://.../.pkl'

I roughly understand why it is not working, but I cannot find the right way to do it.

Answer

If you have pickle files in your GCS bucket, you can load them as blobs and process them further as in your code (using pickle.load()):

class ReadGcsBlobs(beam.DoFn):
    def process(self, element, *args, **kwargs):
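        # Import inside process() so the module is available when this DoFn runs on Dataflow workers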
        from apache_beam.io.gcp import gcsio
        gcs = gcsio.GcsIO()
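        # element is a gs:// path; emit a (path, raw file bytes) pair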
        yield (element, gcs.open(element).read())


# usage example (p is an existing beam.Pipeline):
files = (p
         | "Initialize" >> beam.Create(["gs://your-bucket-name/pickle_file_path.pickle"])
         | "Read blobs" >> beam.ParDo(ReadGcsBlobs())
        )
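
To turn those blobs back into Python objects, one option (a minimal sketch building on the files collection above; the unpickle helper is a name introduced here for illustration) is to follow the ParDo with a beam.Map that calls pickle.loads(), the bytes variant of pickle.load(), on each blob:

import pickle
import apache_beam as beam

def unpickle(path_and_blob):
    # path_and_blob is the (path, raw bytes) pair emitted by ReadGcsBlobs
    path, blob = path_and_blob
    return (path, pickle.loads(blob))

# usage example, continuing the pipeline above:
objects = files | "Unpickle blobs" >> beam.Map(unpickle)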
