How to read blob (pickle) files from GCS in a Google Cloud DataFlow job?


Problem description

I am trying to run a DataFlow pipeline remotely which will use a pickle file. Locally, I can use the code below to load the file.

import pickle

with open(known_args.file_path, 'rb') as fp:
    file = pickle.load(fp)

However, this does not work when the path points to Cloud Storage (gs://...):

IOError: [Errno 2] No such file or directory: 'gs://.../.pkl'

I kind of understand why it is not working, but I cannot find the right way to do it.

Recommended answer

If you have pickle files in your GCS bucket, you can load them as blobs and process them further as in your code (using pickle.load()):

class ReadGcsBlobs(beam.DoFn):
    def process(self, element, *args, **kwargs):
        # Import inside process() so the module is available on remote Dataflow workers.
        from apache_beam.io.gcp import gcsio
        gcs = gcsio.GcsIO()
        # Emit (path, raw bytes); the bytes can be unpickled downstream.
        yield (element, gcs.open(element).read())


# usage example:
files = (p
         | "Initialize" >> beam.Create(["gs://your-bucket-name/pickle_file_path.pickle"])
         | "Read blobs" >> beam.ParDo(ReadGcsBlobs())
        )
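
As a follow-up to the answer above (not part of the original), here is a minimal sketch of how the (path, bytes) pairs emitted by ReadGcsBlobs could be deserialized downstream with pickle.loads(); the UnpickleBlob DoFn name is hypothetical, and files is the PCollection from the usage example:

import pickle

import apache_beam as beam


class UnpickleBlob(beam.DoFn):
    def process(self, element):
        # element is a (path, raw_bytes) pair produced by ReadGcsBlobs.
        path, raw_bytes = element
        # pickle.loads deserializes the blob bytes read from GCS.
        yield (path, pickle.loads(raw_bytes))


# usage example, continuing the pipeline above:
objects = files | "Unpickle blobs" >> beam.ParDo(UnpickleBlob())

Note that pickle can execute code during deserialization, so only unpickle blobs from sources you trust.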
