How to read blob (pickle) files from GCS in a Google Cloud DataFlow job?
Problem description
I am trying to run a DataFlow pipeline remotely which will use a pickle file. Locally, I can use the code below to load the file:
import pickle

with open(known_args.file_path, 'rb') as fp:
    file = pickle.load(fp)
However, this does not work when the path is a Cloud Storage path (gs://...):
IOError: [Errno 2] No such file or directory: 'gs://.../.pkl'
I roughly understand why it is not working, but I cannot find the right way to do it.
Answer
If you have pickle files in your GCS bucket, then you can load them as blobs and process them further like in your code (using pickle.load()):
class ReadGcsBlobs(beam.DoFn):
    def process(self, element, *args, **kwargs):
        from apache_beam.io.gcp import gcsio
        gcs = gcsio.GcsIO()
        yield (element, gcs.open(element).read())

# usage example:
files = (p
         | "Initialize" >> beam.Create(["gs://your-bucket-name/pickle_file_path.pickle"])
         | "Read blobs" >> beam.ParDo(ReadGcsBlobs())
        )
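The DoFn above yields the raw bytes of each blob; recovering the Python object still requires pickle.loads. A minimal follow-up sketch (the unpickle_blob helper is illustrative, not from the original answer):

```python
import pickle


def unpickle_blob(element):
    # element is a (path, raw_bytes) pair, as emitted by ReadGcsBlobs above.
    path, blob = element
    return (path, pickle.loads(blob))
```

In the pipeline this could follow the read step, e.g. `objects = files | "Unpickle" >> beam.Map(unpickle_blob)`.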