How to combine multiple files in a GCS bucket with a Cloud Function trigger


Question

I have 3 files per date per name, named in the format 'nameXX_date'. For example: 'nameXX_01-01-20', 'nameXY_01-01-20', 'nameXZ_01-01-20'

where 'name' can be anything, and the date is the day the file was uploaded (almost every day).

I need to write a cloud function that triggers whenever a new file lands in the bucket and combines the 3 files (XX, XY, XZ) into one file with filename = "name_date".

Here's what I've got so far:


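import logging
from google.cloud import storage as gcs

bucket_id = 'bucketname'
client = gcs.Client()
bucket = client.get_bucket(bucket_id)

name = ...  # placeholder: not yet known how to obtain this from the new file
date = ...  # placeholder: not yet known how to obtain this from the new file
outfile = f'bucketname/{name}_{date}.CSV'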

blobs = []
for shard in ('XX', 'XY', 'XZ'):
    sfile = f'{name}{shard}_{date}'
    blob = bucket.blob(sfile)
    if not blob.exists():
        # this causes a retry in 60s
        raise ValueError(f'branch {sfile} not present')
    blobs.append(blob)
bucket.blob(outfile).compose(blobs)
logging.info(f'Successfully created {outfile}')
for blob in blobs:
    blob.delete()
logging.info('Deleted {} blobs'.format(len(blobs)))

The issue I'm facing is that I'm not sure how to get the name and date of the new file that landed in the bucket, so that I can find the other 2 matching files and combine them.

By the way, I got this code from the following article and am trying to adapt it here: https://medium.com/google-cloud/how-to-write-to-a-single-shard-on-google-cloud-storage-efficiently-using-cloud-dataflow-and-cloud-3aeef1732325

Recommended answer

As I understand it, the cloud function is triggered by a google.storage.object.finalize event on an object in the specific GCS bucket.

In that case, your cloud function "signature" looks like this (taken from the Medium article you mentioned):

def compose_shards(data, context):

Here, data is a dictionary with plenty of details about the object (file) that has been finalized. See some details here: Google Cloud Storage Triggers

For example, data["name"] is the name of the object that triggered the event.
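As a minimal sketch of reading the event, the function below pulls the bucket and object name out of the payload. The function name `describe_event` and the sample payload are hypothetical and only for illustration; the "bucket" and "name" keys are the fields I would expect in a Cloud Storage finalize event, so check them against the trigger documentation for your runtime.

```python
def describe_event(data, context):
    """Return (bucket, object name) from a storage finalize event payload."""
    bucket_name = data["bucket"]  # bucket that received the object
    object_name = data["name"]    # object path inside that bucket
    return bucket_name, object_name


# Hypothetical payload, trimmed to the two fields used above.
sample_event = {"bucket": "bucketname", "name": "nameXX_01-01-20"}
print(describe_event(sample_event, None))
```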

If you know the pattern/template/rule by which those objects/shards are named, you can extract the relevant elements from the object/shard name and use them to compose the target object/file name.
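One way this extraction might look, assuming the naming rule from the question ('<name><shard>_<date>' with shards XX, XY, XZ and a dd-mm-yy style date). The regular expression and the helper names `parse_shard_name`, `sibling_names` and `target_name` are assumptions made for this sketch, not part of the original code.

```python
import re

SHARDS = ("XX", "XY", "XZ")

# Assumed naming rule: <name><shard>_<two digits>-<two digits>-<two digits>
SHARD_PATTERN = re.compile(
    r"^(?P<name>.+)(?P<shard>XX|XY|XZ)_(?P<date>\d{2}-\d{2}-\d{2})$"
)


def parse_shard_name(object_name):
    """Split an object name into (name, shard, date), or None if it does not match."""
    match = SHARD_PATTERN.match(object_name)
    if match is None:
        return None
    return match.group("name"), match.group("shard"), match.group("date")


def sibling_names(name, date):
    """Names of all three shards expected for this name and date."""
    return [f"{name}{shard}_{date}" for shard in SHARDS]


def target_name(name, date):
    """Name of the combined output object."""
    return f"{name}_{date}"
```

Inside the cloud function, `parse_shard_name(data["name"])` would then supply the `name` and `date` values that the question's code is missing. Returning None for non-matching names also gives the function a way to skip objects that are not shards, which matters because writing the combined file to the same bucket fires another finalize event.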
