Ignore/skip GCS input files that don't exist

Question

Our requirement is to process the last 24 hours of ad-serving logs, which Google DFP writes directly to our GCS bucket.

We currently achieve this by using a Flatten and passing in all the file names for the last 24 hours. The file names are in yyyyMMdd_hh format.
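For illustration, the list of hourly names described above could be generated along these lines. This is only a sketch using the JDK: the class and method names are hypothetical, and it assumes the "hh" in the stated format means a 24-hour clock (Java's "HH") and that the window is the 24 full hours before the current hour.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Dataflow SDK.
final class HourlyLogNames {
    // Assumption: "yyyyMMdd_hh" in the question refers to a 24-hour clock, hence "HH".
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMdd_HH");

    /** Returns the names of the 24 full hours preceding {@code now}, oldest first. */
    public static List<String> lastTwentyFourHours(LocalDateTime now) {
        LocalDateTime currentHour = now.truncatedTo(ChronoUnit.HOURS);
        List<String> names = new ArrayList<>();
        for (int hoursAgo = 24; hoursAgo >= 1; hoursAgo--) {
            names.add(FORMAT.format(currentHour.minusHours(hoursAgo)));
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints one candidate file name per line.
        lastTwentyFourHours(LocalDateTime.now()).forEach(System.out::println);
    }
}
```

Each name would still need the bucket path and any file prefix or suffix that DFP uses, which the question does not specify.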

However, we've found that DFP sometimes fails to write the file(s) for some hours. We've raised that issue with the DFP team.

In the meantime, is there a way to configure our Dataflow job to ignore any missing GCS files rather than fail? It currently fails if one or more files don't exist.

Recommended answer

Using Dataflow APIs like TextIO.Read or AvroIO.Read to read from a non-existent file will, of course, throw an error and cause the pipeline to fail. This is working as intended, and I cannot think of a direct workaround.

That said, reading from a file pattern like yyyyMMdd_* may solve your problem, at least partially. Dataflow will expand the file pattern into the set of matching files and process them. As long as at least one file matches the pattern, the pipeline should proceed.
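As a rough sketch of what the pattern-based approach might look like, the snippet below builds the day-level patterns that cover the last 24 full hours. The class and method names are hypothetical and only the JDK is used. Note the assumption that a day-level pattern matches every hour of that day, so two patterns together can cover more than 24 hours, which may be one reason this only partially solves the problem.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Dataflow SDK.
final class DayPatterns {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    /** Returns one "yyyyMMdd_*" pattern per calendar day touched by the last 24 full hours. */
    public static List<String> patternsCovering(LocalDateTime now) {
        LocalDateTime currentHour = now.truncatedTo(ChronoUnit.HOURS);
        LocalDate firstDay = currentHour.minusHours(24).toLocalDate();
        LocalDate lastDay = currentHour.minusHours(1).toLocalDate();

        List<String> patterns = new ArrayList<>();
        for (LocalDate day = firstDay; !day.isAfter(lastDay); day = day.plusDays(1)) {
            patterns.add(DAY.format(day) + "_*");
        }
        return patterns;
    }

    public static void main(String[] args) {
        // Each pattern would become one file-pattern source instead of one source per hour.
        patternsCovering(LocalDateTime.now()).forEach(System.out::println);
    }
}
```

With this shape, a missing hourly file no longer breaks the job as long as each pattern still matches at least one file.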

Having one source per file is often an anti-pattern: it is less efficient and less elegant, though functionally the same. Nevertheless, you can still make it work by using the Google Cloud Storage API to confirm the presence of each file before constructing your Dataflow pipeline. If an input file is not present, you simply skip generating the source for it.
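The pre-check described above can be sketched as a simple filter that runs before the pipeline is built. The existence check is passed in as a predicate here so the example stays self-contained. In a real job that predicate would be backed by a Google Cloud Storage client call, which is not shown and is an assumption on my part. All class, method and file names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical helper, not part of the Dataflow SDK or the GCS client library.
final class ExistingInputs {
    /**
     * Keeps only the candidates for which {@code exists} returns true, preserving order.
     * Assumption: in production, {@code exists} would query GCS for the object.
     */
    public static List<String> keepExisting(List<String> candidates, Predicate<String> exists) {
        List<String> present = new ArrayList<>();
        for (String candidate : candidates) {
            if (exists.test(candidate)) {
                present.add(candidate);
            }
        }
        return present;
    }

    public static void main(String[] args) {
        // Stand-in for the bucket contents: the 04 hour was never written.
        Set<String> bucket = new HashSet<>(Arrays.asList("20160302_03", "20160302_05"));
        List<String> wanted = Arrays.asList("20160302_03", "20160302_04", "20160302_05");

        // Only the surviving names would get a read source added to the Flatten.
        System.out.println(keepExisting(wanted, bucket::contains));
    }
}
```

One source would then be created per surviving name and fed into the existing Flatten, so a missing hour is skipped instead of failing the job.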

Either way, please keep in mind the eventual consistency guarantee provided by the GCS list API. This means that expanding a file pattern may not immediately return every file that would otherwise be readable. The one-source-per-file anti-pattern may be a good workaround for this case, however.
