如何指定数据流作业的多个输入路径 [英] How to specify multiple input paths to a Dataflow job

查看：90 发布时间：2020/11/16 0:46:46 java glob google-cloud-dataflow

本文介绍了如何指定数据流作业的多个输入路径的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我想在来自Google Cloud Storage的多个输入上运行数据流作业，但是仅使用* glob运算符不能指定我要传递给作业的路径.

I want to run a Dataflow job over multiple inputs from Google Cloud Storage, but the paths I want to pass to the job can't be specified with just the * glob operator.

考虑以下路径:

gs://bucket/some/path/20160208/input1
gs://bucket/some/path/20160208/input2
gs://bucket/some/path/20160209/input1
gs://bucket/some/path/20160209/input2
gs://bucket/some/path/20160210/input1
gs://bucket/some/path/20160210/input2
gs://bucket/some/path/20160211/input1
gs://bucket/some/path/20160211/input2
gs://bucket/some/path/20160212/input1
gs://bucket/some/path/20160212/input2

我希望我的工作可以处理20160209，20160210和20160211目录中的文件，但不能用于20160208(第一个)和20160212(最后一个).实际上，还有更多的日期，我希望能够为工作指定任意日期范围.

I want my job to work on the files in the 20160209, 20160210 and 20160211 directories, but not on 20160208 (the first) and 20160212 (the last). In reality there's a lot of more dates, and I want to be able to specify an arbitrary range of dates for my job to work on.

The docs for TextIO.Read say:

支持标准Java Filesystem全局模式("*"，?"，"[..]").

Standard Java Filesystem glob patterns ("*", "?", "[..]") are supported.

但是我无法使它正常工作.有一个指向 Java文件系统全局模式的链接，该链接又会链接到 getPathMatcher (字符串)，其中列出了所有的通配符选项.其中之一是{a,b,c}，它看起来完全符合我的需求，但是，如果将gs://bucket/some/path/201602{09,10,11}/*传递给TextIO.Read#from，则会显示无法扩展文件模式".

But I can't get this to work. There's a link to Java Filesystem glob patterns , which in turn links to getPathMatcher(String), that lists all the globbing options. One of them is {a,b,c}, which looks exactly like what I need, however, if I pass gs://bucket/some/path/201602{09,10,11}/* to TextIO.Read#from I get "Unable to expand file pattern".

也许文档的意思是仅支持 *，?和[…]，如果是这种情况，我该如何构造一个Dataflow将接受的glob，并且可以匹配我上面所述的任意日期范围?

Maybe the docs mean that only *, ? and […] are supported, and if that is the case, how can I construct a glob that Dataflow will accept and that can match an arbitrary date range like the one I describe above?

更新:我发现我可以编写一段代码，以便我可以将路径前缀作为逗号分隔的列表传递，并从每个路径创建一个输入.使用Flatten转换，但这似乎是一种非常低效的方式.第一步似乎要读取所有输入文件，然后立即将它们再次写出到GCS上的临时位置.仅当所有输入都已被读取和写入时，实际处理才开始.在我正在写的工作中，这一步骤是完全不必要的.我希望作业读取第一个文件，开始处理它，然后读取下一个文件，依此类推.这只是造成了很多其他问题，我将尽力使它工作，但是感觉就像死胡同了.因为最初的重写.

Update: I've figured out that I can write a chunk of code to so that I can pass in the path prefixes as a comma separated list, create an input from each and use the Flatten transform, but that seems like a very inefficient way of doing it. It looks like the first step reads all input files and immediately write them out again to the temporary location on GCS. Only when all the inputs have been read and written the actual processing starts. This step is completely unnecessary in the job I'm writing. I want the job to read the first file, start processing it and read the next, and so on. This just caused a ton other problems, I'll try to make it work, but it feels like a dead end because of the initial rewriting.

如何指定数据流作业的多个输入路径 [英] How to specify multiple input paths to a Dataflow job

问题描述

推荐答案

相关文章

Java开发最新文章

热门教程

热门工具

登录关闭

如何指定数据流作业的多个输入路径 [英] How to specify multiple input paths to a Dataflow job

问题描述

推荐答案

相关文章

Java开发最新文章

热门教程

热门工具

登录 关闭

登录关闭