如何指定数据流作业的多个输入路径 [英] How to specify multiple input paths to a Dataflow job

查看:90
本文介绍了如何指定数据流作业的多个输入路径的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想在来自Google Cloud Storage的多个输入上运行数据流作业,但是仅使用* glob运算符不能指定我要传递给作业的路径.

I want to run a Dataflow job over multiple inputs from Google Cloud Storage, but the paths I want to pass to the job can't be specified with just the * glob operator.

考虑以下路径:

gs://bucket/some/path/20160208/input1
gs://bucket/some/path/20160208/input2
gs://bucket/some/path/20160209/input1
gs://bucket/some/path/20160209/input2
gs://bucket/some/path/20160210/input1
gs://bucket/some/path/20160210/input2
gs://bucket/some/path/20160211/input1
gs://bucket/some/path/20160211/input2
gs://bucket/some/path/20160212/input1
gs://bucket/some/path/20160212/input2

我希望我的工作可以处理201602092016021020160211目录中的文件,但不能用于20160208(第一个)和20160212(最后一个).实际上,还有更多的日期,我希望能够为工作指定任意日期范围.

I want my job to work on the files in the 20160209, 20160210 and 20160211 directories, but not on 20160208 (the first) and 20160212 (the last). In reality there's a lot of more dates, and I want to be able to specify an arbitrary range of dates for my job to work on.

The docs for TextIO.Read say:

支持标准Java Filesystem全局模式("*",?","[..]").

Standard Java Filesystem glob patterns ("*", "?", "[..]") are supported.

但是我无法使它正常工作.有一个指向 Java文件系统全局模式的链接,该链接又会链接到 getPathMatcher (字符串),其中列出了所有的通配符选项.其中之一是{a,b,c},它看起来完全符合我的需求,但是,如果将gs://bucket/some/path/201602{09,10,11}/*传递给TextIO.Read#from,则会显示无法扩展文件模式".

But I can't get this to work. There's a link to Java Filesystem glob patterns , which in turn links to getPathMatcher(String), that lists all the globbing options. One of them is {a,b,c}, which looks exactly like what I need, however, if I pass gs://bucket/some/path/201602{09,10,11}/* to TextIO.Read#from I get "Unable to expand file pattern".

也许文档的意思是仅支持 *?[…],如果是这种情况,我该如何构造一个Dataflow将接受的glob,并且可以匹配我上面所述的任意日期范围?

Maybe the docs mean that only *, ? and […] are supported, and if that is the case, how can I construct a glob that Dataflow will accept and that can match an arbitrary date range like the one I describe above?

更新:我发现我可以编写一段代码,以便我可以将路径前缀作为逗号分隔的列表传递,并从每个路径创建一个输入.使用Flatten转换,但这似乎是一种非常低效的方式.第一步似乎要读取所有输入文件,然后立即将它们再次写出到GCS上的临时位置.仅当所有输入都已被读取和写入时,实际处理才开始.在我正在写的工作中,这一步骤是完全不必要的.我希望作业读取第一个文件,开始处理它,然后读取下一个文件,依此类推.这只是造成了很多其他问题,我将尽力使它工作,但是感觉就像死胡同了.因为最初的重写.

Update: I've figured out that I can write a chunk of code to so that I can pass in the path prefixes as a comma separated list, create an input from each and use the Flatten transform, but that seems like a very inefficient way of doing it. It looks like the first step reads all input files and immediately write them out again to the temporary location on GCS. Only when all the inputs have been read and written the actual processing starts. This step is completely unnecessary in the job I'm writing. I want the job to read the first file, start processing it and read the next, and so on. This just caused a ton other problems, I'll try to make it work, but it feels like a dead end because of the initial rewriting.

推荐答案

文档确实确实意味着仅支持*?[...].这意味着字母或数字顺序的任意子集或范围不能表示为单个glob.

The docs do, indeed, mean that only *, ?, and [...] are supported. This means that arbitrary subsets or ranges in alphabetical or numeric order cannot be expressed as a single glob.

以下一些方法可能对您有用:

Here are some approaches that might work for you:

  1. 如果文件记录中还存在文件路径中表示的日期,那么最简单的解决方案是全部读取它们,并使用Filter转换选择所需的日期范围.
  2. 您尝试的方法是在单独的TextIO.Read转换中进行多次读取,并且将它们展平对于小型文件集是可以的;我们的 tf-idf示例可以做到这一点.您可以表示具有少量全局字符的任意数值范围,因此不必每次读取一个文件(例如,两个字符范围"23至67"为2[3-][3-5][0-9]6[0-7])
  3. 如果文件的子集更加随意,则glob/文件名的数量可能会超过最大图形大小,最后的建议是将文件列表放入PCollection中,并使用ParDo转换来读取每个文件并发出其内容.
  1. If the date represented in the file path is also present in the records in the files, then the simplest solution is to read them all and use a Filter transform to select the date range you are interested in.
  2. The approach you tried of many reads in a separates TextIO.Read transforms and flattening them is OK for small sets of files; our tf-idf example does this. You can express arbitrary numerical ranges with a small number of globs so this need not be one read per file (for example the two character range "23 through 67" is 2[3-] plus [3-5][0-9] plus 6[0-7])
  3. If the subset of files is more arbitrary then the number of globs/filenames may exceed the maximum graph size, and the last recommendation is to put the list of files into a PCollection and use a ParDo transform to read each file and emit its contents.

我希望这会有所帮助!

这篇关于如何指定数据流作业的多个输入路径的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆