How to match multiple files with names using TextIO.Read in Cloud Dataflow


Question

I have a GCS folder laid out as below:

gs://<bucket-name>/<folder-name>/dt=2017-12-01/part-0000.tsv
                                /dt=2017-12-02/part-0000.tsv
                                /dt=2017-12-03/part-0000.tsv
                                /dt=2017-12-04/part-0000.tsv
                                ...

I want to match only the files under dt=2017-12-02 and dt=2017-12-03 using sc.textFile() in Scio, which, as far as I know, uses TextIO.Read.from() underneath.

I tried these two filepatterns:

gs://<bucket-name>/<folder-name>/dt={2017-12-02,2017-12-03}/*.tsv

gs://<bucket-name>/<folder-name>/dt=2017-12-(02|03)/*.tsv

Both matched zero files:

INFO org.apache.beam.sdk.io.FileBasedSource - Filepattern gs://<bucket-name>/<folder-name>/dt={2017-12-02,2017-12-03}/*.tsv matched 0 files with total size 0

INFO org.apache.beam.sdk.io.FileBasedSource - Filepattern gs://<bucket-name>/<folder-name>/dt=2017-12-(02|03)/*.tsv matched 0 files with total size 0

What is a valid filepattern for doing this?

Recommended answer

You need to use the TextIO.readAll() transform, which reads a PCollection<String> of filepatterns. You can create the collection of filepatterns explicitly via Create.of(), or compute it using a ParDo.

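import org.apache.beam.sdk.io.TextIO
import org.apache.beam.sdk.transforms.{Create, PTransform}
import org.apache.beam.sdk.values.{PBegin, PCollection}
import scala.collection.JavaConverters._

// Wraps Create.of + TextIO.readAll in one root transform so Scio can use it as a custom input.
case class ReadPaths(paths: java.lang.Iterable[String]) extends PTransform[PBegin, PCollection[String]] {
  override def expand(input: PBegin): PCollection[String] =
    input
      .apply(Create.of(paths))   // PCollection of filepatterns
      .apply(TextIO.readAll())   // lines of every file matched by those patterns
}

// One entry per file (or filepattern) to read.
val paths = Seq(
  "gs://<bucket-name>/<folder-name>/dt=2017-07-01/part-0000.tsv",
  "gs://<bucket-name>/<folder-name>/dt=2017-12-20/part-0000.tsv",
  "gs://<bucket-name>/<folder-name>/dt=2018-03-29/part-0000.tsv",
  "gs://<bucket-name>/<folder-name>/dt=2018-05-04/part-0000.tsv"
)

// sc is the ScioContext; the result is an SCollection[String] of lines.
sc.customInput("Read Paths", ReadPaths(paths.asJava))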
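
When the wanted partitions form a date range, as in the question, the list of filepatterns does not have to be typed out by hand. The following is a minimal sketch, assuming the dt=YYYY-MM-DD directory layout and the .tsv extension shown in the question; DatePartitionPaths and its between method are hypothetical names for illustration and are not part of Scio or Beam.

```scala
import java.time.LocalDate

// Hypothetical helper (not a Scio or Beam API): builds one filepattern
// per day for a dt=YYYY-MM-DD partitioned folder.
object DatePartitionPaths {
  // Returns patterns for every day from start to end, both inclusive.
  // An empty list is returned when end is before start.
  def between(prefix: String, start: LocalDate, end: LocalDate): Seq[String] =
    Iterator
      .iterate(start)(_.plusDays(1))         // start, start + 1 day, ...
      .takeWhile(day => !day.isAfter(end))   // stop once past the end date
      .map(day => s"$prefix/dt=$day/*.tsv")  // LocalDate prints as yyyy-MM-dd
      .toList
}

object DatePartitionPathsExample extends App {
  val patterns = DatePartitionPaths.between(
    "gs://<bucket-name>/<folder-name>",
    LocalDate.of(2017, 12, 2),
    LocalDate.of(2017, 12, 3)
  )
  patterns.foreach(println)
}
```

The resulting Seq[String] can be converted with .asJava and passed in wherever a hand-written list of paths would go, since each element is read as a filepattern.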
