TextIO.Read GCS folders into pipeline with past 30 days date as name


Question

I want to read a rolling window of the past 30 days into my pipeline. For example, on Jan 15, 2017, I want to read:

> gs://bucket/20170115/*
> gs://bucket/20170114/*
> .
> .
> .
> gs://bucket/20161216/*

The linked page says that the glob patterns "*", "?", and "[..]" are supported.

There is a similar question, but it has no good example.

I am trying to avoid doing 30 TextIO.Read steps and then flattening all the PCollections into one, because this causes hot shards in the pipeline.

Answer

When reading files from GCS, TextIO supports the same wildcard patterns as GCS, described here: Wildcard Names.

In the answer for the question you linked, bullet #2 suggests forming a small number of globs to represent your full range:

For example, the two-character range "23 through 67" is 2[3-9] plus [3-5][0-9] plus 6[0-7].
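
To see that three globs really do cover a range like 23 through 67 with no gaps and no extras, here is a minimal local check. It is only a sketch: it assumes that character classes such as [3-9] behave the same in these globs as in Java regular expressions, so java.util.regex stands in for GCS glob matching, and the class and method names (GlobRangeCheck, matched) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class GlobRangeCheck {
    // The three globs for "23 through 67". Regex is used as a local
    // stand-in for glob matching (assumption: the character classes
    // used here mean the same thing in both syntaxes).
    static final List<Pattern> GLOBS = Arrays.asList(
        Pattern.compile("2[3-9]"),
        Pattern.compile("[3-5][0-9]"),
        Pattern.compile("6[0-7]"));

    // Returns every two-digit string from 00 to 99 matched by at least one glob.
    static List<String> matched() {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            String s = (i < 10 ? "0" : "") + i;
            for (Pattern p : GLOBS) {
                if (p.matcher(s).matches()) {
                    out.add(s);
                    break; // count each value once
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> m = matched();
        // prints "23 .. 67 (45 values)"
        System.out.println(m.get(0) + " .. " + m.get(m.size() - 1)
            + " (" + m.size() + " values)");
    }
}
```

The same idea carries over to date-named folders: a month boundary splits the window into a few digit ranges, each of which becomes one glob.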


TextIO also has a new API readAll() which allows you to specify input files dynamically as data. This allows you to pass in the exact set of filenames you need:

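// Returns one file glob per day in the window ending at `now`,
// for example gs://bucket/20170115/*
private static List<String> generate30DayFileGlobs(DateTime now) {
  // ..
}

public static void main(String[] args) {
  Pipeline p = // ..

  // The globs become pipeline data; readAll() expands each one into lines of text.
  p.apply(Create.<String>of(generate30DayFileGlobs(DateTime.now())))
   .apply(TextIO.readAll());

  // ..
}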
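
The body of the glob generator is elided above. One possible way to fill it in is sketched below. This is not the answer's own implementation: it uses java.time.LocalDate instead of the DateTime type in the snippet so that it runs without extra dependencies, and the names DailyGlobs and dailyFileGlobs, as well as the bucket and days parameters, are hypothetical. The day count is a parameter because the listing in the question, from 20161216 to 20170115 inclusive, actually spans 31 folders.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

class DailyGlobs {
    // Folder names in the question look like 20170115, i.e. yyyyMMdd.
    private static final DateTimeFormatter FOLDER = DateTimeFormatter.BASIC_ISO_DATE;

    // Builds one glob per day, newest first, for `days` days ending at `end` (inclusive).
    // `bucket` is a placeholder; the question only ever shows "bucket".
    static List<String> dailyFileGlobs(String bucket, LocalDate end, int days) {
        List<String> globs = new ArrayList<>(days);
        for (int i = 0; i < days; i++) {
            globs.add("gs://" + bucket + "/" + end.minusDays(i).format(FOLDER) + "/*");
        }
        return globs;
    }

    public static void main(String[] args) {
        List<String> globs = dailyFileGlobs("bucket", LocalDate.of(2017, 1, 15), 30);
        System.out.println(globs.get(0));                 // gs://bucket/20170115/*
        System.out.println(globs.get(globs.size() - 1));  // gs://bucket/20161217/*
    }
}
```

The resulting list can be handed to Create.of(...) as in the pipeline snippet above, so that the set of folders is ordinary input data rather than 30 separate read steps.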

The new TextIO.readAll() API has not yet been released, but you can build from master by specifying the Beam artifact version 2.2.0-SNAPSHOT. The 2.2.0 release is in progress and should be available sometime in September.
