Reading files and folders in order with Apache Beam


Question

I have a folder structure of the form year/month/day/hour/*, and I'd like Beam to read it as an unbounded source in chronological order. Specifically, this means reading all the files in the first hour on record and adding their contents for processing, then adding the file contents of the next hour, and so on up to the current time, where it waits for new files to arrive in the latest year/month/day/hour folder.

Is it possible to do this with Apache Beam?

Recommended answer

What I would do is add a timestamp to each element according to its file path. As a test I used the following example.

First of all, as explained in this answer, you can use FileIO to continuously match a file pattern. This helps because, per your use case, once the backfill has finished you want to keep reading newly arriving files within the same job. In this case I provide gs://BUCKET_NAME/data/** because my files will look like gs://BUCKET_NAME/data/year/month/day/hour/filename.extension:

p
    .apply(FileIO.match()
        .filepattern(inputPath)
        .continuously(
            // Check for new files every minute
            Duration.standardMinutes(1),
            // Never stop checking for new files
            Watch.Growth.<String>never()))
    .apply(FileIO.readMatches())

The watch frequency and timeout can be adjusted as needed.

In the next step we receive the matched files. I use ReadableFile.getMetadata().resourceId() to get the full path and split it by "/" to build the corresponding timestamp. I round it to the hour and do not apply any timezone correction here. With readFullyAsUTF8String we read the whole file and split it into lines (be careful if the whole file does not fit into memory; shard your input if needed). With ProcessContext.outputWithTimestamp we emit downstream a KV of filename and line, together with the timestamp derived from the path. The filename is no longer needed at this point, but it helps to see where each line comes from.

Note that we are shifting timestamps "back in time", which can confuse the watermark heuristics, and you will get a message such as:

Cannot output with timestamp 2019-03-17T00:00:00.000Z. Output timestamps must be no earlier than the timestamp of the current input (2019-06-05T15:41:29.645Z) minus the allowed skew (0 milliseconds). See the DoFn#getAllowedTimestampSkew() Javadoc for details on changing the allowed skew.

To overcome this I overrode getAllowedTimestampSkew to return Long.MAX_VALUE, but keep in mind that this method is deprecated. ParDo code:

.apply("Add Timestamps", ParDo.of(new DoFn<ReadableFile, KV<String, String>>() {

    // Deprecated, but needed here because the output timestamps are much
    // earlier than the timestamp of the input element.
    @Override
    public Duration getAllowedTimestampSkew() {
        return new Duration(Long.MAX_VALUE);
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        ReadableFile file = c.element();
        String fileName = file.getMetadata().resourceId().toString();

        // The path ends in .../year/month/day/hour/filename
        String[] dateFields = fileName.split("/");
        int numElements = dateFields.length;

        String hour = dateFields[numElements - 2];
        String day = dateFields[numElements - 3];
        String month = dateFields[numElements - 4];
        String year = dateFields[numElements - 5];

        String ts = String.format("%s-%s-%s %s:00:00", year, month, day, hour);
        Log.info(ts);

        try {
            // Reads the whole file into memory
            String[] lines = file.readFullyAsUTF8String().split("\n");

            // dateTimeFormat is defined elsewhere; assumed to be a Joda-Time
            // DateTimeFormatter for the pattern "yyyy-MM-dd HH:mm:ss"
            Instant eventTime = new Instant(dateTimeFormat.parseMillis(ts));

            for (String line : lines) {
                c.outputWithTimestamp(KV.of(fileName, line), eventTime);
            }
        } catch (IOException e) {
            Log.info("failed");
        }
    }}))
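
The path-to-timestamp step can also be isolated and checked outside the pipeline. The following is only a minimal sketch under stated assumptions: it uses java.time instead of Joda-Time, the class name PathTimestamps and the method hourFromPath are hypothetical (they are not part of Beam or of the code above), and it assumes paths end in year/month/day/hour/filename with the hour expressed in UTC.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class PathTimestamps {

    // Hypothetical helper: derives the hour-level event time from a path
    // ending in .../year/month/day/hour/filename (hour assumed to be UTC).
    public static Instant hourFromPath(String path) {
        String[] parts = path.split("/");
        int n = parts.length;
        if (n < 5) {
            throw new IllegalArgumentException(
                "Path has no year/month/day/hour folders: " + path);
        }
        // Counting from the end: filename, hour, day, month, year
        int year = Integer.parseInt(parts[n - 5]);
        int month = Integer.parseInt(parts[n - 4]);
        int day = Integer.parseInt(parts[n - 3]);
        int hour = Integer.parseInt(parts[n - 2]);
        return LocalDateTime.of(year, month, day, hour, 0)
            .toInstant(ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        // prints 2019-03-17T00:00:00Z
        System.out.println(
            hourFromPath("gs://BUCKET_NAME/data/2019/03/17/00/file1"));
    }
}
```

Unlike the ParDo above, this sketch fails loudly on a path that is too short or has non-numeric folders, which may be preferable to an ArrayIndexOutOfBoundsException inside the DoFn.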

Finally, I window into 1-hour FixedWindows and log the results:

.apply(Window
    .<KV<String,String>>into(FixedWindows.of(Duration.standardHours(1)))
    .triggering(AfterWatermark.pastEndOfWindow())
    .discardingFiredPanes()
    .withAllowedLateness(Duration.ZERO))
.apply("Log results", ParDo.of(new DoFn<KV<String, String>, Void>() {
    @ProcessElement
    public void processElement(ProcessContext c, BoundedWindow window) {
        String file = c.element().getKey();
        String value = c.element().getValue();
        String eventTime = c.timestamp().toString();

        String logString = String.format("File=%s, Line=%s, Event Time=%s, Window=%s", file, value, eventTime, window.toString());
        Log.info(logString);
    }
}));

For me it worked with .withAllowedLateness(Duration.ZERO), but depending on the order in which files arrive you might need to set a larger value. Keep in mind that a value that is too high will keep windows open for longer and use more persistent storage.
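
Because every line of a file gets a timestamp at the top of its hour, all lines from the same hour folder fall into the same 1-hour window. The sketch below illustrates how a timestamp maps to an epoch-aligned fixed window using only java.time. It mirrors the idea behind FixedWindows rather than Beam's actual implementation, and the class name HourWindows and its methods are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

public class HourWindows {

    // Start (inclusive) of the epoch-aligned fixed window containing ts.
    public static Instant windowStart(Instant ts, Duration size) {
        long sizeMs = size.toMillis();
        long tsMs = ts.toEpochMilli();
        // floorMod keeps the result correct for timestamps before 1970 too
        return Instant.ofEpochMilli(tsMs - Math.floorMod(tsMs, sizeMs));
    }

    // End (exclusive) of the same window.
    public static Instant windowEnd(Instant ts, Duration size) {
        return windowStart(ts, size).plus(size);
    }

    public static void main(String[] args) {
        Duration oneHour = Duration.ofHours(1);
        Instant ts = Instant.parse("2019-03-18T22:00:00Z");
        System.out.println(windowStart(ts, oneHour) + " .. " + windowEnd(ts, oneHour));
    }
}
```

With the two test files used here, the lines of file1 would land in the window starting at 2019-03-17T00:00:00Z and those of file2 in the window starting at 2019-03-18T22:00:00Z.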

I set the $BUCKET and $PROJECT variables and uploaded just two files:

gsutil cp file1 gs://$BUCKET/data/2019/03/17/00/
gsutil cp file2 gs://$BUCKET/data/2019/03/18/22/

And ran the job with:

mvn -Pdataflow-runner compile -e exec:java \
 -Dexec.mainClass=com.dataflow.samples.ChronologicalOrder \
      -Dexec.args="--project=$PROJECT \
      --path=gs://$BUCKET/data/** \
      --stagingLocation=gs://$BUCKET/staging/ \
      --runner=DataflowRunner"

Results: [screenshot not preserved]

Full code: [link not preserved]

Let me know how this works. This is just an example to get started; you might need to adjust the windowing and triggering strategies, lateness, etc. to suit your use case.
