在 Apache Beam 中观察与文件模式匹配的新文件 [英] Watching for new files matching a filepattern in Apache Beam

查看:21
本文介绍了在 Apache Beam 中观察与文件模式匹配的新文件的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在 GCS 或其他受支持的文件系统上有一个目录,外部进程正在向其中写入新文件.

I have a directory on GCS or another supported filesystem to which new files are being written by an external process.

我想编写一个 Apache Beam 流媒体管道,它会持续监视此目录中的新文件,并在每个新文件到达时读取和处理它.这可能吗?

I would like to write an Apache Beam streaming pipeline that continuously watches this directory for new files and reads and processes each new file as it arrives. Is this possible?

推荐答案

这可以从 Apache Beam 2.2.0 开始.多个 API 支持此用例:

This is possible starting with Apache Beam 2.2.0. Several APIs support this use case:

如果您使用 TextIOAvroIO,它们会通过 TextIO.read().watchForNewFiles() 和相同的方式明确支持这一点在 readAll() 上,例如:

If you're using TextIO or AvroIO, they support this explicitly via TextIO.read().watchForNewFiles() and the same on readAll(), for example:

PCollection<String> lines = p.apply(TextIO.read()
    .from("gs://path/to/files/*")
    .watchForNewFiles(
        // Check for new files every 30 seconds
        Duration.standardSeconds(30),
        // Never stop checking for new files
        Watch.Growth.<String>never()));

如果您使用不同的文件格式,您可以使用 FileIO.match().continuously()FileIO.matchAll().continuously()支持相同的 API,结合 FileIO.readMatches().

If you're using a different file format, you may use FileIO.match().continuously() and FileIO.matchAll().continuously() which support the same API, in combination with FileIO.readMatches().

API 支持指定检查新文件的频率以及何时停止检查(支持的条件是例如如果在给定时间内没有新输出出现"、在观察 N 个输出之后"、在给定时间之后"开始检查"及其组合).

The APIs support specifying how often to check for new files, and when to stop checking (supported conditions are e.g. "if no new output appears within a given time", "after observing N outputs", "after a given time since starting to check" and their combinations).

请注意,目前此功能目前仅适用于 Direct runner 和 Dataflow runner,并且仅适用于 Java SDK.一般来说,它适用于任何支持 Splittable DoFn 的运行器(参见 能力矩阵).

Note that right now this feature currently works only in the Direct runner and the Dataflow runner, and only in the Java SDK. In general, it will work in any runner that supports Splittable DoFn (see capability matrix).

这篇关于在 Apache Beam 中观察与文件模式匹配的新文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆