How can I improve the performance of TextIO or AvroIO when reading a very large number of files?


Question

TextIO.read() and AvroIO.read() (as well as some other Beam IOs) by default do not perform well in current Apache Beam runners when reading a filepattern that expands into a very large number of files, for example 1M files.

Recommended answer

When you know in advance that the filepattern being read with TextIO or AvroIO will expand into a large number of files, you can use the recently added .withHintMatchesManyFiles() option, which is currently implemented on TextIO and AvroIO.

For example:

PCollection<String> lines = p.apply(TextIO.read()
    .from("gs://some-bucket/many/files/*")
    .withHintMatchesManyFiles());

Using this hint causes the transforms to execute in a way that is optimized for reading a large number of files: the number of files that can be read is practically unlimited, and the pipeline will most likely run faster, cheaper, and more reliably than without the hint.

However, it may perform worse than without the hint if the filepattern actually matches only a small number of files (for example, a few dozen or a few hundred files).
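One way to act on this trade-off is to add the hint only when a large match is expected. The following is a minimal sketch, not a recommended threshold or an official pattern: the class name, the readLines helper, and the expectManyFiles flag are hypothetical, and the bucket path is the placeholder used in the example above. It assumes the Beam Java SDK (and, for gs:// paths, the GCP filesystem module) is on the classpath.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class HintedRead {

  /**
   * Builds a text read and adds the hint only when the caller expects the
   * filepattern to expand into a very large number of files.
   * expectManyFiles is a hypothetical flag supplied by the caller; it is not
   * part of the Beam API.
   */
  static PCollection<String> readLines(
      Pipeline p, String filepattern, boolean expectManyFiles) {
    TextIO.Read read = TextIO.read().from(filepattern);
    if (expectManyFiles) {
      // Builder methods return a new TextIO.Read, so reassign the result.
      read = read.withHintMatchesManyFiles();
    }
    return p.apply(read);
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // Placeholder path; a pattern like this is expected to match many files.
    readLines(p, "gs://some-bucket/many/files/*", true);
    p.run().waitUntilFinish();
  }
}
```

How the flag gets its value is left to the caller, for example from a pipeline option or from prior knowledge of the dataset layout.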

Under the hood, this hint causes the transforms to execute via TextIO.readAll() or AvroIO.readAll(), respectively. These are more flexible and scalable versions of read() that accept a PCollection<String> of filepatterns (where each String is a filepattern). The same caveat applies: if the total number of files matching the filepatterns is small, they may perform worse than a simple read() with the filepattern specified at pipeline construction time.
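To illustrate reading a PCollection<String> of filepatterns directly, here is a minimal sketch using TextIO.readAll(). The class name, the readAllLines helper, and the two bucket paths are assumptions made for illustration; in a real pipeline the filepatterns could just as well be produced by an upstream transform.

```java
import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class ReadAllExample {

  /** Reads every line of every file matched by any of the given filepatterns. */
  static PCollection<String> readAllLines(Pipeline p, List<String> filepatterns) {
    // Each element of this collection is one filepattern.
    PCollection<String> patterns =
        p.apply("Filepatterns", Create.of(filepatterns));
    // readAll() expands each pattern and reads the matched files.
    return patterns.apply("ReadAll", TextIO.readAll());
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // Hypothetical placeholder paths.
    readAllLines(
        p,
        Arrays.asList(
            "gs://some-bucket/many/files/*",
            "gs://some-bucket/more/files/*"));
    p.run().waitUntilFinish();
  }
}
```

In more recent Beam releases readAll() is marked deprecated in favor of composing FileIO.matchAll(), FileIO.readMatches(), and TextIO.readFiles(), so it is worth checking the Javadoc for the SDK version in use.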
