How can I improve performance of TextIO or AvroIO when reading a very large number of files?

Question

By default, TextIO.read() and AvroIO.read() (as well as some other Beam IOs) do not perform well in current Apache Beam runners when reading a filepattern that expands into a very large number of files, for example 1M files.

How can I read such a large number of files efficiently?

Recommended answer

When you know in advance that the filepattern being read with TextIO or AvroIO will expand into a large number of files, you can use the recently added .withHintMatchesManyFiles() feature, which is currently implemented on TextIO and AvroIO.

For example:

PCollection<String> lines = p.apply(TextIO.read()
    .from("gs://some-bucket/many/files/*")
    .withHintMatchesManyFiles());
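
The answer states that the hint is also implemented on AvroIO. The following is a minimal sketch of that case, not a definitive implementation. The record schema, the class name, and the bucket path are made up for illustration. The import uses org.apache.beam.sdk.io.AvroIO; in newer Beam releases the class lives in org.apache.beam.sdk.extensions.avro.io instead, so adjust the import to your SDK version.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class AvroManyFilesSketch {
  // Hypothetical schema, used only to make the example complete.
  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"Event\",\"fields\":"
          + "[{\"name\":\"id\",\"type\":\"string\"}]}";

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

    // Same hint as in the TextIO example, applied to an Avro read.
    // The path is a placeholder.
    PCollection<GenericRecord> records =
        p.apply(
            AvroIO.readGenericRecords(schema)
                .from("gs://some-bucket/many/avro-files/*")
                .withHintMatchesManyFiles());

    p.run().waitUntilFinish();
  }
}
```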

Using this hint causes the transforms to execute in a way optimized for reading a large number of files. The number of files that can be read in this case is practically unlimited, and the pipeline will most likely run faster, cheaper, and more reliably than without the hint.

However, if the filepattern actually matches only a small number of files (for example, a few dozen or a few hundred), the pipeline may perform worse with the hint than without it.

Under the hood, this hint causes the transforms to execute via TextIO.readAll() and AvroIO.readAll() respectively. These are more flexible and scalable versions of read() that read a PCollection<String> of filepatterns (where each String is a filepattern). The same caveat applies: if the total number of files matching the filepatterns is small, they may perform worse than a simple read() with the filepattern specified at pipeline construction time.
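
To illustrate reading a PCollection<String> of filepatterns, here is a minimal sketch that calls TextIO.readAll() directly. The class name and the second filepattern are made up for illustration, and Create.of is used only to build a small input. In a real pipeline the filepatterns could come from any upstream transform.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class ReadAllSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Each element is a filepattern, not a single file name.
    // Both paths are placeholders.
    PCollection<String> filepatterns =
        p.apply(
            Create.of(
                "gs://some-bucket/many/files/*",
                "gs://some-bucket/more/files/*"));

    // readAll() expands every filepattern and reads the lines of all matched files.
    PCollection<String> lines = filepatterns.apply(TextIO.readAll());

    p.run().waitUntilFinish();
  }
}
```

Because the filepatterns are ordinary pipeline elements here, they do not have to be known at pipeline construction time. Note that later Beam releases mark TextIO.readAll() as deprecated and point to FileIO.matchAll(), FileIO.readMatches(), and TextIO.readFiles() instead, so check the Javadoc for the SDK version you use.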
