Troubleshooting Apache Beam pipeline import errors [BoundedSource objects is larger than the allowable limit]

Question

I have a large number of text files (~1M) stored on Google Cloud Storage. When I read these files into a Google Cloud Dataflow pipeline for processing, I always get the following error:

Total size of the BoundedSource objects returned by BoundedSource.split() operation is larger than the allowable limit

The troubleshooting page says:

You might encounter this error if you're reading from a very large number of files via TextIO, AvroIO or some other file-based source. The particular limit depends on the details of your source (e.g. embedding schema in AvroIO.Read will allow fewer files), but it is on the order of tens of thousands of files in one pipeline.

Does that mean I have to split my files into smaller batches rather than import them all at once?

I'm using the Dataflow Python SDK to develop pipelines.

Recommended answer

Splitting your files into batches is a reasonable workaround, e.g. reading them with multiple ReadFromText transforms or with multiple pipelines. At the level of 1M files, though, I think the first approach will not work. It's better to use a new feature:

The best way to read a very large number of files is to use ReadAllFromText. It does not have scalability limitations (though it will perform worse if your number of files is very small).

ReadAllFromText will be available in Beam 2.2.0, but it is already available at HEAD if you're willing to use a snapshot build.

See also "How can I improve performance of TextIO or AvroIO when reading a very large number of files?" for a Java version.
