Troubleshooting Apache Beam pipeline import errors [BoundedSource objects larger than the allowable limit]


Problem Description

I have a bunch of text files (~1M) stored on google cloud storage. When I read these files into Google Cloud DataFlow pipeline for processing, I always get the following error:

Total size of the BoundedSource objects returned by BoundedSource.split() operation is larger than the allowable limit

The troubleshooting page says:

You might encounter this error if you're reading from a very large number of files via TextIO, AvroIO or some other file-based source. The particular limit depends on the details of your source (e.g. embedding schema in AvroIO.Read will allow fewer files), but it is on the order of tens of thousands of files in one pipeline.

Does that mean I have to split my files into smaller batches, rather than import them all at once?

I'm using dataflow python sdk for developing pipelines.
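For context, the read that triggers this error is typically a single glob pattern handed to ReadFromText. A minimal sketch with the Python SDK (the bucket path is hypothetical):

import apache_beam as beam
from apache_beam.io.textio import ReadFromText

# One glob matching ~1M files: BoundedSource.split() produces a source
# description per file, and their total serialized size can exceed the
# Dataflow limit, producing the error quoted above.
with beam.Pipeline() as p:
    lines = p | 'ReadEverything' >> ReadFromText('gs://my-bucket/texts/*.txt')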

Answer

Splitting your files into batches is a reasonable workaround - e.g. read them using multiple ReadFromText transforms, or using multiple pipelines. I think at the level of 1M files, the first approach will not work. It's better to use a new feature:
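As a sketch of that batching workaround, each batch gets its own ReadFromText transform and the results are merged with Flatten (the glob patterns are hypothetical):

import apache_beam as beam
from apache_beam.io.textio import ReadFromText

# Hypothetical glob patterns, each matching a manageable subset of the files.
batch_patterns = [
    'gs://my-bucket/texts/batch-0*/*.txt',
    'gs://my-bucket/texts/batch-1*/*.txt',
    'gs://my-bucket/texts/batch-2*/*.txt',
]

with beam.Pipeline() as p:
    # One ReadFromText per batch, merged into a single PCollection.
    per_batch = [
        p | 'Read{}'.format(i) >> ReadFromText(pattern)
        for i, pattern in enumerate(batch_patterns)
    ]
    lines = tuple(per_batch) | 'MergeBatches' >> beam.Flatten()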

The best way to read a very large number of files is using ReadAllFromText. It does not have scalability limitations (though it will perform worse if your number of files is very small).

It will be available in Beam 2.2.0, but it is already available at HEAD if you're willing to use a snapshot build.
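A minimal sketch of the ReadAllFromText approach in the Python SDK, assuming your SDK version includes it (the file pattern is hypothetical). The transform consumes a PCollection of file patterns or file names, so the expansion and reading happen as part of pipeline execution rather than in an up-front BoundedSource.split():

import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText

with beam.Pipeline() as p:
    lines = (
        p
        # Elements here are file patterns; ReadAllFromText expands and reads
        # them at execution time, avoiding the per-file source descriptions
        # that hit the size limit.
        | 'FilePatterns' >> beam.Create(['gs://my-bucket/texts/*.txt'])
        | 'ReadAll' >> ReadAllFromText()
    )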

See also How can I improve performance of TextIO or AvroIO when reading a very large number of files? for a Java version.
