BigQueryIO read performance using withTemplateCompatibility

Question

Apache Beam 2.1.0 had a bug in template pipelines that read from BigQuery: they could only be executed once. More details here: https://issues.apache.org/jira/browse/BEAM-2058

This was fixed in the Beam 2.2.0 release. You can now read from BigQuery using the withTemplateCompatibility option, and the template pipeline can be run multiple times.

  pipeline
    .apply("Read rows from table.",
        BigQueryIO.readTableRows()
            .withTemplateCompatibility()  // allows the template to be run more than once
            .from("<your-table>")
            .withoutValidation());

This implementation seems to come with a huge performance cost for the BigQueryIO read operation. Batch pipelines that used to run in 8-11 minutes now consistently take 45-50 minutes to complete. The only difference between the two pipelines is the .withTemplateCompatibility() call.

I am trying to understand the reason for the huge drop in performance and whether there is any way to improve it.

Thanks.

Solution, based on jkff's input: add a Reshuffle step directly after the read.

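  pipeline
    .apply("Read rows from table.",
        BigQueryIO.readTableRows()
            .withTemplateCompatibility()
            .from("<your-table>")
            .withoutValidation())
    // Reshuffle is org.apache.beam.sdk.transforms.Reshuffle; it redistributes
    // the rows so that downstream steps can run in parallel.
    .apply("Reshuffle", Reshuffle.viaRandomKey());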

Recommended answer

I suspect this is because withTemplateCompatibility comes at the cost of disabling dynamic rebalancing for this read step.

I would expect it to have a significant impact only if you're reading a small or moderate amount of data but performing very heavy processing on it. In that case, try adding a Reshuffle.viaRandomKey() right after your BigQueryIO.read(). It will materialize a temporary copy of the data, but it will parallelize downstream processing much better.
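To make the reasoning above concrete, here is a toy model in plain Java. It is not Beam code and does not reflect how the Dataflow runner actually schedules work. It only assumes that, without dynamic rebalancing, each read split is processed from start to finish by a single worker, whereas after a random-key reshuffle each row can land on any worker. The class name, method names, split sizes, worker count, and cost units are all hypothetical, and the cost of materializing the reshuffled data is ignored.

```java
import java.util.Random;

/**
 * Toy model (not Beam code): compares the finishing time of heavy per-row
 * processing when rows stay pinned to a few fixed read splits versus when
 * they are redistributed across workers by a random key.
 */
public class ReshuffleSketch {

    /**
     * Fixed splits: each split is handled entirely by one worker, so the
     * slowest (largest) split determines when the step finishes.
     * Assumes there are at least as many workers as splits.
     */
    public static long makespanFixedSplits(int[] splitSizes, long costPerRow) {
        long worst = 0;
        for (int size : splitSizes) {
            worst = Math.max(worst, size * costPerRow);
        }
        return worst;
    }

    /**
     * Random-key redistribution: every row is assigned to a uniformly random
     * worker, so the load per worker ends up close to total / workers.
     */
    public static long makespanAfterReshuffle(
            int[] splitSizes, int workers, long costPerRow, long seed) {
        Random random = new Random(seed);
        long[] load = new long[workers];
        for (int size : splitSizes) {
            for (int i = 0; i < size; i++) {
                load[random.nextInt(workers)] += costPerRow;
            }
        }
        long worst = 0;
        for (long workerLoad : load) {
            worst = Math.max(worst, workerLoad);
        }
        return worst;
    }

    public static void main(String[] args) {
        int[] splits = {9_000, 1_000}; // hypothetical: two uneven read splits
        int workers = 10;              // hypothetical worker count
        long costPerRow = 5;           // hypothetical cost units per row

        System.out.println("fixed splits:    "
                + makespanFixedSplits(splits, costPerRow));
        System.out.println("after reshuffle: "
                + makespanAfterReshuffle(splits, workers, costPerRow, 42L));
    }
}
```

In this model the fixed-split finishing time is set by the largest split, while the reshuffled finishing time stays near the total cost divided by the number of workers. That matches the answer's point that the effect matters most when the per-row processing is heavy relative to the amount of data read.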
