BigQueryIO Read performance using withTemplateCompatibility


Problem description

Apache Beam 2.1.0 had a bug with template pipelines that read from BigQuery, which meant they could only be executed once. More details here: https://issues.apache.org/jira/browse/BEAM-2058

This has been fixed with the release of Beam 2.2.0: you can now read from BigQuery using the withTemplateCompatibility option, and your template pipeline can be run multiple times.

  pipeline
    .apply("Read rows from table."
         , BigQueryIO.readTableRows()
                     .withTemplateCompatibility()
                     .from("<your-table>")
                     .withoutValidation())

This implementation seems to come with a huge performance cost for the BigQueryIO read operation: batch pipelines that used to run in 8-11 minutes now consistently take 45-50 minutes to complete. The only difference between the two pipelines is the .withTemplateCompatibility().

I am trying to understand the reasons for the huge drop in performance and whether there is any way to improve it.

Thanks.

Solution, based on jkff's input:

  pipeline
    .apply("Read rows from table."
         , BigQueryIO.readTableRows()
                     .withTemplateCompatibility()
                     .from("<your-table>")
                     .withoutValidation())
    .apply("Reshuffle",  Reshuffle.viaRandomKey())

Answer

I suspect this is due to the fact that withTemplateCompatibility comes at the cost of disabling dynamic rebalancing for this read step.

I would expect it to have a significant impact only if you're reading a small or moderate amount of data but performing very heavy processing on it. In this case, try adding a Reshuffle.viaRandomKey() onto your BigQueryIO.read(). It will materialize a temporary copy of the data, but will parallelize downstream processing much better.
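
For illustration, here is a minimal, self-contained sketch of what such a pipeline could look like. The "<your-table>" table spec and the "Heavy processing" DoFn are placeholders, and the options/runner setup is assumed; the point is that the reshuffle sits between the template-compatible read and the expensive downstream step, so that work is spread across workers even though the read itself no longer rebalances dynamically.

  import com.google.api.services.bigquery.model.TableRow;
  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
  import org.apache.beam.sdk.options.PipelineOptions;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.transforms.ParDo;
  import org.apache.beam.sdk.transforms.Reshuffle;

  public class TemplateCompatibleReadPipeline {
    public static void main(String[] args) {
      PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
      Pipeline pipeline = Pipeline.create(options);

      pipeline
          // Template-compatible read; dynamic rebalancing is disabled for this step.
          .apply("Read rows from table",
              BigQueryIO.readTableRows()
                  .withTemplateCompatibility()
                  .from("<your-table>")        // placeholder table spec
                  .withoutValidation())
          // Materialize a temporary copy of the rows and redistribute them,
          // so the heavy ParDo below is parallelized independently of the read.
          .apply("Reshuffle", Reshuffle.viaRandomKey())
          .apply("Heavy processing", ParDo.of(new DoFn<TableRow, TableRow>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
              // Hypothetical expensive per-row work would go here.
              c.output(c.element());
            }
          }));

      pipeline.run();
    }
  }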
