Performance issues on Dataflow batch loads using Apache Beam

Question

I was benchmarking Dataflow batch loads and found them far too slow compared with the same loads run through the BigQuery command-line tool.

The file was around 20 MB with millions of records. I tried different machine types and got the best load performance on n1-highmem-4, with a load time of approximately 8 minutes into the target BQ table.

When I ran the same table load with the BQ command-line utility, it took barely 2 minutes to process and load the same volume of data. Any insights into this poor load performance with Dataflow jobs? How can I improve the performance so that it is comparable to the BQ command-line utility?

Recommended answer

Most likely, a few minutes are being spent on starting and shutting down the worker VMs. If you are doing something that can be done directly with the BQ CLI, then using Dataflow for it is likely overkill. However, you can update your question with more details (e.g. your code and the Dataflow job ID); there may be something else inefficient going on.
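
To make the fixed-overhead argument concrete, here is a rough back-of-the-envelope sketch in Python. The 8-minute and 2-minute totals come from the question; the 4-minute VM startup and shutdown figure is purely an assumption for illustration (the real value would have to be read from the job's logs), and the function name is hypothetical.

```python
def fixed_overhead_share(total_minutes: float, overhead_minutes: float) -> float:
    """Return the fraction of a job's wall-clock time spent on fixed overhead."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= overhead_minutes <= total_minutes:
        raise ValueError("overhead_minutes must lie between 0 and total_minutes")
    return overhead_minutes / total_minutes


# Wall-clock totals reported in the question.
DATAFLOW_TOTAL_MIN = 8.0
BQ_CLI_TOTAL_MIN = 2.0

# ASSUMPTION: combined VM startup + shutdown time. This is not a measured
# value; replace it with the figure observed in your own job's logs.
ASSUMED_OVERHEAD_MIN = 4.0

share = fixed_overhead_share(DATAFLOW_TOTAL_MIN, ASSUMED_OVERHEAD_MIN)
remaining = DATAFLOW_TOTAL_MIN - ASSUMED_OVERHEAD_MIN

print(f"Fixed overhead share of the Dataflow run: {share:.0%}")
print(f"Time left for actual processing: {remaining:.1f} min "
      f"(BQ CLI total: {BQ_CLI_TOTAL_MIN:.1f} min)")
```

Under that assumption, the overhead does not shrink with the input, so for a 20 MB file it can dominate the run regardless of machine type. Any gap that remains after subtracting the measured overhead is what the pipeline code itself would need to explain.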
