Dataflow performance issues

Question

I'm aware that the CDF service was updated a few weeks ago (the default worker type and attached PD were changed), and that it was made clear this would make batch jobs slower. However, the performance of our jobs has degraded to the point where they no longer meet our business needs.

For example, one of our jobs reads ~2.7 million rows from a table in BigQuery, has 6 side inputs (BQ tables), does some simple String transformations, and finally writes 3 outputs to BigQuery. This used to take 5-6 minutes; now it takes anywhere between 15 and 20 minutes, no matter how many VMs we chuck at it.

Is there anything we can do to get the speeds back up to what we used to see?

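Here are some stats:

  1. Reading from a BQ table with 2,744,897 rows (294MB)
  2. 6 BQ side inputs
  3. 3 multi-outputs to BQ, 2 of which are 2,744,897 rows and the other 1,500 rows
  4. Running in zone asia-east1-b
  5. The times below include worker pool spin-up and tear-down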

10 VMs (n1-standard-2) 16 min 5 sec 2015-04-22_19_42_20-4740106543213058308

10 VMs (n1-standard-4) 17 min 11 sec 2015-04-22_20_04_58-948224342106865432

10 VMs (n1-standard-1) 18 min 44 sec 2015-04-22_19_42_20-4740106543213058308

20 VMs (n1-standard-2) 22 min 53 sec 2015-04-22_21_26_53-18171886778433479315

50 VMs (n1-standard-2) 17 min 26 sec 2015-04-22_21_51_37-16026777746175810525

100 VMs (n1-standard-2) 19 min 33 sec 2015-04-22_22_32_13-9727928405932256127
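
To make the scaling behaviour easier to see, the timings above can be turned into end-to-end throughput. This is only a rough sketch: the worker counts, machine types and durations are copied from the runs listed above, the helper name `rows_per_second` is made up for illustration, and the durations include worker pool spin-up and tear-down, so the figures are not pure processing rates.

```python
# Rough end-to-end throughput for each run listed above.
# (workers, machine type, minutes, seconds) are copied from the question.
RUNS = [
    (10, "n1-standard-2", 16, 5),
    (10, "n1-standard-4", 17, 11),
    (10, "n1-standard-1", 18, 44),
    (20, "n1-standard-2", 22, 53),
    (50, "n1-standard-2", 17, 26),
    (100, "n1-standard-2", 19, 33),
]

INPUT_ROWS = 2_744_897  # rows in the main BigQuery input table


def rows_per_second(minutes, seconds, rows=INPUT_ROWS):
    """Input rows divided by wall-clock seconds (includes pool start/stop)."""
    return rows / (minutes * 60 + seconds)


# Map (workers, machine type) -> rows per second.
throughput = {
    (workers, machine): rows_per_second(minutes, seconds)
    for workers, machine, minutes, seconds in RUNS
}

for (workers, machine), rate in throughput.items():
    print(f"{workers:>3} x {machine}: {rate:,.0f} rows/s")
```

If the job were bound by worker CPU, throughput would be expected to rise with the worker count. Here it stays within a narrow band, which points at a bottleneck outside the workers.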

Recommended answer

We tracked down the issue. It occurs when a side input reads from a BigQuery table that has had its data streamed in, rather than bulk loaded. When we copy the table(s) and read from the copies instead, everything works fine.

However, this is just a workaround. Dataflow should be able to handle streamed tables in BigQuery as side inputs.
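
For anyone wanting to script the copy workaround, here is a minimal sketch, not the code we actually ran. It assumes a client object shaped like `google.cloud.bigquery.Client` (it only uses `get_table` and `copy_table`). The function name `snapshot_if_streamed` and the table IDs are hypothetical. Treating a non-empty `streaming_buffer` as the sign of a "streamed" table is also an assumption: a table whose buffer has already been flushed would not be detected this way.

```python
def snapshot_if_streamed(client, source_id, copy_id):
    """Return the table ID a side input should read from.

    Assumption: a table counts as "streamed" while it still reports a
    streaming buffer. In that case it is copied to copy_id and copy_id is
    returned; otherwise source_id is returned unchanged.
    """
    table = client.get_table(source_id)
    if table.streaming_buffer is None:
        # No streaming buffer reported: read the original table directly.
        return source_id

    # Copy job; assumes copy_id does not exist yet, since a copy into a
    # non-empty destination fails under the default write disposition.
    job = client.copy_table(source_id, copy_id)
    job.result()  # block until the copy has finished
    return copy_id
```

With the real library this would be called as `snapshot_if_streamed(bigquery.Client(), "my-project.my_dataset.lookup", "my-project.my_dataset.lookup_copy")` once per side-input table before launching the job, and the pipeline would then read from the returned ID.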
