How can I maximize throughput for an embarrassingly-parallel task in Python on Google Cloud Platform?


Question



I am trying to use Apache Beam/Google Cloud Dataflow to speed up an existing Python application. The bottleneck of the application occurs after randomly permuting an input matrix N (default 125, but could be more) times, when the system runs a clustering algorithm on each matrix. The runs are fully independent of one another. I've captured the top of the pipeline below:

(Screenshot of the Dataflow pipeline's step timings not reproduced here.)

This processes the default 125 permutations. As you can see, only the RunClustering step takes an appreciable amount of time (there are 11 more steps, not shown, that total 11 more seconds). I ran the pipeline earlier today for just 1 permutation, and the RunClustering step took 3 seconds (close enough to 1/125th of the time shown above).
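For context, the unit of work being fanned out looks roughly like the sketch below. The clustering algorithm and matrix shapes aren't given in the question, so `toy_cluster` is a hypothetical stand-in; the point is that each permutation-plus-clustering run is independent and seedable.

```python
import numpy as np

def toy_cluster(matrix):
    """Hypothetical placeholder for the real clustering algorithm: labels
    each row by whether its mean exceeds the global mean of the matrix."""
    return (matrix.mean(axis=1) > matrix.mean()).astype(int)

def run_one_permutation(matrix, seed):
    """One fully independent unit of work: shuffle the rows of the input
    matrix, then cluster the shuffled copy."""
    rng = np.random.default_rng(seed)
    permuted = rng.permutation(matrix, axis=0)
    return toy_cluster(permuted)

if __name__ == "__main__":
    m = np.arange(20, dtype=float).reshape(5, 4)
    # N = 125 independent runs; these are what Dataflow should distribute.
    results = [run_one_permutation(m, seed) for seed in range(125)]
    print(len(results))  # 125
```

Each call depends only on the input matrix and its seed, which is exactly the shape of workload Dataflow should be able to spread across workers.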

I'd like the RunClustering step to finish in 3-4 seconds no matter what the input N is. My understanding is that Dataflow is the correct tool for speeding up embarrassingly-parallel computation on Google Cloud Platform, so I've spent a couple of weeks learning it and porting my code. Is my understanding correct? I've also tried throwing more machines at the problem (instead of relying on Autoscaling, which, for whatever reason, only scales up to 2-3 machines*) and specifying more powerful machine types, but those don't help.



*Is this because of a long startup time for VMs? Is there a way to use quickly-provisioned VMs, if that's the case? Another question I have is how to cut down on the pipeline startup time; it's a deal breaker if users can't get results back quickly, and the fact that the total Dataflow job time is 13–14 minutes (compared to the already excessive 6–7 for the pipeline) is unacceptable.

Solution

Your pipeline is suffering from excessive fusion, and ends up doing almost everything on one worker. This is also why autoscaling doesn't scale higher: it detects that it is unable to parallelize your job's code, so it prefers not to waste extra workers. This is also why manually throwing more workers at the problem doesn't help.

In general, fusion is a very important optimization, but excessive fusion is also a common problem. Ideally Dataflow would mitigate it automatically (the way it automatically mitigates imbalanced sharding), but that is even harder to do, though some ideas for it are in the works.

Meanwhile, you'll need to change your code to insert a "reshuffle" (a group-by-key followed by an ungroup will do - fusion never happens across a GroupByKey operation). See Preventing fusion; the question Best way to prevent fusion in Google Dataflow? contains some example code.

