Autoscaling in Google Cloud Dataflow is not working as expected

Problem description

I am trying to enable autoscaling in my Dataflow job as described in this article. I did that by setting the relevant algorithm via the following code:

DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);

After I set this and deployed my job, it always runs with the maximum number of CPUs available, i.e. if I set the maximum number of workers to 10, it uses all 10 CPUs even though average CPU usage is about 50%. How does this THROUGHPUT_BASED algorithm work, and where am I making a mistake?

Thanks.

Recommended answer

Although autoscaling tries to reduce both backlog and CPU, backlog reduction takes priority. The specific value of the backlog matters: Dataflow calculates "backlog in seconds" roughly as backlog / throughput and tries to keep it below 10 seconds.
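
As a rough illustration of that heuristic, here is a minimal sketch in Java. It is not the actual service code; the 10-second threshold is simply the target mentioned above, and the method and parameter names are made up for illustration.

// Hypothetical sketch of the "backlog seconds" heuristic; not the real Dataflow implementation.
static boolean shouldUpscale(double backlogElements, double throughputElementsPerSec) {
    // Estimated time to drain the current backlog at the current processing rate.
    double backlogSeconds = backlogElements / throughputElementsPerSec;
    // Dataflow aims to keep this figure below roughly 10 seconds.
    return backlogSeconds > 10.0;
}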

In your case, I think what is preventing downscaling from 10 workers is the policy regarding the persistent disks (PDs) used for pipeline execution. When the maximum number of workers is 10, Dataflow uses 10 persistent disks and, at any point in time, tries to keep the number of workers such that these disks are distributed roughly equally. As a consequence, when the pipeline is at its maximum of 10 workers, it tries to downscale to 5 rather than to 7 or 8. In addition, it tries to keep the projected CPU utilization after downscaling at no more than 80%.

These two factors might effectively be preventing downscaling in your case. If CPU utilization is 50% with 10 workers, the projected CPU utilization with 5 workers is 100%, so it does not downscale since that is above the 80% target.
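
A minimal sketch of that reasoning, assuming the even-disk-distribution rule and the 80% target described above (the helper name is hypothetical, not part of any Dataflow API):

// Hypothetical illustration of the downscaling constraints described above; not actual Dataflow code.
// With numDisks persistent disks, a candidate worker count is only considered if the disks can be
// spread evenly across workers, and if the projected CPU after downscaling stays at or below 80%.
static boolean canDownscaleTo(int numDisks, int currentWorkers, double currentCpu, int targetWorkers) {
    boolean disksSpreadEvenly = numDisks % targetWorkers == 0;
    double projectedCpu = currentCpu * currentWorkers / targetWorkers;
    return disksSpreadEvenly && projectedCpu <= 0.80;
}

// Example from above: 10 disks, 10 workers at 50% CPU.
// canDownscaleTo(10, 10, 0.50, 5) -> false, because projected CPU would be 100%.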

Google Dataflow is working on a new execution engine that does not depend on persistent disks and does not suffer from this limitation on downscaling.

A workaround for this is to set a higher max_workers; your pipeline might still stay at 10 workers or below. But that incurs a small increase in cost for the PDs.
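
For example, using the same options object as in the question (the value 20 is only an illustration, not a recommendation):

// Raise the worker ceiling; with 20 PDs, intermediate worker counts such as 10 or 5 remain valid targets.
// 20 is just an example value; pick a ceiling that fits your cost constraints.
options.setMaxNumWorkers(20);
options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);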

Another, more remote possibility is that sometimes, even after upscaling, the estimated "backlog seconds" might not stay below 10 seconds, even with enough CPU. This could be due to various factors (user code processing, Pub/Sub batching, etc.). I would like to hear whether this is affecting your pipeline.
