Azure Batch CloudJob Kicked Off From Azure WebJob Intermittently Never Executes Tasks

Question

Good morning!

TL;DR:

  1. Are we charged for idle node time in Azure Batch?
  2. Is there a best practice for ensuring that CloudTasks always execute on a CloudJob with an AutoPoolSpecification?

I have a nightly Azure WebJob that uses the Azure Batch .NET API to run, monitor, and terminate an Azure Batch CloudJob. It creates a CloudJob object with resource files for deployment, an AutoPoolSpecification, and a dynamic number of CloudTasks and child CloudTasks based on context and arguments.

When I first built this architecture a year ago, the tasks with no task dependencies would start in around 10-15 minutes, once the pool's nodes had spun up (we're using the smallest VMs, as we're pre-production). Then their children would run with only minor flukes here and there when a VM barfed or a task simply was never started. Over the last few months, I've noticed two degradations in the Batch service:

  1. This startup time has effectively doubled, with tasks now taking up to 30 minutes to start (give or take; observed in Batch Explorer but not measured). This is not critical.
  2. Randomly (maybe once a month?), the tasks never get executed. The nodes are created and started, but they never pick up any of the tasks. My WebJob runs for 12 hours, waiting for all the tasks to complete before downloading their logs and killing the job (and its auto-pool along with it). When this issue occurs, the tasks remain in the "Active" state until my timeout logic terminates everything. If I kick off my WebJob again, it'll usually work.
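The stall described in item 2 (nodes idle, tasks stuck in "Active") could in principle be detected well before a 12-hour timeout. Below is a minimal sketch of such a check, written in plain Python with no Azure SDK calls. The `JobSnapshot` type, its field names, and the 45-minute grace period are all assumptions made for illustration (the grace period is simply chosen to exceed the roughly 30-minute startup time mentioned above); the polling code that would fill in the snapshots from the Batch service is not shown.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class JobSnapshot:
    """One polling sample. All field names are hypothetical."""
    taken_at: datetime
    idle_nodes: int      # nodes that are up but not running anything
    active_tasks: int    # tasks still waiting to be scheduled
    running_tasks: int   # tasks currently executing


def is_stalled(history, grace=timedelta(minutes=45)):
    """Return True if the newest samples show idle nodes and waiting tasks,
    with nothing running, continuously for at least `grace`.

    `history` is a list of JobSnapshot ordered oldest to newest.
    """
    if not history:
        return False
    latest = history[-1]
    stall_start = None
    # Walk backwards to find when the current "stuck" streak began.
    for snap in reversed(history):
        looks_stuck = (
            snap.idle_nodes > 0
            and snap.active_tasks > 0
            and snap.running_tasks == 0
        )
        if not looks_stuck:
            break
        stall_start = snap.taken_at
    return stall_start is not None and latest.taken_at - stall_start >= grace
```

A monitoring loop could call `is_stalled` after each poll and, when it returns True, terminate the job and resubmit it rather than waiting out the full timeout. Whether resubmitting is the right reaction is a judgment call; the source only notes that a second run usually works.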

As I said, we're not live yet, and we don't have a technical support plan in place. So, generally, is this expected behavior caused by my architecture demanding too much of what are essentially ephemeral VMs? Do we get charged for 12 hours of 20 VMs when no actual bandwidth was consumed? I see lots of "Idle Node Counts" and no "Start Task Failed Node Counts" in the Batch account metrics. However, the meter seems to be running when I look at the "Total core hours" graph...
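To put a number on the exposure: as far as I understand Azure Batch pricing, the underlying VMs are billed for as long as they are allocated, whether or not they run any tasks, which would be consistent with the "Total core hours" graph moving. The sketch below just does the arithmetic for the scenario described. The hourly price is a made-up placeholder, not a real Azure rate.

```python
def node_hours(node_count, hours):
    """Total allocated node-hours, regardless of whether the nodes did any work."""
    return node_count * hours


def estimated_cost(node_count, hours, price_per_node_hour):
    """Rough cost of keeping `node_count` nodes allocated for `hours`."""
    return node_hours(node_count, hours) * price_per_node_hour


# The scenario from the question: 20 VMs held for the 12-hour timeout.
wasted = node_hours(20, 12)  # 240 node-hours

# Hypothetical price, for illustration only.
HYPOTHETICAL_PRICE_PER_NODE_HOUR = 0.05
cost = estimated_cost(20, 12, HYPOTHETICAL_PRICE_PER_NODE_HOUR)
```

With one stalled run a month, that is 240 node-hours of idle allocation per incident, which is the amount a shorter stall timeout would cut down.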

Is there a better way? Or could there be a bug somewhere?

Thanks so much!

Chris

Recommended answer

Thanks for the question, Chris. I am working on this offline and will update you once I have a concrete answer.

