AWS Step Functions vs Luigi for orchestration

Problem description

My team has a monolithic service for a small-scale project, but as part of a re-architecture and scaling effort we are planning to move to Amazon AWS cloud services. For orchestration, we are evaluating whether to run Luigi as a container task or to use AWS Step Functions instead. I don't have any experience with either of them, especially Luigi. Can anyone point out issues they have seen with Luigi, or how it can prove better than AWS Step Functions, if at all? Any other suggestions are welcome.

Thanks in advance.

Solution

I don't know how AWS does orchestration, but if you plan at any point to scale to at least thousands of jobs, I would not recommend investing in Luigi. Luigi is extremely useful for small to medium(ish) projects. It provides a fantastic interface for defining jobs and ensuring job completion through atomic filesystem actions. However, the problem with Luigi is the framework for running jobs. Luigi requires constant communication with its workers for them to run, which in my own experience saturated my network bandwidth when I tried to scale.
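To illustrate that interface, here is a minimal sketch, assuming a made-up TokenizeCorpus job (the class name and paths are hypothetical, not from the question):

    import luigi

    class TokenizeCorpus(luigi.Task):
        # Hypothetical job; the class name and paths are illustrative.
        corpus_id = luigi.Parameter()

        def output(self):
            # Luigi considers the task complete iff this target exists.
            return luigi.LocalTarget("data/%s.tokens" % self.corpus_id)

        def run(self):
            # LocalTarget.open("w") writes to a temporary file and renames it
            # into place on close, so a crashed run never leaves behind a
            # partial file that looks complete.
            with self.output().open("w") as out:
                out.write("...tokenized text...\n")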

For my research I generate a network of 10,000 tasks in a light-to-medium workflow, using my university's cluster computing grid, which runs SLURM. None of my tasks take long to complete, maybe 5 minutes max each. I tried three methods to use Luigi efficiently; they are listed after the sketch below.
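To make the scale concrete, here is a hedged sketch of what such a parameter fan-out can look like in Luigi (the Simulate task, paths, and trial counts are illustrative assumptions, not the answerer's actual code):

    import luigi

    class Simulate(luigi.Task):
        # Hypothetical ~5-minute leaf task.
        trial = luigi.IntParameter()

        def output(self):
            return luigi.LocalTarget("results/trial_%d.out" % self.trial)

        def run(self):
            with self.output().open("w") as out:
                out.write("result\n")

    class Experiment(luigi.WrapperTask):
        # requires() fans out into the whole task network: 10,000 parameter
        # values means 10,000 tasks for the scheduler and workers to track.
        n_trials = luigi.IntParameter(default=10000)

        def requires(self):
            return [Simulate(trial=i) for i in range(self.n_trials)]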

  1. SciLuigi's slurm task, to submit jobs to SLURM from a central Luigi worker (not using the central scheduler). This method works well if your jobs will be accepted and run quickly. However, it uses an unreasonable amount of resources on the scheduling node, since each worker is a new process. Further, it destroys any priority you would have in the system. A better method is to allocate many workers first and then have them work on jobs continually.

  2. The second method I attempted was just that. I started the Luigi central scheduler on my home server (because otherwise I could not monitor the state of the work, just as in the workflow above) and started workers on the SLURM cluster that all had the same configuration, so each of them could run any part of the experiment. The problem was that even with a 500 Mbps connection, past ~50 workers Luigi would stop functioning, and so would my internet connection to the server. So I began running jobs with only 50 workers, which drastically slowed my workflow. In addition, each worker had to register each job with the central scheduler (another huge pain point), which could take hours with only 50 workers.

  3. To mitigate this startup time, I decided to partition the root-task subtrees by their parameters and submit each subtree to SLURM separately (a sketch of this appears after the list). Now the startup time is reasonably low, but I lost the ability for any worker to run any job, which is still pretty important. Also, I can still only work with ~50 workers. Once the subtrees completed, I ran one last job to finish the experiment.
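A minimal sketch of that partitioning approach, assuming the same hypothetical Simulate task as above (the partition size, file paths, and sbatch invocation are illustrative, not the answerer's actual setup). Each SLURM job runs one parameter slice of the experiment with Luigi's local scheduler, so nothing has to be registered with a remote central scheduler at startup:

    import sys
    import luigi

    class Simulate(luigi.Task):
        # Same hypothetical ~5-minute leaf task as in the sketch above.
        trial = luigi.IntParameter()

        def output(self):
            return luigi.LocalTarget("results/trial_%d.out" % self.trial)

        def run(self):
            with self.output().open("w") as out:
                out.write("result\n")

    class SubtreePartition(luigi.WrapperTask):
        # One root-task subtree, selected by a partition parameter.
        partition = luigi.IntParameter()
        size = luigi.IntParameter(default=100)

        def requires(self):
            start = self.partition * self.size
            return [Simulate(trial=i) for i in range(start, start + self.size)]

    if __name__ == "__main__":
        # Submitted once per partition, e.g. from a shell loop:
        #   sbatch --wrap "python run_partition.py 7"
        # local_scheduler=True keeps scheduling inside this process, at the
        # cost that workers in one partition cannot pick up jobs from another.
        luigi.build([SubtreePartition(partition=int(sys.argv[1]))],
                    local_scheduler=True, workers=4)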

In conclusion, Luigi is great for small-to-medium workflows, but once you start hitting 1,000+ tasks and workers, the framework quickly fails to keep up. I hope my experience provides some insight into the framework.
