How do I set up Airflow across 2 servers?

Problem Description

I'm trying to split the Airflow processes across 2 servers. Server A, which has already been running in standalone mode with everything on it, has the DAGs, and I'd like to set it up as the worker in the new setup with an additional server.

Server B is the new server which would host the metadata database on MySQL.

Can I have Server A run the LocalExecutor, or would I have to use the CeleryExecutor? Would the airflow scheduler have to run on the server that has the DAGs? Or does it have to run on every server in the cluster? I'm confused as to what dependencies there are between the processes.

Solution

This article does an excellent job demonstrating how to cluster Airflow onto multiple servers.

Multi-Node (Cluster) Airflow Setup

A more formal setup for Apache Airflow is to distribute the daemons across multiple machines as a cluster.

Benefits

Higher Availability

If one of the worker nodes were to go down or be purposely taken offline, the cluster would still be operational and tasks would still be executed.

Distributed Processing

If you have a workflow with several memory-intensive tasks, then the tasks will be better distributed to allow for higher utilization of data across the cluster and provide faster execution of the tasks.

Scaling Workers

Horizontally

You can scale the cluster horizontally and distribute the processing by adding more executor nodes to the cluster and allowing those new nodes to take load off the existing nodes. Since workers don’t need to register with any central authority to start processing tasks, machines can be turned on and off without any downtime to the cluster.

Vertically

You can scale the cluster vertically by increasing the number of celeryd daemons running on each node. This can be done by increasing the value in the ‘celeryd_concurrency’ config in the {AIRFLOW_HOME}/airflow.cfg file.

Example:

celeryd_concurrency = 30

You may need to increase the size of the instances in order to support a larger number of celeryd processes. This will depend on the memory and cpu intensity of the tasks you’re running on the cluster.
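To see how `celeryd_concurrency` fits alongside the other settings a cluster like this needs, here is a minimal sketch of the relevant `airflow.cfg` entries, read back with Python's stdlib `configparser`. The hostnames, credentials, and broker URL are placeholder assumptions, not values from the original answer, and the option names are the Airflow 1.x ones used above (newer releases renamed this option to `worker_concurrency`):

```python
import configparser

# Hypothetical airflow.cfg fragment for a CeleryExecutor cluster.
# "serverB" stands in for the host running MySQL and the message broker;
# all credentials are placeholders.
cfg_text = """
[core]
executor = CeleryExecutor
sql_alchemy_conn = mysql://airflow:airflow@serverB:3306/airflow

[celery]
broker_url = amqp://guest:guest@serverB:5672//
celeryd_concurrency = 30
"""

parser = configparser.ConfigParser()
parser.read_string(cfg_text)

# Each worker node would run this many worker processes.
concurrency = parser.getint("celery", "celeryd_concurrency")
print(concurrency)  # 30
```

Every node in the cluster (masters and workers alike) would share the same metadata database and broker settings, which is what lets the daemons coordinate without registering with each other.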

Scaling Master Nodes

You can also add more Master Nodes to your cluster to scale out the services that are running on the Master Nodes. This will mainly allow you to scale out the Web Server Daemon in case there are too many HTTP requests for one machine to handle, or if you want to provide Higher Availability for that service.

One thing to note is that there can only be one Scheduler instance running at a time. If you have multiple Schedulers running, there is a possibility that multiple instances of a single task will be scheduled. This could cause some major problems with your Workflow and cause duplicate data to show up in the final table if you were running some sort of ETL process.

If you would like, the Scheduler daemon may also be set up to run on its own dedicated Master Node.
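The single-scheduler constraint is a classic single-instance problem. As an illustration only (this is not Airflow's own mechanism; in this era it simply had no built-in guard, which is why running two schedulers is unsafe), a wrapper script could refuse to start a second scheduler on the same host by taking an exclusive file lock:

```python
import fcntl

# Hypothetical lock path; not a file Airflow itself uses.
LOCK_PATH = "/tmp/airflow-scheduler.lock"

def try_acquire_scheduler_lock(path=LOCK_PATH):
    """Return an open lock-file handle if no other instance holds the lock, else None."""
    fh = open(path, "w")
    try:
        # Non-blocking exclusive lock: fails immediately if already held.
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fh  # keep the handle open for the daemon's lifetime
    except BlockingIOError:
        fh.close()
        return None  # another instance already holds the lock

first = try_acquire_scheduler_lock()
second = try_acquire_scheduler_lock()
print(first is not None, second is None)  # True True
```

Note this only guards a single host; it does nothing to stop a second scheduler started on a different machine, which is why the safest layout is simply to run the Scheduler on exactly one node.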

Apache Airflow Cluster Setup Steps

Pre-Requisites

  • The following nodes are available with the given host names:
    • master1 - Will have the role(s): Web Server, Scheduler
    • master2 - Will have the role(s): Web Server
    • worker1 - Will have the role(s): Worker
    • worker2 - Will have the role(s): Worker
  • A Queuing Service is Running. (RabbitMQ, AWS SQS, etc)
    • You can install RabbitMQ by following these instructions: Installing RabbitMQ
    • If you’re using RabbitMQ, it is recommended that it also be set up as a cluster for High Availability. Set up a Load Balancer to proxy requests to the RabbitMQ instances.
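The node/role layout in the prerequisites maps directly onto which daemons each host starts. A small sketch of that mapping — the CLI commands shown are the Airflow 1.x ones implied by the answer (e.g. `airflow worker` rather than 2.x's `airflow celery worker`) and are listed as an assumption:

```python
# Daemons each host from the prerequisites list would run (Airflow 1.x CLI).
cluster = {
    "master1": ["airflow webserver", "airflow scheduler"],
    "master2": ["airflow webserver"],
    "worker1": ["airflow worker"],
    "worker2": ["airflow worker"],
}

# Per the note above, exactly one scheduler runs in the whole cluster.
schedulers = [host for host, cmds in cluster.items() if "airflow scheduler" in cmds]
workers = [host for host, cmds in cluster.items() if "airflow worker" in cmds]
print(schedulers, workers)  # ['master1'] ['worker1', 'worker2']
```

Adding a third worker is just another host running `airflow worker` with the same shared `airflow.cfg`; nothing else in the cluster needs to change.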

Additional Documentation
