Airflow Scheduler out of memory problems


Problem Description


We are experimenting with Apache Airflow (version 1.10rc2, with Python 2.7), deploying it to Kubernetes with the webserver and scheduler in different pods and the database on Cloud SQL, but we have been facing out-of-memory problems with the scheduler pod.

At the moment of the OOM, we were running only 4 example DAGs (approximately 20 tasks). The memory for the pod is 1 GiB. I've seen in other posts that a task might consume approximately 50 MiB of memory when running, and that all task operations happen in memory with nothing flushed to disk, so that alone would already account for 1 GB.

Is there any rule of thumb we can use to calculate how much memory we would need for the scheduler based on parallel tasks?

Is there any tuning, apart from decreasing the parallelism, that could be done to reduce the memory use of the scheduler itself?

I don't think our use case would require Dask or Celery to horizontally scale Airflow with more machines for the workers.

Just a few more details about the configuration (a sketch of how these map onto airflow.cfg sections follows the list):

executor = LocalExecutor
parallelism = 10
dag_concurrency = 5
max_active_runs_per_dag = 2
workers = 1
worker_concurrency = 16
min_file_process_interval = 1
min_file_parsing_loop_time = 5
dag_dir_list_interval = 30
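
For reference, here is a sketch of how these options would typically be grouped in airflow.cfg for Airflow 1.10; the section placement is an assumption based on the default config, and with the LocalExecutor the [celery] worker settings are effectively unused:

[core]
executor = LocalExecutor
parallelism = 10
dag_concurrency = 5
max_active_runs_per_dag = 2

[scheduler]
min_file_process_interval = 1
min_file_parsing_loop_time = 5
dag_dir_list_interval = 30

[webserver]
# assuming "workers" refers to the webserver's gunicorn workers
workers = 1

[celery]
# not used by the LocalExecutor
worker_concurrency = 16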

The DAGs running at the time were example_bash_operator, example_branch_operator, example_python_operator, and one quickDag we have developed.

All of them have just simple tasks/operators like DummyOperators, BranchOperators, and BashOperators that only do echo or sleep, and PythonOperators that only sleep as well. In total there are approximately 40 tasks, but not all of them run in parallel because some of them are downstream, have dependencies, and so on; our parallelism is set to 10, with just a single worker as described above, and dag_concurrency is set to 5.

I can't see anything abnormal in the Airflow logs, nor in the task logs.

Running just one of these DAGs on its own, Airflow seems to work as expected.

I can see a lot of scheduler processes in the scheduler pod, each one using 0.2% of memory or more:

   PID USER     PR NI   VIRT    RES   SHR S %CPU %MEM     TIME+ COMMAND
461384 airflow  20  0 836700 127212 23908 S 36.5  0.4   0:01.19 /usr/bin/python /usr/bin/airflow scheduler
461397 airflow  20  0 356168  86320  5044 R 14.0  0.3   0:00.42 /usr/bin/python /usr/bin/airflow scheduler
    44 airflow  20  0 335920  71700 10600 S 28.9  0.2 403:32.05 /usr/bin/python /usr/bin/airflow scheduler
    56 airflow  20  0 330548  59164  3524 S  0.0  0.2   0:00.02

And this is one of the running tasks, using 0.3% of memory:

   PID USER     PR NI   VIRT    RES   SHR S %CPU %MEM   TIME+ COMMAND
462042 airflow  20  0 282632  91120 10544 S  1.7  0.3 0:02.66 /usr/bin/python /usr/bin/airflow run example_bash_operator runme_1 2018-08-29T07:39:48.193735+00:00 --local -sd /usr/lib/python2.7/site-packages/apache_airflow-1.10.0-py2.7.egg/airflow/example_dags/example_bash_operator.py

Solution

There isn't really a concise rule of thumb to follow because it can vary so much based on your workflow.

As you've seen, the scheduler will create several forked processes. Also, every task (except Dummy) will run in its own process. Depending on the operator and the data it's processing, the amount of memory needed per task can vary wildly.

The parallelism setting will directly limit how many tasks are running simultaneously across all DAG runs/tasks, which has the most dramatic effect for you since you're using the LocalExecutor. You can also try setting max_threads under [scheduler] to 1.
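
When Airflow runs in Kubernetes, one way to try this kind of tuning without rebuilding the image is Airflow's AIRFLOW__<SECTION>__<KEY> environment-variable convention, e.g. in the scheduler pod spec (a sketch; the values are just the ones discussed here):

# equivalent to [scheduler] max_threads = 1 and [core] parallelism = 10 in airflow.cfg
AIRFLOW__SCHEDULER__MAX_THREADS=1
AIRFLOW__CORE__PARALLELISM=10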

So a (very) general rule of thumb, being generous with resources, would be:

[256 MB for the scheduler itself] + ( [parallelism] * (100 MB + [size of data you'll process]) )

Where the size of data will need to change depending on whether you load a full dataset or process chunks of it over the course of the task's execution.
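
Purely as an illustration, that rule of thumb can be written as a quick back-of-the-envelope calculation; the function name here is made up, and the 256 MB / 100 MB constants are the ones from the formula above:

def estimate_scheduler_pod_memory_mb(parallelism, data_mb_per_task,
                                     scheduler_overhead_mb=256,
                                     per_task_overhead_mb=100):
    """Rough memory estimate (in MB) from the rule of thumb above."""
    return scheduler_overhead_mb + parallelism * (per_task_overhead_mb + data_mb_per_task)

# With parallelism = 10 from the question and tasks that only echo/sleep
# (essentially no data loaded per task):
print(estimate_scheduler_pod_memory_mb(parallelism=10, data_mb_per_task=0))  # 1256

Even with no task data, that estimate already exceeds the 1 GiB pod described in the question.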

Even if you don't think you'll need to scale your cluster, I would still recommend using the CeleryExecutor, if only to isolate the scheduler and tasks from each other. That way, if your scheduler or a Celery worker dies, it doesn't take the other down with it. Especially when running in k8s, if your scheduler receives a SIGTERM it will be killed along with any running tasks. If you run them in different pods and the scheduler pod restarts, your tasks can finish uninterrupted. And if you have more workers, it lessens the impact of memory/processing spikes from other tasks.
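
A minimal sketch of the configuration change that recommendation implies, assuming a Redis broker; the connection URLs are placeholders, not values from this deployment:

[core]
executor = CeleryExecutor

[celery]
# placeholder URLs -- point these at your own broker and result backend
broker_url = redis://my-redis:6379/0
result_backend = db+postgresql://airflow:password@my-cloudsql-proxy:5432/airflow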
