Apache Airflow: Executor reports task instance finished (failed) although the task says its queued


Question

Our Airflow installation is using the CeleryExecutor. The concurrency configs were:

# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 16

# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16

# Are DAGs paused by default at creation
dags_are_paused_at_creation = True

# When not using pools, tasks are run in the "default pool",
# whose size is guided by this config element
non_pooled_task_slot_count = 64

# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16

[celery]
# This section only applies if you are using the CeleryExecutor in
# [core] section above

# The app name that will be used by celery
celery_app_name = airflow.executors.celery_executor

# The concurrency that will be used when starting workers with the
# "airflow worker" command. This defines the number of task instances that
# a worker will take, so size up your workers based on the resources on
# your worker box and the nature of your tasks
celeryd_concurrency = 16

We have a dag that executes daily. It runs a number of tasks in parallel, each following the same pattern: sense whether the data exists in HDFS (sleeping 10 minutes between checks), and finally upload to S3.
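For illustration, a minimal sketch of such a DAG might look like the following (assuming Airflow 1.10-era imports; the task IDs, HDFS path, and S3 bucket are hypothetical, not taken from the actual DAG in question):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.hooks.S3_hook import S3Hook
from airflow.operators.python_operator import PythonOperator
from airflow.sensors.hdfs_sensor import HdfsSensor

default_args = {
    "owner": "airflow",
    "retries": 3,                        # retries are what mask the failure below
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    "example_dag",
    default_args=default_args,
    start_date=datetime(2019, 5, 1),
    schedule_interval="@daily",
)

# Poke HDFS every 10 minutes until the day's partition lands.
sense_hdfs = HdfsSensor(
    task_id="sense_hdfs_data",
    filepath="/data/events/{{ ds }}",    # hypothetical path, templated per run
    poke_interval=600,                   # the "sleep 10 mins" between checks
    timeout=6 * 60 * 60,                 # give up after 6 hours
    dag=dag,
)

def upload_to_s3(**context):
    # Hypothetical upload step: push the day's file to S3.
    hook = S3Hook(aws_conn_id="aws_default")
    hook.load_file(
        filename="/tmp/events_{}.csv".format(context["ds"]),
        key="events/{}.csv".format(context["ds"]),
        bucket_name="my-bucket",         # hypothetical bucket
    )

upload = PythonOperator(
    task_id="upload_to_s3",
    python_callable=upload_to_s3,
    provide_context=True,
    dag=dag,
)

sense_hdfs >> upload

Note that the sensor's poke_interval provides the 10-minute sleep between checks, so no separate sleep task is needed.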

Some of the tasks have been encountering the following error:

2019-05-12 00:00:46,212 ERROR - Executor reports task instance <TaskInstance: example_dag.task1 2019-05-11 04:00:00+00:00 [queued]> finished (failed) although the task says its queued. Was the task killed externally?
2019-05-12 00:00:46,558 INFO - Marking task as UP_FOR_RETRY
2019-05-12 00:00:46,561 WARNING - section/key [smtp/smtp_user] not found in config

This kind of error occurs randomly in those tasks. When it happens, the state of the task instance is immediately set to up_for_retry, and there are no logs on the worker nodes. After some retries, they execute and eventually finish.

This problem sometimes causes a large ETL delay for us. Does anyone know how to solve it?

Answer

We fixed this already. Let me answer my own question:

We have 5 Airflow worker nodes. After installing Flower to monitor the tasks distributed to these nodes, we found that the failing task was always sent to one specific node. We tried using the airflow test command to run the task on other nodes, and it worked. Eventually, the cause was a broken Python package on that specific node.
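For reference, the debugging steps above correspond roughly to the following commands (Airflow 1.x CLI; the task instance comes from the log above, and the file names are hypothetical):

# Start Flower to watch which worker each task is routed to.
airflow flower

# Re-run the suspect task instance in isolation on a given node.
airflow test example_dag task1 2019-05-11T04:00:00

# Compare installed packages between a healthy node and the failing one.
pip freeze | sort > /tmp/packages_$(hostname).txt
diff /tmp/packages_good-node.txt /tmp/packages_bad-node.txt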

