Triggering Databricks job from Airflow without starting new cluster


Problem description

I am using Airflow to trigger jobs on Databricks. I have many DAGs running Databricks jobs, and I wish to use only one cluster instead of many, since to my understanding this will reduce the costs these tasks generate.

Using DatabricksSubmitRunOperator, there are two ways to run a job on Databricks: either use a running cluster, calling it by id

'existing_cluster_id' : '1234-567890-word123',

or start a new cluster

'new_cluster': {
    'spark_version': '2.1.0-db3-scala2.11',
    'num_workers': 2
  },
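
For reference, here is a minimal sketch of what the two variants look like as Airflow tasks; the DAG name, dates, and notebook path are placeholders, and the import path assumes the apache-airflow-providers-databricks package (older installs used airflow.contrib.operators.databricks_operator instead):

from datetime import datetime
from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG("databricks_submit_example", start_date=datetime(2023, 1, 1),
         schedule_interval=None, catchup=False) as dag:

    # Variant 1: reuse an existing interactive cluster by id (it must be running).
    run_on_existing = DatabricksSubmitRunOperator(
        task_id="run_on_existing_cluster",
        existing_cluster_id="1234-567890-word123",
        notebook_task={"notebook_path": "/Shared/my_notebook"},
    )

    # Variant 2: spin up a fresh job cluster just for this run.
    run_on_new = DatabricksSubmitRunOperator(
        task_id="run_on_new_cluster",
        new_cluster={"spark_version": "2.1.0-db3-scala2.11", "num_workers": 2},
        notebook_task={"notebook_path": "/Shared/my_notebook"},
    )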

Now I would like to avoid starting a new cluster for each task; however, the cluster shuts down during downtime, so it will no longer be reachable through its id and I will get an error. So the only option, in my view, is a new cluster.

1) Is there a way to have a cluster be callable by id even when it is down?

2) Do people simply keep their clusters alive?

3) Or am I completely wrong, and starting a cluster for each task won't generate more costs?

4) Is there something I missed completely?

Recommended answer

Updated based on @YannickSSE's comment responses

I don't use Databricks; but could you start a new cluster with the same id as the cluster you may or may not want to be running, and have it be a no-op when that cluster is already running? Maybe not, or you probably wouldn't be asking. (Response: no, you cannot supply an id when starting a new cluster.)

Could you write a Python or bash operator that tests for the existence of the cluster? (Response: that would amount to a test job submission… not the best approach.) If it finds the cluster and succeeds, the downstream task would trigger your job with the existing cluster id; if it doesn't, another downstream task could use the trigger_rule all_failed to do the same task but with a new cluster. Then both of those DatabricksSubmitRunOperator tasks could have one downstream task with the trigger_rule one_success. (Response: or use a branching operator to determine which operator is executed.)
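
For illustration, here is one way the branching variant could be sketched. The host and token placeholders, task ids, and notebook path are hypothetical; the check simply calls the Databricks clusters/get REST endpoint, and EmptyOperator would be DummyOperator on older Airflow versions:

import requests
from datetime import datetime
from airflow import DAG
from airflow.operators.python import BranchPythonOperator
from airflow.operators.empty import EmptyOperator
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

EXISTING_CLUSTER_ID = "1234-567890-word123"

def choose_cluster(**_):
    # Ask the Databricks clusters/get endpoint whether the shared cluster is RUNNING.
    resp = requests.get(
        "https://<databricks-host>/api/2.0/clusters/get",   # placeholder host
        headers={"Authorization": "Bearer <token>"},         # placeholder token
        params={"cluster_id": EXISTING_CLUSTER_ID},
    )
    state = resp.json().get("state")
    return "run_on_existing_cluster" if state == "RUNNING" else "run_on_new_cluster"

with DAG("databricks_branch_example", start_date=datetime(2023, 1, 1),
         schedule_interval=None, catchup=False) as dag:

    check_cluster = BranchPythonOperator(
        task_id="check_cluster",
        python_callable=choose_cluster,
    )

    run_on_existing_cluster = DatabricksSubmitRunOperator(
        task_id="run_on_existing_cluster",
        existing_cluster_id=EXISTING_CLUSTER_ID,
        notebook_task={"notebook_path": "/Shared/my_notebook"},
    )

    run_on_new_cluster = DatabricksSubmitRunOperator(
        task_id="run_on_new_cluster",
        new_cluster={"spark_version": "2.1.0-db3-scala2.11", "num_workers": 2},
        notebook_task={"notebook_path": "/Shared/my_notebook"},
    )

    # Runs after whichever branch executed, thanks to one_success.
    done = EmptyOperator(task_id="done", trigger_rule="one_success")

    check_cluster >> [run_on_existing_cluster, run_on_new_cluster] >> done

Whichever branch runs, the final task still fires because of trigger_rule="one_success", so downstream work is unaffected by which cluster was used.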

It might not be ideal, because I imagine your cluster id changes from time to time, which means you have to keep up with it. … Is the cluster part of the Databricks hook's connection for that operator, and something that can be updated? Maybe you want to specify it in the tasks that need it as {{ var.value.<identifying>_cluster_id }} and keep it updated as an Airflow Variable. (Response: the cluster id is not in the hook, so the variable or the DAG file would have to be updated whenever it changes.)
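
As a small sketch of that variable approach (the variable name shared_cluster_id and the notebook path are made up), the operator's json argument is a templated field, so the Jinja expression is rendered on each run:

from datetime import datetime
from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG("databricks_var_example", start_date=datetime(2023, 1, 1),
         schedule_interval=None, catchup=False) as dag:

    run_job = DatabricksSubmitRunOperator(
        task_id="run_on_shared_cluster",
        json={
            # Resolved from the Airflow Variable "shared_cluster_id" at run time.
            "existing_cluster_id": "{{ var.value.shared_cluster_id }}",
            "notebook_task": {"notebook_path": "/Shared/my_notebook"},
        },
    )

Whenever the shared cluster is recreated, you would update the variable once (for example with airflow variables set shared_cluster_id <new-id>, or in the UI) instead of editing every DAG file.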
