How to use volume with Docker Operator from Apache Airflow
Question
I am developing an ETL process to be scheduled and orchestrated with Apache Airflow using the DockerOperator. I am working on a Windows laptop, so I can only run Apache Airflow from inside a Docker container. I was able to mount a folder on my Windows laptop with config files (called configs below) into the Airflow container (named webserver below) using a volume specified in the docker-compose.yml file residing in my project root directory. The relevant code from the docker-compose.yml file is shown below:
version: '2.1'
services:
  webserver:
    build: ./docker-airflow
    restart: always
    privileged: true
    depends_on:
      - mongo
      - mongo-express
    environment:
      - LOAD_EX=n
      - EXECUTOR=Local
    volumes:
      - ./docker-airflow/dags:/usr/local/airflow/dags
      # Volume for source code
      - ./src:/src
      - ./docker-airflow/workdir:/home/workdir
      # configs folder as volume
      - ./configs:/configs
      # Mount the docker socket from the host (currently my laptop) into the
      # webserver container so that it can create "sibling" containers
      - //var/run/docker.sock:/var/run/docker.sock # the two "//" are needed on Windows
    ports:
      - 8081:8080
    command: webserver
    healthcheck:
      test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
      interval: 30s
      timeout: 30s
      retries: 3
    networks:
      - mynet
Now I want to pass this configs folder with all its content on to the containers created by the DockerOperator. Although this configs folder was apparently mounted into the webserver container's file system, the configs folder is completely empty, and because of that my DAG fails. The code for the DockerOperator is as follows:
cmd = "--config_filepath {} --data_object_name {}".format("/configs/dev.ini", some_data_object)
staging_op = DockerOperator(
    command=cmd,
    task_id="my_task",
    image="{}/{}:{}".format(docker_hub_username, docker_hub_repo_name, image_name),
    api_version="auto",
    auto_remove=False,
    network_mode=docker_network,
    force_pull=True,
    volumes=["/configs:/configs"]  # "absolute_path_host:absolute_path_container"
)
According to the documentation, the left side of the volume must be an absolute path on the host, which (if I understood correctly) is the webserver container in this case (because it creates separate containers for every task). The right side of the volume is a directory inside the task's container, which is created by the DockerOperator. As mentioned above, the configs folder inside the task's container does exist, but it is completely empty. Does anyone know why this is the case and how to fix it?
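As an aside, the "absolute_path_host:absolute_path_container" convention can be made concrete with a small helper. This is a hypothetical sketch of my own (not part of Airflow or Docker) that splits such a volume string and checks that both sides are absolute paths:

```python
# Hypothetical sketch: split and sanity-check a "host_path:container_path"
# volume string like the ones passed to DockerOperator's volumes argument.
# Both sides must be absolute; Docker-Toolbox-style drive prefixes such
# as /c/... also start with "/" and therefore count as absolute here.
def parse_volume(spec):
    host, sep, container = spec.rpartition(":")
    if not sep or not host.startswith("/") or not container.startswith("/"):
        raise ValueError("invalid volume spec: {!r}".format(spec))
    return host, container

print(parse_volume("/configs:/configs"))
# -> ('/configs', '/configs')
```

Calling it with a relative host path such as "configs:/configs" raises a ValueError, which is essentially the constraint the documentation states.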
Thanks a lot for your help!
Answer
After implementing the suggestions from here, the volumes argument in the DockerOperator's constructor needs to be specified as follows:
cmd = "--config_filepath {} --data_object_name {}".format("/configs/dev.ini", some_data_object)
staging_op = DockerOperator(
    command=cmd,
    task_id="my_task",
    image="{}/{}:{}".format(docker_hub_username, docker_hub_repo_name, image_name),
    api_version="auto",
    auto_remove=False,
    network_mode=docker_network,
    force_pull=True,
    volumes=['/c/Users/kevin/dev/myproject/app/configs:/app/configs']  # "absolute_path_host:absolute_path_container"
)
Maybe the file paths need to look like that because Docker runs inside a VM on Windows?
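To illustrate that mapping, here is a hypothetical helper of my own (the function name is made up, not an Airflow or Docker API) that converts a Windows path into the /c/... form that the Docker Toolbox VM expects:

```python
# Hypothetical sketch: convert a Windows path like C:\Users\kevin\dev
# into the /c/Users/kevin/dev form used by the Docker Toolbox VM.
# PureWindowsPath is used so this also runs on non-Windows machines.
from pathlib import PureWindowsPath

def to_docker_vm_path(win_path):
    p = PureWindowsPath(win_path)
    drive = p.drive.rstrip(":").lower()  # "C:" -> "c"
    rest = "/".join(p.parts[1:])         # path components after the drive
    return "/{}/{}".format(drive, rest)

print(to_docker_vm_path(r"C:\Users\kevin\dev\myproject\app\configs"))
# -> /c/Users/kevin/dev/myproject/app/configs
```

The output matches the host-side path used in the volumes list above.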
As @sarnu also mentioned, it is important to understand that the host-side paths are paths on my Windows laptop, because the containers created for each task run in parallel with / are siblings of the airflow container.