S3 Delete & HDFS to S3 Copy
Question
As a part of my Spark pipeline, I have to perform the following tasks on EMR / S3:
- Delete: (Recursively) delete all files / directories under a given S3 bucket
- Copy: Copy the contents of a directory (subdirectories & files) to a given S3 bucket
Based on my current knowledge, Airflow doesn't provide operators / hooks for these tasks. I therefore plan to implement them as follows:
- Delete: Extend S3Hook to add a function that performs aws s3 rm on the specified S3 bucket (a sketch follows this list)
- Copy: Use SSHExecuteOperator to perform hadoop distcp
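For the Delete item, a minimal sketch of the extended hook might look like the following. It assumes the boto3-backed S3Hook of Airflow 1.10 (the 1.9 hook is built on the older boto library, so get_conn() behaves differently there); the class name S3DeleteHook and the method delete_prefix are illustrative, not part of Airflow.

```python
from airflow.hooks.S3_hook import S3Hook


class S3DeleteHook(S3Hook):
    """S3Hook extended with a recursive delete, mirroring `aws s3 rm --recursive`."""

    def delete_prefix(self, bucket_name, prefix=''):
        # Assumes Airflow 1.10, where get_conn() returns a boto3 S3 client
        client = self.get_conn()
        paginator = client.get_paginator('list_objects_v2')
        for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
            keys = [{'Key': obj['Key']} for obj in page.get('Contents', [])]
            if keys:
                # delete_objects accepts at most 1000 keys per call, which is
                # exactly one page of list_objects_v2 results
                client.delete_objects(Bucket=bucket_name,
                                      Delete={'Objects': keys})
```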
My questions are:
- I reckon that the tasks I intend to perform are quite primitive. Are these functionalities already provided by Airflow?
- If not, is there a better way to achieve this than what I plan to do?
I'm using:
- Airflow 1.9.0 [Python 3.6.6] (will upgrade to Airflow 1.10 once it is released)
- EMR 5.13.0
Answer
Well, the delete is a primitive operation, yes, but hadoop distcp is not. To answer your questions:
- No, Airflow does not have functions on the S3 hook to perform these operations out of the box.
- In my opinion, extending the s3_hook by creating your own plugin, and using the SSH operator to execute distcp, is a good way to do it (see the sketch after this list).
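As an illustration of that approach, a hedged sketch of the distcp task is below, using Airflow 1.9's contrib SSHExecuteOperator. The connection id emr_master_ssh, the HDFS/S3 paths, and the DAG settings are made up for the example.

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.hooks.ssh_hook import SSHHook
from airflow.contrib.operators.ssh_execute_operator import SSHExecuteOperator

dag = DAG(
    dag_id='hdfs_to_s3_copy',
    start_date=datetime(2018, 1, 1),
    schedule_interval=None,
)

copy_to_s3 = SSHExecuteOperator(
    task_id='hdfs_to_s3_distcp',
    # 'emr_master_ssh' is a hypothetical Airflow connection to the EMR master
    ssh_hook=SSHHook(conn_id='emr_master_ssh'),
    # distcp recursively copies the directory tree (subdirectories & files);
    # both paths here are illustrative
    bash_command='hadoop distcp hdfs:///data/output/ s3://my-bucket/output/',
    dag=dag,
)
```

On EMR specifically, s3-dist-cp is an alternative to plain distcp that is tuned for S3 as a destination, so it may be worth considering for the same task.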
Not sure why the standard S3_Hook does not have a delete function. It MAY be because S3 provides an "eventually consistent" consistency model (probably not the reason, but good to keep in mind anyway).