S3 Delete & HDFS to S3 Copy


Question

As a part of my Spark pipeline, I have to perform the following tasks on EMR / S3:

  1. Delete: (Recursively) delete all files / directories under a given S3 bucket
  2. Copy: Copy the contents of a directory (subdirectories & files) to a given S3 bucket


Based on my current knowledge, Airflow doesn't provide operators / hooks for these tasks. I therefore plan to implement them as follows:

  1. Delete: Extend S3Hook to add a function that performs aws s3 rm on the specified S3 bucket (a sketch follows this list)
  2. Copy: Use SSHExecuteOperator to perform hadoop distcp
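
For the Delete part, a minimal sketch of such an extension is shown below. It assumes Airflow 1.9's S3Hook import path and that boto3 and AWS credentials are resolvable on the worker (e.g. via an EMR instance role); the class name S3DeleteHook and the method delete_prefix are hypothetical, not part of Airflow.

import boto3

from airflow.hooks.S3_hook import S3Hook  # Airflow 1.9 import path


class S3DeleteHook(S3Hook):
    """Hypothetical S3Hook subclass that mimics `aws s3 rm --recursive`."""

    def delete_prefix(self, bucket_name, prefix=""):
        # Batch-delete every key under the prefix; boto3 bucket collections
        # page through the listing and issue DeleteObjects calls in chunks,
        # so this works regardless of how many keys there are.
        bucket = boto3.resource("s3").Bucket(bucket_name)
        bucket.objects.filter(Prefix=prefix).delete()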


My questions are:

  • I reckon that the tasks I intend to perform are quite primitive. Are these functionalities already provided by Airflow?
  • If not, is there a better way to achieve this than what I plan to do?

I am using:

  • Airflow 1.9.0 [Python 3.6.6] (will upgrade to Airflow 1.10 once it is released)
  • EMR 5.13.0

Answer

Well, the delete is a primitive operation, yes, but hadoop distcp is not. To answer your questions:

  1. No, Airflow does not have functions on the S3 hook that perform these operations.
  2. In my opinion, extending the s3_hook by creating your own plugin, and using the SSH operator to perform the distcp, is a good way to go (a sketch follows this list).
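
A hedged sketch of the distcp task is below. Note that in Airflow 1.9 the contrib SSH operator is SSHOperator (it replaced the older SSHExecuteOperator the question names); the connection id emr_master_ssh, the paths, and the enclosing dag object are placeholders for illustration.

from airflow.contrib.operators.ssh_operator import SSHOperator

# `emr_master_ssh` is a hypothetical Airflow connection that points at the
# EMR master node; source and destination paths are placeholders.
hdfs_to_s3 = SSHOperator(
    task_id="hdfs_to_s3_distcp",
    ssh_conn_id="emr_master_ssh",
    command="hadoop distcp /data/output s3://my-bucket/output/",
    dag=dag,  # the surrounding DAG object
)

Running distcp on the cluster itself keeps the HDFS-to-S3 transfer off the Airflow worker, which is the main reason to reach for SSH here rather than a local bash command.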

Not sure why the standard S3_Hook does not have a delete function. It MAY be because S3 provides an "eventually consistent" consistency model (probably not the reason, but good to keep in mind anyway).
