Getting data in and out of Elastic MapReduce HDFS

Problem description

I've written a Hadoop program which requires a certain layout within HDFS, and afterwards I need to get the files back out of HDFS. It works on my single-node Hadoop setup, and I'm eager to get it working on tens of nodes within Elastic MapReduce.

What I've been doing is something like this:

./elastic-mapreduce --create --alive
JOBID="j-XXX" # output from creation
./elastic-mapreduce -j $JOBID --ssh "hadoop fs -cp s3://bucket-id/XXX /XXX"
./elastic-mapreduce -j $JOBID --jar s3://bucket-id/jars/hdeploy.jar --main-class com.ranjan.HadoopMain --arg /XXX

This is asynchronous, but when the job's completed, I can do this:

./elastic-mapreduce -j $JOBID --ssh "hadoop fs -cp /XXX s3://bucket-id/XXX-output"
./elastic-mapreduce -j $JOBID --terminate

So while this sort-of works, it's clunky and not what I'd like. Is there a cleaner way to do this?

Thanks!

Solution

You can use distcp, which will copy the files as a MapReduce job:

# download from s3
$ hadoop distcp s3://bucket/path/on/s3/ /target/path/on/hdfs/
# upload to s3
$ hadoop distcp /source/path/on/hdfs/ s3://bucket/path/on/s3/

This makes use of your entire cluster to copy in parallel from s3.

(note: the trailing slashes on each path are important to copy from directory to directory)
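
If the goal is to avoid the --ssh calls altogether, the copies can also be submitted as ordinary job flow steps, so copy-in, the main job, and copy-out run back to back on one job flow. The following is only a sketch and not part of the original answer: it swaps in Amazon's S3DistCp for plain distcp, and it assumes the classic elastic-mapreduce Ruby CLI accepts multiple --jar step definitions in a single --create call and that the s3distcp jar is still published at the S3 path shown; check both against your CLI version and region.

# hypothetical step-based version (see assumptions above): copy in from S3,
# run the main jar, then copy the results back out, all as chained steps
./elastic-mapreduce --create --name "hdeploy run" \
  --jar s3://elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
  --args '--src,s3://bucket-id/XXX,--dest,hdfs:///XXX' \
  --jar s3://bucket-id/jars/hdeploy.jar \
  --main-class com.ranjan.HadoopMain --arg /XXX \
  --jar s3://elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
  --args '--src,hdfs:///XXX,--dest,s3://bucket-id/XXX-output'

Since the job flow is not created with --alive, it terminates on its own once the last step finishes, so the separate --terminate call also goes away.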
