Problem with copying local data onto HDFS on a Hadoop cluster using Amazon EC2/S3
Question
I have set up a Hadoop cluster containing 5 nodes on Amazon EC2. Now, when I log in to the master node and submit the following command:
bin/hadoop jar <program>.jar <arg1> <arg2> <path/to/input/file/on/S3>
it throws one of the following errors (not both at the same time). The first error is thrown when I don't replace the slashes in my secret key with '%2F', and the second when I do:
1) java.lang.IllegalArgumentException: Invalid hostname in URI s3://<ID>:<SECRETKEY>@<BUCKET>/<path-to-inputfile>
2) org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 PUT failed for '/' XML Error Message: The request signature we calculated does not match the signature you provided. Check your key and signing method.
Note:
1) When I submitted jps to see what tasks were running on the master, it showed only:
1116 NameNode
1699 Jps
1180 JobTracker
leaving out DataNode and TaskTracker.
2) My secret key contains two '/' (forward slashes), which I replace with '%2F' in the S3 URI.
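As a side note on the encoding: percent-encoding the secret key by hand is error-prone, since characters other than '/' (a '+', for instance) also need escaping. A minimal sketch using Python's standard `urllib.parse.quote`, with a made-up key rather than a real credential, shows what a fully encoded key looks like:

```python
from urllib.parse import quote

# Hypothetical secret key (NOT a real credential) containing '/' and '+'.
secret_key = "abc/def/ghi+jkl"

# safe="" forces '/' (and '+') to be percent-encoded as well;
# by default quote() would leave '/' untouched.
encoded = quote(secret_key, safe="")
print(encoded)  # abc%2Fdef%2Fghi%2Bjkl
```

If the encoded key still fails signature validation, that usually points to the key being mangled somewhere else in the URI, which is why passing credentials via configuration properties is generally safer than embedding them in the URI.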
PS: The program runs fine on EC2 when run on a single node. It's only when I launch a cluster that I run into issues related to copying data between S3 and HDFS. Also, what does distcp do? Do I need to distribute the data even after copying it from S3 to HDFS? (I thought HDFS took care of that internally.)
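On the distcp question: distcp runs a MapReduce job that copies files in parallel, and once data is in HDFS you don't need to distribute it again, since HDFS replicates blocks across the cluster itself. A hedged sketch of pulling the input from S3 into HDFS with distcp, passing the credentials as `-D` configuration properties so the secret key never has to be percent-encoded inside the URI (`<ID>`, `<SECRETKEY>`, `<BUCKET>`, and the paths are placeholders; this assumes the jets3t-based s3n filesystem of that Hadoop era):

```shell
# Copy input data from S3 into HDFS in parallel with distcp.
# Credentials are passed as configuration properties instead of being
# embedded in the URI, avoiding the '%2F' escaping problem entirely.
bin/hadoop distcp \
  -D fs.s3n.awsAccessKeyId=<ID> \
  -D fs.s3n.awsSecretAccessKey=<SECRETKEY> \
  s3n://<BUCKET>/path/to/input \
  hdfs:///user/hadoop/input
```

The same properties can be set once in conf/core-site.xml instead of on every command line. This is only a sketch and needs a live cluster to run.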
If you could direct me to a link that explains how to run Map/Reduce programs on a Hadoop cluster using Amazon EC2/S3, that would be great.
Regards,
Deepak.
Answer
You can also use Apache Whirr for this workflow. Check the Quick Start Guide and the 5-minute guide for more info.
Disclaimer: I'm one of the committers.