Problems using distcp and s3distcp with my EMR job that outputs to HDFS


Problem description

I've run a job on AWS's EMR and stored the output in the EMR job's HDFS. I am then trying to copy the result to S3 via distcp or s3distcp, but both are failing as described below. (Note: the reason I'm not just sending my EMR job's output directly to S3 is the (currently unresolved) problem I describe in "Where is my AWS EMR reducer output for my completed job (should be on S3, but nothing there)?".)

For distcp, I run (following this post's recommendation):

elastic-mapreduce --jobflow <MY-JOB-ID> --jar \
s3://elasticmapreduce/samples/distcp/distcp.jar \
    --args -overwrite \
    --args hdfs:///output/myJobOutput,s3n://output/myJobOutput \
    --step-name "Distcp output to s3"

In the error log (/mnt/var/log/hadoop/steps/8), I get:

With failures, global counters are inaccurate; consider running with -i
Copy failed: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: <SOME-REQUEST-ID>, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: <SOME-EXT-REQUEST-ID>
        at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:548)
        at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:288)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:170)
...

For s3distcp, I run (following the s3distcp documentation):

elastic-mapreduce --jobflow <MY-JOB-ID> --jar \
s3://us-east-1.elasticmapreduce/libs/s3distcp/1.0.4/s3distcp.jar \
--args '--src,/output/myJobOutput,--dest,s3n://output/myJobOutput'

In the error log (/mnt/var/log/hadoop/steps/9), I get:

java.lang.RuntimeException: Reducer task failed to copy 1 files: hdfs://10.116.203.7:9000/output/myJobOutput/part-00000 etc
        at com.amazon.elasticmapreduce.s3distcp.CopyFilesReducer.close(Unknown Source)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:537)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:428)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)      

Any ideas what I'm doing wrong?

Update: Someone responding on the AWS Forums to a post about a similar distcp error mentions IAM user permissions, but I don't know what this means (edit: I haven't created any IAM users, so it is using the defaults); hopefully this helps pinpoint my problem.
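
As a side check of the permissions angle (a diagnostic sketch rather than one of the steps I actually ran; my-output-bucket is a placeholder for a bucket I own), the S3 credentials the cluster is configured with can be exercised from the master node using the same s3n filesystem that distcp/s3distcp use:

hadoop fs -ls s3n://my-output-bucket/

If this also comes back with a 403/Forbidden, the problem lies with the credentials or the bucket's permissions rather than with the copy tools themselves.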

Update 2: I noticed this error in the namenode log file (when re-running s3distcp). I'm going to look into the default EMR permissions to see if that is my problem:

2012-06-24 21:57:21,326 WARN org.apache.hadoop.security.ShellBasedUnixGroupsMapping (IPC Server handler 40 on 9000): got exception trying to get groups for user job_201206242009_0005
org.apache.hadoop.util.Shell$ExitCodeException: id: job_201206242009_0005: No such user

    at org.apache.hadoop.util.Shell.runCommand(Shell.java:255)
    at org.apache.hadoop.util.Shell.run(Shell.java:182)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
    at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getUnixGroups(ShellBasedUnixGroupsMapping.java:68)
    at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getGroups(ShellBasedUnixGroupsMapping.java:45)
    at org.apache.hadoop.security.Groups.getGroups(Groups.java:79)
    at org.apache.hadoop.security.UserGroupInformation.getGroupNames(UserGroupInformation.java:966)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.<init>(FSPermissionChecker.java:50)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5160)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkTraverse(FSNamesystem.java:5143)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:1992)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.getFileInfo(NameNode.java:837)
    ...
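
As a quick way to check whether HDFS-side permissions on the job output are involved (a diagnostic sketch using the same path as the commands above, not something from the original steps), the ownership and mode of the source directory can be listed on the master node:

hadoop fs -ls /output
hadoop fs -ls /output/myJobOutput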

Update 3: I contacted AWS Support, and they didn't see a problem, so I am now waiting to hear back from their engineering team. I will post back as I hear more.

Recommended answer

I'm not 100% positive, but after reviewing my commands above, I noticed that my destination on S3 does NOT specify a bucket name. This appears to simply be a case of rookie-ism.
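
For reference, the corrected s3distcp step would look like the following, with an actual bucket in the destination URI (a sketch only; my-output-bucket is a placeholder for a bucket I own):

elastic-mapreduce --jobflow <MY-JOB-ID> --jar \
s3://us-east-1.elasticmapreduce/libs/s3distcp/1.0.4/s3distcp.jar \
--args '--src,hdfs:///output/myJobOutput,--dest,s3n://my-output-bucket/myJobOutput'

The original --dest,s3n://output/myJobOutput is treated as a bucket literally named "output", which I don't own, which would explain the 403/Forbidden responses above.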
