Deduce the HDFS path at runtime on EMR

Problem description

I have spawned an EMR cluster with EMR steps to copy a file from S3 to HDFS (and vice versa) using s3-dist-cp. The cluster is an on-demand cluster, so we are not keeping track of its IP address.

The first EMR step is: hadoop fs -mkdir /input - This step completed successfully.

The second EMR step is the following command:

s3-dist-cp --s3Endpoint=s3.amazonaws.com --src=s3://<bucket-name>/<folder-name>/sample.txt --dest=hdfs:///input - This step failed.

I am getting the following exception:

Error: java.lang.IllegalArgumentException: java.net.UnknownHostException: sample.txt
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
    at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:678)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2717)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2751)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2733)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:377)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at com.amazon.elasticmapreduce.s3distcp.CopyFilesReducer.reduce(CopyFilesReducer.java:213)
    at com.amazon.elasticmapreduce.s3distcp.CopyFilesReducer.reduce(CopyFilesReducer.java:28)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:635)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.net.UnknownHostException: sample.txt

But this file does exist on S3, and I can read it through my Spark application on EMR.
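
For context, since the cluster's IP is not tracked, work like this is normally submitted as EMR steps. Below is a minimal sketch of how the two steps above might be added from the AWS CLI; the cluster ID, step names, and ActionOnFailure values are hypothetical and not taken from the question (command-runner.jar is the standard way to run shell commands as EMR steps):

aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
  Type=CUSTOM_JAR,Name=MakeInputDir,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=["hadoop","fs","-mkdir","/input"] \
  Type=CUSTOM_JAR,Name=CopyFromS3,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=["s3-dist-cp","--s3Endpoint=s3.amazonaws.com","--src=s3://<bucket-name>/<folder-name>/sample.txt","--dest=hdfs:///input"]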

Recommended answer

The solution: when using s3-dist-cp, the filename should not be specified in either the source or the destination path; point both at directories. (Judging from the stack trace above, including the file name appears to result in sample.txt being parsed as the host part of an HDFS URI, hence the UnknownHostException.)

If you want to filter files in the src directory, you can use the --srcPattern option.

e.g.: s3-dist-cp --s3Endpoint=s3.amazonaws.com --src=s3://<bucket-name>/<folder-name>/ --dest=hdfs:///input/ --srcPattern=sample.txt.*
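
As a quick sanity check (not part of the original answer), the result of the copy can be listed with a plain Hadoop command, either as another EMR step or from the master node:

hadoop fs -ls hdfs:///input/

If sample.txt shows up in the listing, the --srcPattern regular expression matched the file as intended.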
