List an S3 folder on EMR

Question
I fail to understand how to simply list the contents of an S3 bucket on EMR during a Spark job. I wanted to do the following:
```java
Configuration conf = spark.sparkContext().hadoopConfiguration();
FileSystem s3 = S3FileSystem.get(conf);
List<LocatedFileStatus> list = toList(s3.listFiles(new Path("s3://mybucket"), false));
```
This always fails with the following error:

```
java.lang.IllegalArgumentException: Wrong FS: s3://*********/, expected: hdfs://**********.eu-central-1.compute.internal:8020
```
In the hadoopConfiguration, fs.defaultFS -> hdfs://**********.eu-central-1.compute.internal:8020.
The way I understand it, if I don't use a scheme (just /myfolder/myfile instead of, say, hdfs://myfolder/myfile), the path resolves against fs.defaultFS. But I would expect that if I specify s3://mybucket/, the fs.defaultFS should not matter.
How does one access the directory information? spark.read.parquet("s3://mybucket/*.parquet") works just fine, but for this task I need to check the existence of some files and would also like to delete some. I assumed org.apache.hadoop.fs.FileSystem would be the right tool.
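For context, the "Wrong FS" error typically means the FileSystem instance is bound to the default filesystem (HDFS on EMR) rather than to S3. A minimal sketch of binding the FileSystem to the bucket's own URI instead — the bucket name and file paths below are placeholders, and on EMR the s3:// scheme is served by EMRFS:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// inside the Spark job:
Configuration conf = spark.sparkContext().hadoopConfiguration();

// Bind the FileSystem to the bucket's URI rather than fs.defaultFS;
// new Path("s3://mybucket").getFileSystem(conf) is equivalent.
FileSystem s3 = FileSystem.get(URI.create("s3://mybucket"), conf);

// List files non-recursively
List<LocatedFileStatus> files = new ArrayList<>();
RemoteIterator<LocatedFileStatus> it = s3.listFiles(new Path("s3://mybucket"), false);
while (it.hasNext()) {
    files.add(it.next());
}

// Existence check and deletion on the same bound FileSystem
boolean exists = s3.exists(new Path("s3://mybucket/some/file.parquet"));
s3.delete(new Path("s3://mybucket/obsolete.parquet"), false); // non-recursive
```

The key point is that the parameterless FileSystem.get(conf) always returns whatever fs.defaultFS points to, which is why passing it an s3:// path fails.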
PS: I also don't understand how logging works. If I use deploy-mode cluster (I want to deploy jars from S3, which does not work in client mode), then I can only find my logs in s3://logbucket/j-.../containers/application.../container...0001, and there is quite a long delay before those show up in S3. How do I find them via ssh on the master? Or is there some faster/better way to check Spark application logs?

UPDATE: Just found them under /mnt/var/log/hadoop-yarn/containers, however they are owned by yarn:yarn and as the hadoop user I cannot read them. :( Ideas?
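Not from the original post, but one common way around the yarn:yarn ownership is to fetch the logs through YARN's own CLI instead of reading the container directories directly; the application id below is a placeholder for the one in your s3://logbucket path:

```shell
# Aggregated container logs for a finished or running application (placeholder id)
yarn logs -applicationId application_1234567890123_0001

# Or read the local container logs with the owning user's permissions
sudo -u yarn cat /mnt/var/log/hadoop-yarn/containers/application_1234567890123_0001/container_*/stdout
```

Both assume you are ssh'd into the master node as the hadoop user, which has sudo on EMR.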
Answer
In my case I needed to read a parquet file that was generated by prior EMR jobs. I was looking for the list of files under a given S3 prefix, but the nice thing is we don't need to do all that; we can simply do this: spark.read.parquet(bucket + prefix_directory)
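Put in context, a sketch of the answer's approach, assuming a SparkSession named spark; the bucket and prefix values are hypothetical stand-ins for the question's elided names:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

String bucket = "s3://mybucket/";        // placeholder bucket from the question
String prefixDirectory = "output/run1/"; // hypothetical prefix

// Spark lists every parquet file under the prefix itself;
// no manual FileSystem listing is needed just to read them.
Dataset<Row> df = spark.read().parquet(bucket + prefixDirectory);
```

This covers the reading case; for existence checks or deletes you still need a FileSystem bound to the s3:// URI rather than the default HDFS one.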