List S3 folder on EMR


Problem description

I fail to understand how to simply list the contents of an S3 bucket on EMR during a Spark job. I wanted to do the following:

Configuration conf = spark.sparkContext().hadoopConfiguration();
FileSystem s3 = S3FileSystem.get(conf);
List<LocatedFileStatus> list = toList(s3.listFiles(new Path("s3://mybucket"), false));

This always fails with the following error

java.lang.IllegalArgumentException: Wrong FS: s3://*********/, expected: hdfs://**********.eu-central-1.compute.internal:8020

In the hadoopConfiguration: fs.defaultFS -> hdfs://**********.eu-central-1.compute.internal:8020

The way I understand it, if I don't use a protocol, just /myfolder/myfile instead of e.g. hdfs://myfolder/myfile, it will default to fs.defaultFS. But I would expect that if I specify s3://mybucket/, the fs.defaultFS should not matter.

How does one access the directory information? spark.read.parquet("s3://mybucket/*.parquet") works just fine but for this task I need to check the existence of some files and would also like to delete some. I assumed org.apache.hadoop.fs.FileSystem would be the correct tool.
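
For illustration, a minimal sketch of the kind of listing I am after, resolving the FileSystem from the path itself rather than from fs.defaultFS (mybucket is a placeholder, and this assumes the EMR-provided connector handles the s3:// scheme):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// Take the Hadoop configuration from the running Spark session
Configuration conf = spark.sparkContext().hadoopConfiguration();

// Resolve the FileSystem from the path, so the s3:// scheme is used
// instead of whatever fs.defaultFS points to (HDFS on EMR)
Path bucket = new Path("s3://mybucket/");
FileSystem s3 = bucket.getFileSystem(conf);

RemoteIterator<LocatedFileStatus> files = s3.listFiles(bucket, false);
while (files.hasNext()) {
    System.out.println(files.next().getPath());
}

// The same FileSystem can be used for existence checks and deletes, e.g.
// s3.exists(new Path("s3://mybucket/some.parquet")) and
// s3.delete(new Path("s3://mybucket/old.parquet"), false)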

PS: I also don't understand how logging works. If I use deploy-mode cluster (I want to deploy jars from S3, which does not work in client mode), then I can only find my logs in s3://logbucket/j-.../containers/application.../container...0001. There is quite a long delay before those show up in S3. How do I find them via ssh on the master? Or is there some faster/better way to check Spark application logs? UPDATE: Just found them under /mnt/var/log/hadoop-yarn/containers, however it is owned by yarn:yarn and as the hadoop user I cannot read it. :( Ideas?

Recommended answer

In my case I needed to read a parquet file that was generated by prior EMR jobs. I was looking for a list of files for a given S3 prefix, but the nice thing is we don't need to do all that; we can simply do this: spark.read.parquet(bucket + prefix_directory)
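
For illustration, a minimal sketch of that suggestion in the same Java API as the question (the bucket and prefix values are placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Spark resolves the s3:// prefix itself, so no explicit listing of the
// bucket contents is needed before reading
String bucket = "s3://mybucket/";
String prefixDirectory = "some/output/prefix/";   // hypothetical prefix

Dataset<Row> parquet = spark.read().parquet(bucket + prefixDirectory);
parquet.show();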
