List an S3 folder on EMR

Question
I fail to understand how to simply list the contents of an S3 bucket on EMR during a Spark job. I wanted to do the following:
```java
Configuration conf = spark.sparkContext().hadoopConfiguration();
FileSystem s3 = S3FileSystem.get(conf);
List<LocatedFileStatus> list = toList(s3.listFiles(new Path("s3://mybucket"), false));
```
This always fails with the following error:

```
java.lang.IllegalArgumentException: Wrong FS: s3://*********/, expected: hdfs://**********.eu-central-1.compute.internal:8020
```
In the hadoopConfiguration, fs.defaultFS -> hdfs://**********.eu-central-1.compute.internal:8020.
The way I understand it, if I don't use a scheme (just /myfolder/myfile instead of, say, hdfs://myfolder/myfile), the path resolves against fs.defaultFS. But I would expect that if I specify s3://mybucket/, the fs.defaultFS should not matter.
How does one access the directory information? spark.read.parquet("s3://mybucket/*.parquet") works just fine, but for this task I need to check the existence of some files and would also like to delete some. I assumed org.apache.hadoop.fs.FileSystem would be the right tool.
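For context, the "Wrong FS" error typically means the FileSystem instance is bound to the default filesystem (HDFS on EMR) rather than to S3. A minimal sketch of binding the FileSystem to the bucket's own URI instead — the bucket name and file paths below are placeholders, and on EMR the s3:// scheme is served by EMRFS:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// inside the Spark job:
Configuration conf = spark.sparkContext().hadoopConfiguration();

// Bind the FileSystem to the bucket's URI rather than fs.defaultFS;
// new Path("s3://mybucket").getFileSystem(conf) is equivalent.
FileSystem s3 = FileSystem.get(URI.create("s3://mybucket"), conf);

// List files non-recursively
List<LocatedFileStatus> files = new ArrayList<>();
RemoteIterator<LocatedFileStatus> it = s3.listFiles(new Path("s3://mybucket"), false);
while (it.hasNext()) {
    files.add(it.next());
}

// Existence check and deletion on the same bound FileSystem
boolean exists = s3.exists(new Path("s3://mybucket/some/file.parquet"));
s3.delete(new Path("s3://mybucket/obsolete.parquet"), false); // non-recursive
```

The key point is that the parameterless FileSystem.get(conf) always returns whatever fs.defaultFS points to, which is why passing it an s3:// path fails.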
PS: I also don't understand how logging works. If I use deploy-mode cluster (I want to deploy jars from S3, which does not work in client mode), then I can only find my logs in s3://logbucket/j-.../containers/application.../container...0001, and there is quite a long delay before those show up in S3. How do I find them via ssh on the master? Or is there some faster/better way to check Spark application logs?

UPDATE: Just found them under /mnt/var/log/hadoop-yarn/containers, however they are owned by yarn:yarn and as the hadoop user I cannot read them. :( Ideas?
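Not from the original post, but one common way around the yarn:yarn ownership is to fetch the logs through YARN's own CLI instead of reading the container directories directly; the application id below is a placeholder for the one in your s3://logbucket path:

```shell
# Aggregated container logs for a finished or running application (placeholder id)
yarn logs -applicationId application_1234567890123_0001

# Or read the local container logs with the owning user's permissions
sudo -u yarn cat /mnt/var/log/hadoop-yarn/containers/application_1234567890123_0001/container_*/stdout
```

Both assume you are ssh'd into the master node as the hadoop user, which has sudo on EMR.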
Answer
In my case I needed to read a parquet file that was generated by prior EMR jobs. I was looking for the list of files under a given S3 prefix, but the nice thing is we don't need to do all that; we can simply do this: spark.read.parquet(bucket + prefix_directory)
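Put in context, a sketch of the answer's approach, assuming a SparkSession named spark; the bucket and prefix values are hypothetical stand-ins for the question's elided names:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

String bucket = "s3://mybucket/";        // placeholder bucket from the question
String prefixDirectory = "output/run1/"; // hypothetical prefix

// Spark lists every parquet file under the prefix itself;
// no manual FileSystem listing is needed just to read them.
Dataset<Row> df = spark.read().parquet(bucket + prefixDirectory);
```

This covers the reading case; for existence checks or deletes you still need a FileSystem bound to the s3:// URI rather than the default HDFS one.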