Parsing files from Amazon S3 with Apache Spark
Problem description
I am using Apache Spark and I have to parse files from Amazon S3. How can I determine the file extension while fetching files from an Amazon S3 path?
Recommended answer
I suggest following the Cloudera tutorial, Accessing Data Stored in Amazon S3 through Spark.
To access data stored in Amazon S3 from Spark applications, you can use the Hadoop file APIs (SparkContext.hadoopFile, JavaHadoopRDD.saveAsHadoopFile, SparkContext.newAPIHadoopRDD, and JavaHadoopRDD.saveAsNewAPIHadoopFile) for reading and writing RDDs, providing URLs of the form s3a://bucket_name/path/to/file.txt.
You can also read and write Spark SQL DataFrames using the Data Source API.
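As a minimal sketch of what this looks like in PySpark (the SparkSession calls are shown as comments because they require a running Spark installation with the S3A connector and AWS credentials configured; the bucket and key names are placeholders):

```python
def s3a_url(bucket, key):
    """Build an s3a:// URL of the form the Hadoop S3A connector expects."""
    return f"s3a://{bucket}/{key.lstrip('/')}"

url = s3a_url("bucket_name", "path/to/file.txt")  # 's3a://bucket_name/path/to/file.txt'

# With a running SparkSession (illustrative only, not executed here):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.getOrCreate()
#
# # Data Source API read: one line of text per DataFrame row
# df = spark.read.text(url)
#
# # Data Source API write: store a DataFrame back to S3 as Parquet
# df.write.parquet(s3a_url("bucket_name", "path/to/output"))
#
# # The same URL scheme works with the RDD-level Hadoop file APIs:
# rdd = spark.sparkContext.textFile(url)
```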
Regarding the file extension, there are a few solutions.
You could simply take the extension from the filename (e.g. file.txt).
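Extracting the extension from the object key is plain string handling, independent of Spark. A small sketch using only the Python standard library (the example URL is a placeholder):

```python
import os
from urllib.parse import urlsplit

def extension_of(s3_url):
    """Return the file extension of an S3 object URL, e.g. '.txt'."""
    key = urlsplit(s3_url).path          # -> '/path/to/file.txt'
    return os.path.splitext(key)[1].lower()

print(extension_of("s3a://bucket_name/path/to/file.txt"))  # '.txt'
```

Going through `urlsplit` first avoids mistaking a dot in the bucket name or query string for the start of an extension.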
If the extensions were stripped from the files stored in your S3 buckets, you can still determine the content type by looking at the metadata attached to each S3 object:
http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectHEAD.html
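A HEAD request returns the object's Content-Type, which you can map back to a likely extension. A sketch using the standard-library `mimetypes` module (the boto3 call is shown as a comment since it needs AWS credentials; bucket and key names are placeholders):

```python
import mimetypes

def extension_for_content_type(content_type):
    """Map an S3 object's Content-Type to a likely file extension."""
    return mimetypes.guess_extension(content_type)

# Fetching the Content-Type itself is a HEAD request, e.g. with boto3
# (illustrative; requires AWS credentials):
# import boto3
# s3 = boto3.client("s3")
# head = s3.head_object(Bucket="bucket_name", Key="path/to/file")
# ext = extension_for_content_type(head["ContentType"])

print(extension_for_content_type("application/json"))  # '.json'
```

Note that the mapping is only a guess: several extensions can share one MIME type, and S3 stores whatever Content-Type the uploader set, so this is a fallback rather than a guarantee.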