Spark - Read compressed files without file extension
Question
I have an S3 bucket that is filled with gzip files that have no file extension. For example s3://mybucket/1234502827-34231
sc.textFile uses that file extension to select the decoder. I have found many blog posts on handling custom file extensions, but nothing about missing file extensions.
I think the solution may be sc.binaryFiles and unzipping the file manually.
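A minimal sketch of that approach, assuming the objects are plain gzip streams (the bucket path is the example from the question; production code would want error handling for corrupt files):

import java.util.zip.GZIPInputStream
import scala.io.Source

// Read each S3 object as a raw byte stream, then gunzip it by hand.
// sc.binaryFiles yields (path, PortableDataStream) pairs.
val lines = sc.binaryFiles("s3://mybucket/")
  .flatMap { case (_, stream) =>
    Source.fromInputStream(new GZIPInputStream(stream.open()), "UTF-8").getLines()
  }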
Another possibility is to figure out how sc.textFile finds the file format. I'm not clear how these classOf[] calls work.
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
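As for the classOf[] calls: classOf[T] is simply Scala's version of Java's T.class; it just passes the Hadoop InputFormat and the key/value Writable types through to the Hadoop API. The extension-based decoder selection does not happen here at all. It happens later, inside Hadoop's record reader, which asks CompressionCodecFactory for a codec matching the file name's suffix; with no suffix, no codec matches and the file is read as uncompressed text. A sketch of what sc.textFile(path) expands to, using the example path from the question:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Roughly equivalent to sc.textFile("s3://mybucket/1234502827-34231")
val rdd = sc.hadoopFile("s3://mybucket/1234502827-34231",
    classOf[TextInputFormat],  // splits the file into lines
    classOf[LongWritable],     // key: byte offset of each line
    classOf[Text])             // value: the line contents
  .map(pair => pair._2.toString)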
Answer
Can you try combining the solution below for ZIP files with the gzipFileInputFormat library?
See here: How to open/stream .zip files through Spark? It shows how to do it for ZIP:
rdd1 = sc.newAPIHadoopFile("/Users/myname/data/compressed/target_file.ZIP",
    ZipFileInputFormat.class,
    Text.class,
    Text.class,
    new Job().getConfiguration());
gzipFileInputFormat:
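For the gzip case, the idea would be to swap in the input format from the gzipFileInputFormat library. The class name below is a placeholder, since the library link did not survive in the original post; in Scala the call might look like:

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Job

// GzipFileInputFormat is a hypothetical stand-in for whatever InputFormat
// class the gzipFileInputFormat library actually exposes.
val rdd = sc.newAPIHadoopFile(
  "s3://mybucket/1234502827-34231",
  classOf[GzipFileInputFormat],  // placeholder class name (assumption)
  classOf[Text],
  classOf[Text],
  new Job().getConfiguration)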
Some details about newAPIHadoopFile() can be found here: http://spark.apache.org/docs/latest/api/python/pyspark.html