Spark - Read compressed files without file extension


Problem description

I have an S3 bucket that is filled with Gz files that have no file extension. For example s3://mybucket/1234502827-34231

sc.textFile uses the file extension to select the decoder. I have found many blog posts on handling custom file extensions, but nothing about missing file extensions.

I think the solution may be sc.binaryFiles and unzipping the file manually.

Another possibility is to figure out how sc.textFile finds the file format. I'm not clear on how these classOf[] calls work.

  def textFile(
      path: String,
      minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
    assertNotStopped()
    hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
      minPartitions).map(pair => pair._2.toString).setName(path)
  }
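
As far as I can tell from the Hadoop sources, the extension lookup does not happen in textFile itself: the TextInputFormat passed in via classOf[TextInputFormat] asks a CompressionCodecFactory for a codec matching the path suffix. A small illustration of that behaviour (my own sketch, assuming a default Hadoop configuration):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.Path
  import org.apache.hadoop.io.compress.CompressionCodecFactory

  // The factory maps registered extensions (.gz, .bz2, ...) to codecs.
  val factory = new CompressionCodecFactory(new Configuration())
  factory.getCodec(new Path("s3://mybucket/data.gz"))          // GzipCodec
  factory.getCodec(new Path("s3://mybucket/1234502827-34231")) // null: read as plain text

So a file without an extension gets no codec and is read as uncompressed text, which is why the extension matters here.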

Recommended answer

Can you try combining the solution below for ZIP files with the gzipFileInputFormat library?

Here, in How to open/stream .zip files through Spark?, you can see how to do it using ZIP:

  rdd1 = sc.newAPIHadoopFile(
      "/Users/myname/data/compressed/target_file.ZIP",
      ZipFileInputFormat.class,
      Text.class,
      Text.class,
      new Job().getConfiguration());

gzipFileInputFormat:

Some details about newAPIHadoopFile() can be found here: http://spark.apache.org/docs/latest/api/python/pyspark.html
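
For what it's worth, here is a minimal Scala sketch of what such a call might look like once a gzip-aware input format is on the classpath. GzipFileInputFormat is a hypothetical stand-in for whatever class the library above actually provides, and its Text/Text key and value types are an assumption:

  import org.apache.hadoop.io.Text

  // Hypothetical: GzipFileInputFormat is a placeholder for the input format
  // class supplied by the gzipFileInputFormat library; swap in the real name.
  val rdd = sc.newAPIHadoopFile(
    "s3://mybucket/1234502827-34231",
    classOf[GzipFileInputFormat],
    classOf[Text],
    classOf[Text],
    sc.hadoopConfiguration)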
