Read a compressed file *with custom extension* with Spark


Question

I want to read gzip-compressed files into an RDD[String] using an equivalent of sc.textFile("path/to/file.Z").

Except my file extension is not gz but Z, so the file is not recognised as being gzipped.

I cannot rename them as that would break production code. I do not want to copy them as they are massive and numerous. I guess I could use some kind of symlink, but I want to see first if there is a way to do it with Scala/Spark (I am on my local Windows machine for now).

How can I read this file efficiently?

Answer

There is a workaround for this problem described at http://arjon.es/2015/10/02/reading-compressed-data-with-spark-using-unknown-file-extensions/

The relevant part:

...extend GzipCodec and override the getDefaultExtension method.

package smx.ananke.spark.util.codecs

import org.apache.hadoop.io.compress.GzipCodec

class TmpGzipCodec extends GzipCodec {

  override def getDefaultExtension(): String = ".gz.tmp" // You should change it to ".Z"

}
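
For the files in the question, a minimal sketch of the same idea (the class name ZAsGzipCodec is only an illustrative choice, and it assumes the .Z files really contain gzip data, as the question states):

package smx.ananke.spark.util.codecs

import org.apache.hadoop.io.compress.GzipCodec

// Makes Hadoop treat files ending in ".Z" as plain gzip streams.
class ZAsGzipCodec extends GzipCodec {

  override def getDefaultExtension(): String = ".Z"

}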

Now we just need to register this codec by setting spark.hadoop.io.compression.codecs on the SparkConf:

val conf = new SparkConf()

// Custom codec that processes .gz.tmp extensions as the common gzip format
conf.set("spark.hadoop.io.compression.codecs", "smx.ananke.spark.util.codecs.TmpGzipCodec")

val sc = new SparkContext(conf)

val data = sc.textFile("s3n://my-data-bucket/2015/09/21/13/*")
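
Put together for the .Z case in the question, a minimal sketch (the local Windows path and app name are only examples; it assumes the ZAsGzipCodec variant shown above is on the classpath):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("read-z-files") // example app name
  .setMaster("local[*]")      // the question mentions a local Windows machine
  // Register the custom codec so Hadoop's codec factory matches the ".Z" extension
  .set("spark.hadoop.io.compression.codecs", "smx.ananke.spark.util.codecs.ZAsGzipCodec")

val sc = new SparkContext(conf)

// Files are decompressed transparently; each element of the RDD is one line of text
val data = sc.textFile("file:///C:/data/*.Z")
println(data.count())

Because getDefaultExtension only drives how Hadoop's CompressionCodecFactory matches file names, all of the actual gzip handling is inherited unchanged from GzipCodec.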
