How to read gz files in Spark using wholeTextFiles


Problem description

I have a folder which contains many small .gz files (compressed csv text files). I need to read them in my Spark job, but the thing is I need to do some processing based on info which is in the file name. Therefore, I did not use:

JavaRDD<String> input = sc.textFile(...)

since to my understanding I do not have access to the file name this way. Instead, I used:

JavaPairRDD<String, String> files_and_content = sc.wholeTextFiles(...);

because this way I get a pair of file name and the content. However, it seems that this way, the input reader fails to read the text from the gz file, but rather reads binary gibberish.

So, I would like to know if I can set it to somehow read the text, or alternatively access the file name using sc.textFile(...).

Recommended answer

You cannot read gzipped files with wholeTextFiles: it uses CombineFileInputFormat, which cannot handle gzipped files because they are not splittable (source proving it):

  override def createRecordReader(
      split: InputSplit,
      context: TaskAttemptContext): RecordReader[String, String] = {

    new CombineFileRecordReader[String, String](
      split.asInstanceOf[CombineFileSplit],
      context,
      classOf[WholeTextFileRecordReader])
  }

You may be able to use newAPIHadoopFile with wholefileinputformat (not built into Hadoop, but all over the internet) to get this to work correctly.

UPDATE 1: I don't think WholeFileInputFormat will work, since it just gets the bytes of the file, meaning you may have to write your own class, possibly extending WholeFileInputFormat, to make sure it decompresses the bytes.
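
A rough sketch of what such a class might look like (the class name GzWholeFileInputFormat is made up for illustration, not from the answer; it assumes the new Hadoop mapreduce API and uses CompressionCodecFactory to pick the gzip codec from the .gz extension):

import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hypothetical format: one record per file, key = file path, value = decompressed text.
public class GzWholeFileInputFormat extends FileInputFormat<Text, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // gzip is not splittable, so each file is one split
    }

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new RecordReader<Text, Text>() {
            private final Text key = new Text();
            private final Text value = new Text();
            private boolean processed = false;
            private FileSplit fileSplit;
            private Configuration conf;

            @Override
            public void initialize(InputSplit s, TaskAttemptContext ctx) {
                fileSplit = (FileSplit) s;
                conf = ctx.getConfiguration();
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (processed) return false;
                Path path = fileSplit.getPath();
                FileSystem fs = path.getFileSystem(conf);
                // Let the codec factory pick GzipCodec from the .gz extension.
                CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);
                try (InputStream in = codec == null
                        ? fs.open(path) : codec.createInputStream(fs.open(path))) {
                    key.set(path.toString());
                    value.set(IOUtils.toByteArray(in)); // decompressed bytes of the whole file
                }
                processed = true;
                return true;
            }

            @Override public Text getCurrentKey() { return key; }
            @Override public Text getCurrentValue() { return value; }
            @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
            @Override public void close() { }
        };
    }
}

It could then be plugged in with something like sc.newAPIHadoopFile(dir, GzWholeFileInputFormat.class, Text.class, Text.class, new Configuration()).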

Another option is to use GZIPInputStream.
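
For instance, a small sketch of that manual route (the path is a placeholder; this just pipes the raw HDFS stream through java.util.zip.GZIPInputStream):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Path gzPath = new Path("hdfs:///some/dir/example.csv.gz"); // placeholder path
FileSystem fs = gzPath.getFileSystem(new Configuration());

// Decompress on the fly while reading line by line.
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new GZIPInputStream(fs.open(gzPath)), StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        // gzPath.getName() gives the file name to pair with each line
    }
}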

UPDATE 2: If you have access to the directory name like in the OP's comment below, you can get all the files like this:

import java.util.ArrayList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Path path = new Path(""); // the directory that holds the .gz files
FileSystem fileSystem = path.getFileSystem(new Configuration()); // just uses the default configuration
FileStatus[] fileStatuses = fileSystem.listStatus(path);
ArrayList<Path> paths = new ArrayList<>();
for (FileStatus fileStatus : fileStatuses) paths.add(fileStatus.getPath());
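
With that list of paths, one possible way to solve the original problem (a sketch, not part of the answer itself; it assumes sc is the question's JavaSparkContext and Java 8 lambdas) is to read each file separately with sc.textFile, which does handle gzip decompression, and tag every line with its file name:

import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

JavaPairRDD<String, String> linesByFile = null;
for (Path p : paths) {
    String name = p.getName(); // the file name the processing depends on
    JavaPairRDD<String, String> one =
            sc.textFile(p.toString()).mapToPair(line -> new Tuple2<>(name, line));
    linesByFile = (linesByFile == null) ? one : linesByFile.union(one);
}

Unioning one RDD per file works for a moderate number of files, but can get slow when there are thousands of small files.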
