How to read gz files in Spark using wholeTextFiles

Question

I have a folder that contains many small .gz files (compressed CSV text files). I need to read them in my Spark job, but I also need to do some processing based on information in the file name. Therefore, I did not use:

JavaRDD<String> input = sc.textFile(...);

since, to my understanding, I do not have access to the file name this way. Instead, I used:

JavaPairRDD<String, String> files_and_content = sc.wholeTextFiles(...);

because this way I get pairs of file name and content. However, it seems that with this approach the input reader fails to read the text from the gz files and returns binary gibberish instead.

So, I would like to know whether I can configure it to read the text somehow, or alternatively whether I can access the file name when using sc.textFile(...).

Recommended answer

You cannot read gzipped files with wholeTextFiles because it uses CombineFileInputFormat, which cannot read gzipped files because they are not splittable (the source shows this):

  override def createRecordReader(
      split: InputSplit,
      context: TaskAttemptContext): RecordReader[String, String] = {

    new CombineFileRecordReader[String, String](
      split.asInstanceOf[CombineFileSplit],
      context,
      classOf[WholeTextFileRecordReader])
  }

You may be able to get this working by using newAPIHadoopFile with a WholeFileInputFormat (not built into Hadoop, but implementations are all over the internet).

UPDATE 1: I don't think WholeFileInputFormat will work as is, since it just returns the raw bytes of the file. You may have to write your own class, possibly extending WholeFileInputFormat, to make sure it decompresses the bytes.

Another option would be to decompress the bytes yourself using GZIPInputStream (see http://stackoverflow.com/questions/270268/how-to-decompress-a-gzipped-data-in-a-byte-array).
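As a rough illustration of decompressing the bytes by hand, here is a minimal sketch that uses only java.util.zip. The class name GzipBytes and both helper methods are hypothetical, and UTF-8 is an assumed encoding for the CSV text. The gzip helper only produces sample input, so gunzip is the part that would be applied to the raw bytes of each file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, not part of Spark or Hadoop.
public class GzipBytes {

    // Compresses text into gzip bytes; used here only to build sample input.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    // Decompresses gzip bytes and decodes the result as UTF-8 (assumed encoding).
    static String gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[8192];
            int n;
            // read() returns -1 once the compressed stream is exhausted
            while ((n = gz.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String csv = "id,value\n1,foo\n2,bar\n";
        byte[] packed = gzip(csv);
        // Round trip: the decompressed text should match the original
        System.out.println(gunzip(packed).equals(csv)); // prints true
    }
}
```

In a Spark job, a function like gunzip would be mapped over the byte content of each file, with the file name kept alongside it as the key.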

UPDATE 2: If you have access to the directory name, as in the OP's comment below, you can get all the files like this:

import java.util.ArrayList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

Path path = new Path(""); // the directory path goes here
FileSystem fileSystem = path.getFileSystem(new Configuration()); // just uses the default one
FileStatus[] fileStatuses = fileSystem.listStatus(path);
ArrayList<Path> paths = new ArrayList<>();
for (FileStatus fileStatus : fileStatuses) paths.add(fileStatus.getPath());
