Decompress all Gzip files in a Hadoop HDFS directory


Problem Description

On my HDFS, I have a bunch of gzip files that I want to decompress to their normal, uncompressed form. Is there an API for doing this? Or how could I write a function to do this?

I don't want to use any command-line tools; instead, I want to accomplish this task by writing Java code.

Solution

You need a CompressionCodec (see hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodec.html) to decompress the file. The implementation for gzip is GzipCodec. The codec gives you a CompressionInputStream, and you write the result out with simple IO. Something like this: say you have a file file.gz

// imports needed for this snippet
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// path of the file to decompress
String uri = "/uri/to/file.gz";
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
Path inputPath = new Path(uri);

CompressionCodecFactory factory = new CompressionCodecFactory(conf);
// the correct codec is discovered from the extension of the file
CompressionCodec codec = factory.getCodec(inputPath);

if (codec == null) {
    System.err.println("No codec found for " + uri);
    System.exit(1);
}

// remove the .gz extension to build the output path
String outputUri =
    CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());

InputStream is = codec.createInputStream(fs.open(inputPath));
OutputStream out = fs.create(new Path(outputUri));
// copy the bytes; this overload closes both streams when it finishes
IOUtils.copyBytes(is, out, conf);
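Note that the three-argument IOUtils.copyBytes overload used above closes both streams for you. If you would rather manage the streams yourself, a variant along these lines (a sketch using the four-argument overload with close = false) keeps the cleanup in a finally block:

InputStream is = codec.createInputStream(fs.open(inputPath));
OutputStream out = fs.create(new Path(outputUri));
try {
    // copy with a 4 KB buffer and leave the streams open
    IOUtils.copyBytes(is, out, 4096, false);
} finally {
    // closeStream swallows IOExceptions, so it is safe in a finally block
    IOUtils.closeStream(is);
    IOUtils.closeStream(out);
}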


UPDATE

If you need to get all the files in a directory, then you should get the FileStatuses, like

FileSystem fs = FileSystem.get(new Configuration());
FileStatus[] statuses = fs.listStatus(new Path("hdfs/path/to/dir"));

Then just loop

for (FileStatus status : statuses) {
    CompressionCodec codec = factory.getCodec(status.getPath());
    ...
    InputStream is = codec.createInputStream(fs.open(status.getPath()));
    ...
}
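Putting the two pieces together, here is a minimal end-to-end sketch (the HdfsGunzip class name and the use of args[0] for the directory path are illustrative, not part of the original answer). It lists the directory, skips anything the factory does not recognize as compressed, and writes each file back without its codec extension:

import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class HdfsGunzip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // args[0] is the HDFS directory containing the gzip files
        for (FileStatus status : fs.listStatus(new Path(args[0]))) {
            Path inputPath = status.getPath();
            CompressionCodec codec = factory.getCodec(inputPath);
            if (codec == null) {
                continue; // not a compressed file; skip it
            }
            // strip the codec's extension (.gz for GzipCodec) for the output
            String outputUri = CompressionCodecFactory.removeSuffix(
                    inputPath.toString(), codec.getDefaultExtension());
            InputStream is = codec.createInputStream(fs.open(inputPath));
            OutputStream out = fs.create(new Path(outputUri));
            // this overload closes both streams when the copy completes
            IOUtils.copyBytes(is, out, conf);
        }
    }
}

Keep in mind that listStatus does not recurse, so files in subdirectories would need their own pass.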
