Decompress all Gzip files in a Hadoop hdfs directory
Problem description
On my HDFS, I have a bunch of gzip files that I want to decompress to a normal format. Is there an API for doing this? Or how could I write a function to do this?
I don't want to use any command-line tools; instead, I want to accomplish this task by writing Java code.
Solution
You need a CompressionCodec (hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodec.html) to decompress the file. The implementation for gzip is GzipCodec. You get a CompressionInputStream via the codec and write the result out with simple IO. Something like this, say you have a file file.gz:
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// path of the file
String uri = "/uri/to/file.gz";
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
Path inputPath = new Path(uri);
CompressionCodecFactory factory = new CompressionCodecFactory(conf);
// the correct codec is discovered from the file's extension
CompressionCodec codec = factory.getCodec(inputPath);
if (codec == null) {
    System.err.println("No codec found for " + uri);
    System.exit(1);
}
// remove the .gz extension
String outputUri =
    CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
InputStream is = codec.createInputStream(fs.open(inputPath));
OutputStream out = fs.create(new Path(outputUri));
IOUtils.copyBytes(is, out, conf);
// close the streams
IOUtils.closeStream(is);
IOUtils.closeStream(out);
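The decompress-and-strip-suffix mechanics can be exercised without a Hadoop cluster. Here is a minimal sketch using plain java.util.zip.GZIPInputStream as a stand-in for the codec's stream, with a removeSuffix helper that hypothetically mirrors what CompressionCodecFactory.removeSuffix does for the default ".gz" extension:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {
    // Strip a compression suffix from a file name, if present
    static String removeSuffix(String name, String suffix) {
        return name.endsWith(suffix)
                ? name.substring(0, name.length() - suffix.length())
                : name;
    }

    // Decompress a gzip byte stream into raw bytes
    static byte[] gunzip(byte[] compressed) throws IOException {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        // Compress a sample payload, then decompress it again
        byte[] original = "hello hdfs".getBytes("UTF-8");
        ByteArrayOutputStream gz = new ByteArrayOutputStream();
        try (GZIPOutputStream gzOut = new GZIPOutputStream(gz)) {
            gzOut.write(original);
        }
        byte[] restored = gunzip(gz.toByteArray());
        System.out.println(new String(restored, "UTF-8"));   // prints "hello hdfs"
        System.out.println(removeSuffix("file.gz", ".gz"));  // prints "file"
    }
}
```

In the Hadoop version the codec does the equivalent of gunzip and the output goes back to HDFS via fs.create instead of a byte array.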
UPDATE
If you need to get all the files in a directory, then you should get the FileStatuses, like:
FileSystem fs = FileSystem.get(new Configuration());
FileStatus[] statuses = fs.listStatus(new Path("hdfs/path/to/dir"));
Then just loop:
for (FileStatus status : statuses) {
    CompressionCodec codec = factory.getCodec(status.getPath());
    ...
    InputStream is = codec.createInputStream(fs.open(status.getPath()));
    ...
}
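The same loop-over-a-directory pattern can be tried locally without a cluster. A sketch using java.nio.file and GZIPInputStream as assumed stand-ins for FileSystem.listStatus and the codec stream (GunzipDir and decompressAll are hypothetical names, not Hadoop API):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GunzipDir {
    // Decompress every *.gz file in dir, writing a sibling file without the suffix
    static void decompressAll(Path dir) throws IOException {
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.gz")) {
            for (Path gz : files) {
                String name = gz.getFileName().toString();
                Path out = gz.resolveSibling(name.substring(0, name.length() - 3));
                try (InputStream in = new GZIPInputStream(Files.newInputStream(gz))) {
                    Files.copy(in, out, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway directory with one gzipped file, then decompress it
        Path dir = Files.createTempDirectory("gunzip-demo");
        Path gz = dir.resolve("file.gz");
        try (GZIPOutputStream out = new GZIPOutputStream(Files.newOutputStream(gz))) {
            out.write("hello".getBytes("UTF-8"));
        }
        decompressAll(dir);
        System.out.println(new String(Files.readAllBytes(dir.resolve("file")), "UTF-8")); // prints "hello"
    }
}
```

On HDFS the listing comes from fs.listStatus and the per-file work is the codec-based copy shown in the answer; the structure of the loop is the same.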