How to unzip .gz files in a new directory in hadoop?

Question

I have a bunch of .gz files in a folder in HDFS. I want to unzip all of these .gz files to a new folder in HDFS. How should I do this?

Recommended answer

I can think of 3 different ways to achieve this.

1. Using the Linux command line

The following command worked for me.

hadoop fs -cat /tmp/Links.txt.gz | gzip -d | hadoop fs -put - /tmp/unzipped/Links.txt

My gzipped file is Links.txt.gz, and the output gets stored in /tmp/unzipped/Links.txt.
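
Since the question asks about a whole folder of .gz files, the same pipeline can be driven from a shell loop. Below is a minimal sketch, assuming a hypothetical input folder /tmp/gzipped; adjust the paths for your cluster.

#!/bin/bash
# Minimal sketch: apply the cat | gzip -d | put pipeline to every .gz file
# in an HDFS folder. /tmp/gzipped is a hypothetical input path.
SRC=/tmp/gzipped
DST=/tmp/unzipped
hadoop fs -mkdir -p "$DST"
for path in $(hadoop fs -ls "$SRC"/*.gz | awk '{print $NF}' | grep '\.gz$'); do
    name=$(basename "$path" .gz)   # e.g. Links.txt.gz -> Links.txt
    hadoop fs -cat "$path" | gzip -d | hadoop fs -put - "$DST/$name"
done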

2. Using a Java program

In the book Hadoop: The Definitive Guide, there is a section on Codecs. In that section, there is a program that decompresses the output using CompressionCodecFactory. I am reproducing that code as is:

package com.myorg.hadooptests;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

public class FileDecompressor {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path inputPath = new Path(uri);

        // Infer the codec (e.g. GzipCodec) from the file extension.
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
            System.err.println("No codec found for " + uri);
            System.exit(1);
        }

        // Strip the codec's extension (.gz) to get the output path.
        String outputUri =
                CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());

        InputStream in = null;
        OutputStream out = null;
        try {
            // Wrap the HDFS input stream so reads are decompressed on the fly.
            in = codec.createInputStream(fs.open(inputPath));
            out = fs.create(new Path(outputUri));
            IOUtils.copyBytes(in, out, conf);
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}

This code takes the gz file path as input. You can execute it as:

FileDecompressor <gzipped file name>

For example, when I executed it for my gzipped file:

FileDecompressor /tmp/Links.txt.gz

I got the unzipped file at the location /tmp/Links.txt.
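
For reference, a main class like this is normally launched through the hadoop jar command, which sets up the Hadoop classpath. A minimal sketch, assuming the class has been packaged into a jar named hadooptests.jar (a hypothetical name):

# hadooptests.jar is a hypothetical jar containing FileDecompressor.
hadoop jar hadooptests.jar com.myorg.hadooptests.FileDecompressor /tmp/Links.txt.gz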

It stores the unzipped file in the same folder, so you need to modify this code to take 2 input parameters: <input file path> and <output folder>.

Once you get this program working, you can write a Shell/Perl/Python script to call this program for each of the inputs you have.
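
As a minimal sketch of such a wrapper, assuming the modified two-parameter version of FileDecompressor described above, the hypothetical jar name hadooptests.jar, and a hypothetical input folder /tmp/gzipped:

#!/bin/bash
# Minimal sketch: run the modified FileDecompressor (two parameters:
# <input file path> and <output folder>) once per .gz file.
SRC=/tmp/gzipped       # hypothetical input folder
DST=/tmp/unzipped      # output folder
for path in $(hadoop fs -ls "$SRC"/*.gz | awk '{print $NF}' | grep '\.gz$'); do
    hadoop jar hadooptests.jar com.myorg.hadooptests.FileDecompressor "$path" "$DST"
done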

3. Using a Pig script

You can write a simple Pig script to achieve this.

I wrote the following script, which works:

A = LOAD '/tmp/Links.txt.gz' USING PigStorage();
STORE A INTO '/tmp/tmp_unzipped/' USING PigStorage();
mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
rm /tmp/tmp_unzipped/

When you run this script, the unzipped contents are stored in a temporary folder, /tmp/tmp_unzipped. This folder will contain:

/tmp/tmp_unzipped/_SUCCESS
/tmp/tmp_unzipped/part-m-00000

part-m-00000 contains the unzipped file.

Hence, we need to explicitly rename it using the following commands and finally delete the /tmp/tmp_unzipped folder:

mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
rm /tmp/tmp_unzipped/

So, if you use this Pig script, you just need to take care of parameterizing the file names (Links.txt.gz and Links.txt).

Again, once you get this script working, you can write a Shell/Perl/Python script to call this Pig script for each of the inputs you have.
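
As a minimal sketch of such a wrapper, Pig's -param substitution can supply the file names. Here unzip.pig and /tmp/gzipped are hypothetical names; the temporary and output folders are the ones used above.

#!/bin/bash
# Minimal sketch: unzip.pig is a hypothetical parameterized version of the
# script above, containing:
#   A = LOAD '$INPUT' USING PigStorage();
#   STORE A INTO '$TMPDIR' USING PigStorage();
for path in $(hadoop fs -ls /tmp/gzipped/*.gz | awk '{print $NF}' | grep '\.gz$'); do
    name=$(basename "$path" .gz)
    pig -param INPUT="$path" -param TMPDIR=/tmp/tmp_unzipped unzip.pig
    hadoop fs -mv /tmp/tmp_unzipped/part-m-00000 "/tmp/unzipped/$name"
    hadoop fs -rm -r /tmp/tmp_unzipped
done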
