How to unzip .gz files in a new directory in Hadoop?


Question

I have a bunch of .gz files in a folder in HDFS. I want to unzip all of them into a new folder in HDFS. How should I do this?

Solution

I can think of 3 different ways to achieve this.

  1. Using Linux command line

    The following command worked for me:

    hadoop fs -cat /tmp/Links.txt.gz | gzip -d | hadoop fs -put - /tmp/unzipped/Links.txt
    

    My gzipped file is Links.txt.gz.
    The output gets stored in /tmp/unzipped/Links.txt.
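To apply the same pipeline to every .gz file in a folder, the command can be wrapped in a loop. The following is a minimal sketch rather than a tested implementation: SRC_DIR and DST_DIR are hypothetical folder names, target_path and decompress_all are illustrative helper names, and the awk step assumes the HDFS paths contain no spaces.

```shell
#!/usr/bin/env bash
SRC_DIR=/tmp/gz_input   # hypothetical folder holding the .gz files
DST_DIR=/tmp/unzipped   # new folder for the decompressed files

# Build the target path for a .gz file inside a given folder, e.g.
# target_path /tmp/gz_input/Links.txt.gz /tmp/unzipped
# prints /tmp/unzipped/Links.txt
target_path() {
  printf '%s/%s\n' "$2" "$(basename "$1" .gz)"
}

# Decompress every .gz file in SRC_DIR into DST_DIR.
decompress_all() {
  hadoop fs -mkdir -p "$DST_DIR"
  # The last column of each "hadoop fs -ls" line is the full path;
  # grep drops the "Found N items" header and any non-.gz entries.
  hadoop fs -ls "$SRC_DIR" | awk '{print $NF}' | grep '\.gz$' |
  while read -r f; do
    # /dev/null keeps the first command from consuming the file list.
    hadoop fs -cat "$f" < /dev/null | gzip -d |
      hadoop fs -put - "$(target_path "$f" "$DST_DIR")"
  done
}
```

Calling `decompress_all` runs the loop. Note that `hadoop fs -put` refuses to overwrite a target that already exists, so a rerun needs an empty destination folder.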

  2. Using Java program

    The book Hadoop: The Definitive Guide has a section on codecs. That section includes a program that decompresses a file using CompressionCodecFactory. I am reproducing that code as is:

    package com.myorg.hadooptests;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URI;
    
    public class FileDecompressor {
        public static void main(String[] args) throws Exception {
            String uri = args[0];
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            Path inputPath = new Path(uri);
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);
            CompressionCodec codec = factory.getCodec(inputPath);
            if (codec == null) {
                System.err.println("No codec found for " + uri);
                System.exit(1);
            }
            String outputUri =
                CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
            InputStream in = null;
            OutputStream out = null;
            try {
                in = codec.createInputStream(fs.open(inputPath));
                out = fs.create(new Path(outputUri));
                IOUtils.copyBytes(in, out, conf);
            } finally {
                IOUtils.closeStream(in);
                IOUtils.closeStream(out);
            }
        }
    }
    

    This code takes the gz file path as input.
    You can execute this as:

    FileDecompressor <gzipped file name>
    

    For example, when I ran it on my gzipped file:

    FileDecompressor /tmp/Links.txt.gz
    

    I got the unzipped file at location: /tmp/Links.txt

    It stores the unzipped file in the same folder. So you need to modify this code to take 2 input parameters: <input file path> and <output folder>.

    Once you get this program working, you can write a Shell/Perl/Python script to call this program for each of the inputs you have.
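As an alternative to changing the Java code, a wrapper script can run the unmodified program and then move each result into the new folder. This is a sketch under stated assumptions: the compiled class is assumed to be reachable through HADOOP_CLASSPATH, DST_DIR is a hypothetical folder name, and decompressed_path and decompress_and_move are illustrative helper names.

```shell
#!/usr/bin/env bash
DST_DIR=/tmp/unzipped   # hypothetical output folder

# FileDecompressor writes next to its input, dropping the ".gz" suffix,
# e.g. /tmp/Links.txt.gz becomes /tmp/Links.txt
decompressed_path() {
  printf '%s\n' "${1%.gz}"
}

# Decompress each path given as an argument, then move the result.
decompress_and_move() {
  hadoop fs -mkdir -p "$DST_DIR"
  local f
  for f in "$@"; do
    # Assumes the class is on HADOOP_CLASSPATH; move only on success.
    hadoop com.myorg.hadooptests.FileDecompressor "$f" &&
      hadoop fs -mv "$(decompressed_path "$f")" "$DST_DIR/"
  done
}
```

For example, `decompress_and_move /tmp/Links.txt.gz` would leave the result at /tmp/unzipped/Links.txt.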

  3. Using Pig script

    You can write a simple Pig script to achieve this.

    I wrote the following script, which works:

    A = LOAD '/tmp/Links.txt.gz' USING PigStorage();
    STORE A INTO '/tmp/tmp_unzipped/' USING PigStorage();
    mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
    rm /tmp/tmp_unzipped/
    

    When you run this script, the unzipped contents are stored in a temporary folder, /tmp/tmp_unzipped. This folder will contain:

    /tmp/tmp_unzipped/_SUCCESS
    /tmp/tmp_unzipped/part-m-00000
    

    The part-m-00000 file contains the unzipped content.

    Hence, we need to explicitly rename it using the following commands and finally delete the /tmp/tmp_unzipped folder:

    mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
    rm /tmp/tmp_unzipped/
    

    So, if you use this Pig script, you just need to take care of parameterizing the file names (Links.txt.gz and Links.txt).

    Again, once you get this script working, you can write a Shell/Perl/Python script to call this Pig script for each of the inputs you have.
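One way to parameterize the file names is Pig's parameter substitution, where `$name` placeholders in the script are filled in with `-param name=value` on the command line. The sketch below assumes that substitution also applies to the mv and rm lines; unzip_one.pig, DST_DIR, TMP_DIR and the function names are hypothetical, chosen only for illustration.

```shell
#!/usr/bin/env bash
DST_DIR=/tmp/unzipped       # hypothetical output folder
TMP_DIR=/tmp/tmp_unzipped   # temporary folder used by STORE

# Write the Pig script with $input, $tmp and $output placeholders.
# The quoted 'EOF' stops the shell from expanding them.
write_pig_script() {
  cat > "$1" <<'EOF'
A = LOAD '$input' USING PigStorage();
STORE A INTO '$tmp' USING PigStorage();
mv $tmp/part-m-00000 $output
rm $tmp
EOF
}

# Run the script once for each .gz path given as an argument.
unzip_with_pig() {
  write_pig_script unzip_one.pig
  hadoop fs -mkdir -p "$DST_DIR"
  local f
  for f in "$@"; do
    pig -param input="$f" -param tmp="$TMP_DIR" \
        -param output="$DST_DIR/$(basename "$f" .gz)" unzip_one.pig
  done
}
```

For example, `unzip_with_pig /tmp/Links.txt.gz` passes /tmp/unzipped/Links.txt as the output parameter.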
