How to have lzo compression in hadoop mapreduce?
I want to use LZO to compress map output, but I can't get it to run. The version of Hadoop I'm using is 0.20.2. I set:
conf.set("mapred.compress.map.output", "true")
conf.set("mapred.map.output.compression.codec",
"org.apache.hadoop.io.compress.LzoCodec");
When I run the jar file in Hadoop, it throws an exception saying it can't write the map output.
Do I have to install lzo? What do I have to do to use lzo?
LZO's licence (GPL) is incompatible with that of Hadoop (Apache) and therefore it cannot be bundled with it. One needs to install LZO separately on the cluster.
The following steps are tested on Cloudera's Demo VM (CentOS 6.2, x64) that comes with full stack of CDH 4.2.0 and CM Free Edition installed, but they should work on any Linux based on Red Hat.
The installation consists of the following steps:
Installing LZO
sudo yum install lzop
sudo yum install lzo-devel
Installing ANT
sudo yum install ant ant-nodeps ant-junit java-devel
Downloading the source
git clone https://github.com/twitter/hadoop-lzo.git
Compiling Hadoop-LZO
ant compile-native tar
For further instructions and troubleshooting see https://github.com/twitter/hadoop-lzo
Copying the Hadoop-LZO jar to the Hadoop libs
sudo cp build/hadoop-lzo*.jar /usr/lib/hadoop/lib/
Moving native code to Hadoop native libs
sudo mv build/hadoop-lzo-0.4.17-SNAPSHOT/lib/native/Linux-amd64-64/ /usr/lib/hadoop/lib/native/
cp /usr/lib/hadoop/lib/native/Linux-amd64-64/libgplcompression.* /usr/lib/hadoop/lib/native/
Adjust the version number to match the version you cloned.
When working with a real cluster (as opposed to a pseudo-cluster) you need to rsync these to the rest of the machines
rsync /usr/lib/hadoop/lib/
to all hosts. You can dry-run this first with -n.
Login to Cloudera Manager
Select from Services: mapreduce1->Configuration
Client->Compression
Add to Compression Codecs:
com.hadoop.compression.lzo.LzoCodec
com.hadoop.compression.lzo.LzopCodec
Search "valve"
Add to MapReduce Service Configuration Safety Valve
io.compression.codec.lzo.class=com.hadoop.compression.lzo.LzoCodec
mapred.child.env="JAVA_LIBRARY_PATH=/usr/lib/hadoop/lib/native/Linux-amd64-64/"
Add to MapReduce Service Environment Safety Valve
HADOOP_CLASSPATH=/usr/lib/hadoop/lib/*
That's it.
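With the codec installed, enabling LZO for intermediate map output comes down to the same two properties the question sets, but pointing at the hadoop-lzo class: the org.apache.hadoop.io.compress.LzoCodec named in the question is not shipped with Apache Hadoop, for the licensing reason above. A minimal configuration sketch against the 0.20-era JobConf API (not runnable without a Hadoop install on the classpath):

```java
import org.apache.hadoop.mapred.JobConf;

public class LzoMapOutputConf {
    // Sketch: enable LZO compression of intermediate map output.
    // Assumes hadoop-lzo*.jar and the native libs are installed as above.
    public static JobConf configure(JobConf conf) {
        conf.set("mapred.compress.map.output", "true");
        // Note the com.hadoop... class from hadoop-lzo, not the
        // org.apache.hadoop.io.compress.LzoCodec used in the question.
        conf.set("mapred.map.output.compression.codec",
                 "com.hadoop.compression.lzo.LzoCodec");
        return conf;
    }
}
```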
Your MapReduce jobs that use TextInputFormat
should work seamlessly with .lzo
files. However, if you choose to index the LZO files to make them splittable (using com.hadoop.compression.lzo.DistributedLzoIndexer
), you will find that the indexer writes a .index
file next to each .lzo
file. This is a problem because your TextInputFormat
will interpret these as part of the input. In this case you need to change your MR jobs to work with LzoTextInputFormat
.
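A sketch of that change, using the class names as found in the twitter/hadoop-lzo repository (new mapreduce API; for the old mapred API the repository provides com.hadoop.mapred.DeprecatedLzoTextInputFormat instead — check the names against your build):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import com.hadoop.mapreduce.LzoTextInputFormat;

public class LzoJobSetup {
    // Sketch: once .lzo files are indexed, swap TextInputFormat for
    // LzoTextInputFormat so the .index files are not treated as input
    // and the .lzo files become splittable.
    public static void configure(Job job, String inputDir) throws Exception {
        job.setInputFormatClass(LzoTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(inputDir));
    }
}
```

The indexer itself is typically launched as a jar job over an HDFS path, e.g. hadoop jar /usr/lib/hadoop/lib/hadoop-lzo*.jar com.hadoop.compression.lzo.DistributedLzoIndexer followed by the directory to index (the exact jar path depends on where you copied it above).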
As for Hive, as long as you don't index the LZO files, the change is also transparent. If you start indexing (to take advantage of better data distribution) you will need to update the input format to LzoTextInputFormat
. If you use partitions, you can do it per partition.
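A HiveQL sketch of that per-table or per-partition switch; the table and partition names are placeholders, and the Deprecated... class is used because Hive works against the old mapred API (verify both class names against your hadoop-lzo build and Hive version):

```sql
-- Point an existing table at the LZO input format after indexing.
ALTER TABLE logs
  SET FILEFORMAT
  INPUTFORMAT  'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

-- Or do it for a single partition only:
ALTER TABLE logs PARTITION (dt = '2013-01-01')
  SET FILEFORMAT
  INPUTFORMAT  'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
```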