How to enable Snappy/SnappyCodec on a Hadoop cluster for Google Compute Engine


Problem description

I am trying to run a Hadoop job on Google Compute Engine against our compressed data, which is sitting on Google Cloud Storage. While trying to read the data through SequenceFileInputFormat, I get the following exception:

hadoop@hadoop-m:/home/salikeeno$ hadoop jar ${JAR} ${PROJECT} ${OUTPUT_TABLE}
14/08/21 19:56:00 INFO jaws.JawsApp: Using export bucket 'askbuckerthroughhadoop' as specified in 'mapred.bq.gcs.bucket'
14/08/21 19:56:00 INFO bigquery.BigQueryConfiguration: Using specified project-id 'regal-campaign-641' for output
14/08/21 19:56:00 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.2.8-hadoop1
14/08/21 19:56:01 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/08/21 19:56:03 INFO input.FileInputFormat: Total input paths to process : 1
14/08/21 19:56:09 INFO mapred.JobClient: Running job: job_201408211943_0002
14/08/21 19:56:10 INFO mapred.JobClient:  map 0% reduce 0%
14/08/21 19:56:20 INFO mapred.JobClient: Task Id : attempt_201408211943_0002_m_000001_0, Status : FAILED
java.lang.RuntimeException: native snappy library not available
        at org.apache.hadoop.io.compress.SnappyCodec.getDecompressorType(SnappyCodec.java:189)
        at org.apache.hadoop.io.compress.CodecPool.getDecompressor(CodecPool.java:125)
        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1581)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1490)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1479)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1474)
        at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.initialize(SequenceFileRecordReader.java:50)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:521)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)

  1. It seems that the SnappyCodec is not available. How should I include/enable Snappy in my Hadoop cluster on Google Compute Engine?
  2. Can I deploy the Snappy lib (if I have to) through the bdutil script while deploying a Hadoop cluster?
  3. What is the best approach to deploy third-party libs/jars on a Hadoop cluster deployed on Google Compute Engine?

Thanks a lot

Solution

This procedure is no longer required.

A bdutil deployment will contain Snappy by default.
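
On a running Hadoop 2 cluster you can verify this yourself (a quick check, assuming the checknative command that ships with Hadoop 2.x; Hadoop 1 does not have it):

hadoop checknative -a

Snappy should be reported as true, along with the path of the libsnappy that was loaded.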

For reference, the original answer:

Your last question is the easiest to answer in the general case, so I'll begin there. The general guidance for shipping dependencies is that applications should make use of the distributed cache to distribute JARs and libraries to the workers (Hadoop 1 or 2). If your code is already making use of the GenericOptionsParser, you can distribute JARs with the -libjars flag. A longer discussion, which also covers fat JARs, can be found on Cloudera's blog: http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
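
For illustration, a hypothetical submission using -libjars could look like the following (the JAR names, class name, and paths are placeholders, not taken from the question):

# Ship extra JARs to the workers via the distributed cache
hadoop jar myjob.jar com.example.MyJob -libjars /local/libs/snappy-java-1.0.5.jar,/local/libs/extra-dep.jar <job args>

Note that -libjars is only honored when the driver parses its arguments with GenericOptionsParser (typically via ToolRunner), which is exactly what the WARN line in the job output above is pointing out.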

For installing and configuring other system-level components, bdutil supports an extension mechanism. A good example of an extension is the Spark extension bundled with bdutil: extensions/spark/spark_env.sh. When running bdutil, extensions are added with the -e flag, e.g., to deploy Spark alongside Hadoop:

./bdutil -e extensions/spark/spark_env.sh deploy    

With regard to your first and second questions: there are two obstacles to using Snappy with Hadoop on GCE. The first is that the native support libraries built by Apache and bundled with the Hadoop 2 tarballs are built for i386, while GCE instances are amd64. Hadoop 1 bundles binaries for both platforms, but Snappy itself cannot be located without either bundling it or modifying the environment. Because of this architecture difference, no native compressors (Snappy or otherwise) are usable in Hadoop 2, and Snappy is not easily available in Hadoop 1. The second obstacle is that libsnappy itself is not installed by default.
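
Both obstacles are easy to confirm on a cluster node (a diagnostic sketch; run it from the Hadoop installation directory, whose location depends on how the cluster was deployed):

# Which architecture do the bundled native libraries target?
file lib/native/Linux-amd64-64/libhadoop.so   # Hadoop 1 layout
file lib/native/libhadoop.so                  # Hadoop 2 layout

# Is libsnappy installed on the system at all?
ls /usr/lib/libsnappy*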

The easiest way to overcome both of these is to create your own Hadoop tarball containing amd64 native Hadoop libraries as well as libsnappy. The steps below should help you do this and stage the resulting tarball for use by bdutil.

To start, launch a new GCE VM using a Debian Wheezy backports image and grant the VM's service account read/write access to Cloud Storage. We'll use this as our build machine, and we can safely discard it as soon as we're done building and storing the binaries.
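
For example, the build VM could be created along the following lines (a sketch using the gcloud CLI; the instance name, zone, and image name are placeholders that you would replace with a current Debian Wheezy backports image):

gcloud compute instances create hadoop-build \
    --zone us-central1-a \
    --image-project debian-cloud \
    --image backports-debian-7-wheezy-v20140814 \
    --scopes storage-rw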

Building Hadoop 1.2.1 with Snappy

SSH to your new instance and run the following commands, checking for any errors along the way:

sudo apt-get update
sudo apt-get install pkg-config libsnappy-dev libz-dev libssl-dev gcc make cmake automake autoconf libtool g++ openjdk-7-jdk maven ant

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/

wget http://apache.mirrors.lucidnetworks.net/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz

tar zxvf hadoop-1.2.1.tar.gz 
pushd hadoop-1.2.1/

# Bundle libsnappy so we don't have to apt-get install it on each machine
cp /usr/lib/libsnappy* lib/native/Linux-amd64-64/

# Test to make certain Snappy is being loaded and is working:
bin/hadoop jar ./hadoop-test-1.2.1.jar testsequencefile -seed 0 -count 1000 -compressType RECORD xxx -codec org.apache.hadoop.io.compress.SnappyCodec -check

# Create a new tarball of Hadoop 1.2.1:
popd
rm hadoop-1.2.1.tar.gz
tar zcvf hadoop-1.2.1.tar.gz hadoop-1.2.1/

# Store the tarball on GCS: 
gsutil cp hadoop-1.2.1.tar.gz gs://<some bucket>/hadoop-1.2.1.tar.gz

Building Hadoop 2.4.1 with Snappy

SSH to your new instance and run the following commands, checking for any errors along the way:

sudo apt-get update
sudo apt-get install pkg-config libsnappy-dev libz-dev libssl-dev gcc make cmake automake autoconf libtool g++ openjdk-7-jdk maven ant

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/

# Protobuf 2.5.0 is required and not in Debian-backports
wget http://protobuf.googlecode.com/files/protobuf-2.5.0.tar.gz
tar xvf protobuf-2.5.0.tar.gz
pushd protobuf-2.5.0/ && ./configure && make && sudo make install && popd
sudo ldconfig

wget http://apache.mirrors.lucidnetworks.net/hadoop/common/hadoop-2.4.1/hadoop-2.4.1-src.tar.gz

# Unpack source
tar zxvf hadoop-2.4.1-src.tar.gz
pushd hadoop-2.4.1-src

# Build Hadoop
mvn package -Pdist,native -DskipTests -Dtar
pushd hadoop-dist/target/
pushd hadoop-2.4.1/

# Bundle libsnappy so we don't have to apt-get install it on each machine
cp /usr/lib/libsnappy* lib/native/

# Test that everything is working:
bin/hadoop jar share/hadoop/common/hadoop-common-2.4.1-tests.jar org.apache.hadoop.io.TestSequenceFile -seed 0 -count 1000 -compressType RECORD xxx -codec org.apache.hadoop.io.compress.SnappyCodec -check

popd

# Create a new tarball with libsnappy:
rm hadoop-2.4.1.tar.gz
tar zcf hadoop-2.4.1.tar.gz hadoop-2.4.1/

# Store the new tarball on GCS:
gsutil cp hadoop-2.4.1.tar.gz gs://<some bucket>/hadoop-2.4.1.tar.gz

popd
popd
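
As an additional sanity check (checknative is included in Hadoop 2.x), you can also ask the freshly built distribution, from inside hadoop-dist/target/hadoop-2.4.1/, which native codecs it can load:

bin/hadoop checknative -a

Snappy should be reported as true.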

Updating bdutil_env.sh or hadoop2_env.sh

Once you have a Hadoop version with the correct native libraries bundled, we can point bdutil at the new Hadoop tarball by updating either bdutil_env.sh for Hadoop 1 or hadoop2_env.sh for Hadoop 2. In either case, open the appropriate file and look for a block along the lines of:

# URI of Hadoop tarball to be deployed. Must begin with gs:// or http(s)://
# Use 'gsutil ls gs://hadoop-dist/hadoop-*.tar.gz' to list Google supplied options
HADOOP_TARBALL_URI='gs://hadoop-dist/hadoop-1.2.1-bin.tar.gz'

and change the URI to point at the tarball stored above, e.g.:

HADOOP_TARBALL_URI='gs://<some bucket>/hadoop-1.2.1.tar.gz'
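
After saving that change, redeploy so the new tarball is picked up, for example (assuming the standard bdutil invocation, with hadoop2_env.sh passed via -e when targeting Hadoop 2):

# Hadoop 1 (bdutil_env.sh)
./bdutil deploy

# Hadoop 2 (hadoop2_env.sh)
./bdutil -e hadoop2_env.sh deploy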
