How to use Hadoop Streaming with LZO-compressed Sequence Files?


Problem description



I'm trying to play around with the Google ngrams dataset using Amazon's Elastic Map Reduce. There's a public dataset at http://aws.amazon.com/datasets/8172056142375670, and I want to use Hadoop streaming.

For the input files, it says "We store the datasets in a single object in Amazon S3. The file is in sequence file format with block level LZO compression. The sequence file key is the row number of the dataset stored as a LongWritable and the value is the raw data stored as TextWritable."

What do I need to do in order to process these input files with Hadoop Streaming?

I tried adding an extra "-inputformat SequenceFileAsTextInputFormat" to my arguments, but this doesn't seem to work -- my jobs keep failing for some unspecified reason. Are there other arguments I'm missing?

I've tried using a very simple identity as both my mapper and reducer

#!/usr/bin/env ruby

STDIN.each do |line|
  puts line
end

but this doesn't work.

Solution

lzo is packaged as part of elastic mapreduce so there's no need to install anything.

I just tried this and it works:

 hadoop jar ~hadoop/contrib/streaming/hadoop-streaming.jar \
  -D mapred.reduce.tasks=0 \
  -input s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/ \
  -inputformat SequenceFileAsTextInputFormat \
  -output test_output \
  -mapper org.apache.hadoop.mapred.lib.IdentityMapper
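If you want your own script instead of the identity mapper, note that SequenceFileAsTextInputFormat hands each record to a streaming mapper as the key and value joined by a tab. A minimal Ruby sketch (the script name and helper function are hypothetical) that drops the LongWritable row-number key and emits only the raw data:

```ruby
#!/usr/bin/env ruby
# Hypothetical streaming mapper: SequenceFileAsTextInputFormat presents each
# record to stdin as "key<TAB>value"; drop the row-number key, keep the value.

def strip_key(line)
  # split at the first tab only, so tabs inside the value survive
  _key, value = line.chomp.split("\t", 2)
  value
end

if __FILE__ == $PROGRAM_NAME
  STDIN.each_line do |line|
    value = strip_key(line)
    puts value unless value.nil?
  end
end
```

Such a script would replace the IdentityMapper line above, e.g. `-mapper mapper.rb -file mapper.rb` so Streaming ships the script to the cluster.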
