How to use Hadoop Streaming with LZO-compressed Sequence Files?
Question
I'm trying to play around with the Google ngrams dataset using Amazon's Elastic Map Reduce. There's a public dataset at http://aws.amazon.com/datasets/8172056142375670, and I want to use Hadoop streaming.
For the input files, it says "We store the datasets in a single object in Amazon S3. The file is in sequence file format with block level LZO compression. The sequence file key is the row number of the dataset stored as a LongWritable and the value is the raw data stored as TextWritable."
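In Streaming terms, that layout means each record should reach the mapper on stdin as a single line of text: the row-number key, a tab, then the raw row, since SequenceFileAsTextInputFormat renders both key and value as strings. A minimal Ruby sketch of a mapper that drops the key and keeps only the raw data (assuming the default tab separator):

#!/usr/bin/env ruby
# Each input line is expected to look like "<row number>\t<raw row text>".
STDIN.each do |line|
  key, value = line.chomp.split("\t", 2)
  puts value unless value.nil?  # emit only the raw row, dropping the key
end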
What do I need to do in order to process these input files with Hadoop Streaming?
I tried adding an extra "-inputformat SequenceFileAsTextInputFormat" to my arguments, but this doesn't seem to work -- my jobs keep failing for some unspecified reason. Are there other arguments I'm missing?
I've tried using a very simple identity script as both my mapper and reducer:
#!/usr/bin/env ruby
# Identity mapper/reducer: pass every input line through unchanged.
STDIN.each do |line|
  puts line
end
but this doesn't work.
Answer

LZO is packaged as part of Elastic MapReduce, so there's no need to install anything.

I just tried this and it works:
hadoop jar ~hadoop/contrib/streaming/hadoop-streaming.jar \
    -D mapred.reduce.tasks=0 \
    -input s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/ \
    -inputformat SequenceFileAsTextInputFormat \
    -output test_output \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper
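To run your own script instead of the built-in identity mapper, Streaming's -file option ships it out to the cluster with the job. A sketch, assuming the key-dropping mapper above is saved as value_mapper.rb and made executable (the script name and output path are illustrative):

hadoop jar ~hadoop/contrib/streaming/hadoop-streaming.jar \
    -D mapred.reduce.tasks=0 \
    -input s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/ \
    -inputformat SequenceFileAsTextInputFormat \
    -output test_output_values \
    -mapper value_mapper.rb \
    -file value_mapper.rb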