How to use Hadoop Streaming with LZO-compressed Sequence Files?
Question
I'm trying to play around with the Google ngrams dataset using Amazon's Elastic Map Reduce. There's a public dataset at http://aws.amazon.com/datasets/8172056142375670, and I want to use Hadoop streaming.
For the input files, it says "We store the datasets in a single object in Amazon S3. The file is in sequence file format with block level LZO compression. The sequence file key is the row number of the dataset stored as a LongWritable and the value is the raw data stored as TextWritable."
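In Streaming terms, that layout means each record should reach the mapper on stdin as a single line of text: the row-number key, a tab, then the raw row, since SequenceFileAsTextInputFormat renders both key and value as strings. A minimal Ruby sketch of a mapper that drops the key and keeps only the raw data (assuming the default tab separator):

#!/usr/bin/env ruby
# Each input line is expected to look like "<row number>\t<raw row text>".
STDIN.each do |line|
  key, value = line.chomp.split("\t", 2)
  puts value unless value.nil?  # emit only the raw row, dropping the key
end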
What do I need to do in order to process these input files with Hadoop Streaming?
I tried adding an extra "-inputformat SequenceFileAsTextInputFormat" to my arguments, but this doesn't seem to work -- my jobs keep failing for some unspecified reason. Are there other arguments I'm missing?
I've tried using a very simple identity script as both my mapper and reducer:
#!/usr/bin/env ruby
# Identity mapper/reducer: pass every input line through unchanged.
STDIN.each do |line|
  puts line
end
but this doesn't work.
Answer

LZO is packaged as part of Elastic MapReduce, so there's no need to install anything.

I just tried this and it works:
hadoop jar ~hadoop/contrib/streaming/hadoop-streaming.jar \
    -D mapred.reduce.tasks=0 \
    -input s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/ \
    -inputformat SequenceFileAsTextInputFormat \
    -output test_output \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper
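To run your own script instead of the built-in identity mapper, Streaming's -file option ships it out to the cluster with the job. A sketch, assuming the key-dropping mapper above is saved as value_mapper.rb and made executable (the script name and output path are illustrative):

hadoop jar ~hadoop/contrib/streaming/hadoop-streaming.jar \
    -D mapred.reduce.tasks=0 \
    -input s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/ \
    -inputformat SequenceFileAsTextInputFormat \
    -output test_output_values \
    -mapper value_mapper.rb \
    -file value_mapper.rb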