How to use Hadoop Streaming with LZO-compressed Sequence Files?


Question


I'm trying to play around with the Google ngrams dataset using Amazon's Elastic Map Reduce. There's a public dataset at http://aws.amazon.com/datasets/8172056142375670, and I want to use Hadoop streaming.

For the input files, it says "We store the datasets in a single object in Amazon S3. The file is in sequence file format with block level LZO compression. The sequence file key is the row number of the dataset stored as a LongWritable and the value is the raw data stored as TextWritable."

What do I need to do in order to process these input files with Hadoop Streaming?

I tried adding an extra "-inputformat SequenceFileAsTextInputFormat" to my arguments, but this doesn't seem to work -- my jobs keep failing for some unspecified reason. Are there other arguments I'm missing?

I've tried using a very simple identity as both my mapper and reducer

#!/usr/bin/env ruby

STDIN.each do |line|
  puts line
end

but this doesn't work.

Solution

LZO is packaged as part of Elastic MapReduce, so there's no need to install anything.

I just tried this and it works:

 hadoop jar ~hadoop/contrib/streaming/hadoop-streaming.jar \
  -D mapred.reduce.tasks=0 \
  -input s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/ \
  -inputformat SequenceFileAsTextInputFormat \
  -output test_output \
  -mapper org.apache.hadoop.mapred.lib.IdentityMapper
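To swap in your own script for the IdentityMapper, it helps to know what SequenceFileAsTextInputFormat actually hands a streaming mapper: each record arrives on STDIN as text in the form key<TAB>value, so the LongWritable row number shows up as the first tab-separated field. A minimal Ruby mapper sketch (strip_sequence_key is a hypothetical helper name, not part of any library) that drops that key and emits only the raw ngram value:

```ruby
#!/usr/bin/env ruby
# Streaming mapper sketch. SequenceFileAsTextInputFormat converts each
# sequence-file record to "key<TAB>value" text, so the row-number key is
# the first tab-separated field of every input line.

# Return the value portion of a "key<TAB>value" record; if a line has no
# tab at all, pass it through unchanged.
def strip_sequence_key(line)
  key, value = line.chomp.split("\t", 2)
  value.nil? ? key : value
end

# Only run the STDIN loop when executed as a script, so the helper can
# also be required and tested in isolation.
if __FILE__ == $PROGRAM_NAME
  STDIN.each_line do |line|
    puts strip_sequence_key(line)
  end
end
```

You would pass this script to the streaming job with -mapper and ship it with -file (or leave the reducer off, as above, with mapred.reduce.tasks=0).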
