hadoop streaming produces uncompressed files despite mapred.output.compress=true
Question
I run a hadoop streaming job like this:
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -Dmapred.reduce.tasks=16 \
    -Dmapred.output.compres=true \
    -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -input foo \
    -output bar \
    -mapper "python zot.py" \
    -reducer /bin/cat
I do get 16 files in the output directory which contain the correct data, but the files are not compressed:
$ hadoop fs -get bar/part-00012
$ file part-00012
part-00012: ASCII text, with very long lines
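(Aside: besides running file on a fetched copy, compression can be checked programmatically from the first two bytes; gzip output always starts with the magic bytes 0x1f 0x8b. A minimal sketch with throwaway demo files, not the actual part files:)

```python
import gzip

# Gzip streams begin with the magic bytes 0x1f 0x8b; plain text does not.
def looks_gzipped(path):
    with open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"

# demo: one compressed file and one plain file
with gzip.open("demo.gz", "wb") as f:
    f.write(b"hello\n")
with open("demo.txt", "wb") as f:
    f.write(b"hello\n")

print(looks_gzipped("demo.gz"), looks_gzipped("demo.txt"))  # True False
```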
- why is part-00012 not compressed?
- how do I have my data set split into a small number (say, 16) gzip-compressed files?
PS. See also "Using gzip as a reducer produces corrupt data"
PPS. This is for vw.
PPPS. I guess I can do hadoop fs -get, gzip, hadoop fs -put, hadoop fs -rm 16 times, but this seems like a very non-hadoopic way.
There is a typo in your mapred.output.compres parameter: it should be mapred.output.compress. If you look through your job history, I'll bet compression is turned off.
Also, you could avoid the reduce stage altogether, since it is just catting files. Unless you specifically need 16 part files, try leaving the job map-only:
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -Dmapred.reduce.tasks=0 \
    -Dmapred.output.compress=true \
    -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -input foo \
    -output bar \
    -mapper "python zot.py"