Merging multiple files into one within Hadoop
Problem Description
I get multiple small files in my input directory which I want to merge into a single file, without using the local file system or writing mapreds. Is there a way to do this using hadoop fs commands or Pig?

Thanks!
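For context on the constraint (paths below are placeholders): hadoop fs -getmerge writes the merged result to the local filesystem, and piping hadoop fs -cat into hadoop fs -put - avoids local disk but still funnels every byte through the single client machine, so neither keeps the work entirely on the cluster.

# Merges HDFS files into a LOCAL file, so the data leaves HDFS:
hadoop fs -getmerge /user/me/input /tmp/merged.txt

# Streams every input file through the client before writing back to HDFS:
hadoop fs -cat /user/me/input/* | hadoop fs -put - /user/me/merged.txt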
Answer

To keep everything on the grid, use Hadoop streaming with a single reducer and cat as both the mapper and the reducer (basically a no-op); compression can be added with MR flags.
hadoop jar \
$HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \
-Dmapred.reduce.tasks=1 \
-Dmapred.job.queue.name=$QUEUE \
-input "$INPUT" \
-output "$OUTPUT" \
-mapper cat \
-reducer cat
If you want compression, add:
-Dmapred.output.compress=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
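Because a single reducer funnels every record into one output, the merged result lands in a single part file under $OUTPUT (with a .gz suffix when the compression flags are used; the exact name, part-00000 vs part-r-00000, depends on the Hadoop version). As a quick sanity check, hadoop fs -text decompresses gzip output transparently:

hadoop fs -ls "$OUTPUT"
hadoop fs -text "$OUTPUT"/part-* | head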