How to create output files with a fixed number of lines in hadoop/map reduce?
Problem description
Let's say we have N input files with different numbers of lines. We need to generate output files such that each output file has exactly K lines (except the last one, which can have fewer than K records).
- Is it possible to do this using a single MR job?
- We should open the files for writing explicitly in the reducer.
- The records in the output should be shuffled.
thanks,
Paramesh
Assume that the input file has 990 records, which have to be split into 9 files of 100 records each and a last file of 90 records, for a total of 10 files.
Use NLineInputFormat and set mapred.line.input.format.linespermap to 100. This way each mapper will process 100 lines from the input data set. Set the number of reducers to 10, which is the number of output files.
In the mapper, emit a key between 1 and 10 (the number of output files) and emit the input record as the value. Make sure that the keys emitted by the mappers are balanced between 1 and 10 and not skewed.
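The answer does not say how each mapper should pick its key, and that choice decides how many records land in each output file. The sketch below is a plain-Java simulation with no Hadoop dependency that counts records per key for the 990-record example under two hypothetical schemes. Both schemes, the class name and the method names are assumptions made for illustration, and the simulation assumes each key reaches its own reducer.

```java
import java.util.Arrays;

// Plain-Java simulation (no Hadoop classes) of how the mapper's key choice
// determines the number of records per output file.
final class FixedLineSplit {

    // Hypothetical scheme A: each mapper tags all of its lines with a single key,
    // its 1-based split index. Returns record counts per key (index 0 = key 1).
    static int[] perSplitKeyCounts(int totalRecords, int linesPerMap) {
        int numSplits = (totalRecords + linesPerMap - 1) / linesPerMap; // ceiling division
        int[] counts = new int[numSplits];
        for (int record = 0; record < totalRecords; record++) {
            int splitIndex = record / linesPerMap; // which mapper reads this line
            counts[splitIndex]++;                  // key = splitIndex + 1
        }
        return counts;
    }

    // Hypothetical scheme B: each mapper cycles through keys 1..numKeys, line by line.
    static int[] roundRobinKeyCounts(int totalRecords, int linesPerMap, int numKeys) {
        int[] counts = new int[numKeys];
        for (int record = 0; record < totalRecords; record++) {
            int lineInSplit = record % linesPerMap; // position inside the mapper's split
            counts[lineInSplit % numKeys]++;        // key = (lineInSplit % numKeys) + 1
        }
        return counts;
    }

    public static void main(String[] args) {
        // Scheme A: nine keys with 100 records, one key with 90
        System.out.println(Arrays.toString(perSplitKeyCounts(990, 100)));
        // Scheme B: ten keys with 99 records each
        System.out.println(Arrays.toString(roundRobinKeyCounts(990, 100, 10)));
    }
}
```

Under these assumptions, only the per-split scheme reproduces the 9 files of 100 records plus one file of 90. Round-robin keys are evenly balanced but give 10 files of 99 records each, so "balanced" on its own may not be enough when every file must hold exactly K lines.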