How to create output files with a fixed number of lines in Hadoop/MapReduce?


Problem description


    Let's say we have N input files, each with a different number of lines. We need to generate output files such that each output file has exactly K lines (except the last one, which can have fewer than K records).

    • Is it possible to do this using a single MR job?
    • We should open the files for writing explicitly in the reducer.
    • The records in the output should be shuffled.

    thanks,
    Paramesh

    Solution

    Assume that the input file has 990 records, which have to be split into 9 files of 100 records each plus a last file of 90 records, for a total of 10 files.

    Use NLineInputFormat and set mapred.line.input.format.linespermap to 100, so that each mapper processes 100 lines of the input data set. Set the number of reducers to 10, which is the number of output files.
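The configuration described above could be wired into a job driver roughly as follows. This is a minimal sketch using the newer org.apache.hadoop.mapreduce API, where NLineInputFormat.setNumLinesPerSplit sets the lines-per-map value (the property name quoted in the answer is the older spelling of the same setting). The class name FixedLinesDriver and the job name are hypothetical, and the mapper, reducer, and key/value classes are deliberately left out.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver class; requires the Hadoop client libraries on the classpath.
public class FixedLinesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "fixed-lines-split");
        job.setJarByClass(FixedLinesDriver.class);

        // Each input split, and therefore each mapper, receives 100 lines.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 100);

        // One reducer per expected output file (10 files for 990 records).
        job.setNumReduceTasks(10);

        // Mapper, reducer, and output key/value classes would be set here.

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The values 100 and 10 are hard-coded only to match the 990-record example; in practice they would presumably be derived from K and the total record count.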

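    In the mapper, emit a key between 1 and 10 (one key per output file) and emit the input record as the value. Make sure that the keys emitted by the mappers are balanced between 1 and 10 and not skewed, since the number of records sharing a key determines how many lines end up in the corresponding output file.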
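To see how the key choice drives the per-file line counts, here is a small plain-Java simulation with no Hadoop dependency. It assumes each record has a known 0-based position in the input (something Hadoop does not hand to a mapper directly) and assigns consecutive blocks of K records to the same key. The class and method names are hypothetical and only illustrate the arithmetic.

```java
import java.util.Map;
import java.util.TreeMap;

public class KeyAssignmentSketch {
    // Hypothetical helper: maps a 0-based record index to a 1-based output-file key,
    // so that records 0..K-1 get key 1, records K..2K-1 get key 2, and so on.
    public static int keyFor(int recordIndex, int linesPerFile) {
        return recordIndex / linesPerFile + 1;
    }

    // Counts how many records each key (one key per reducer/output file) would receive.
    public static Map<Integer, Integer> countsPerKey(int totalRecords, int linesPerFile) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int i = 0; i < totalRecords; i++) {
            counts.merge(keyFor(i, linesPerFile), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // 990 records with K = 100, as in the example above.
        Map<Integer, Integer> counts = countsPerKey(990, 100);
        counts.forEach((key, n) -> System.out.println("key " + key + " -> " + n + " records"));
    }
}
```

With 990 records and K = 100, keys 1 through 9 each receive 100 records and key 10 receives the remaining 90. Note that a plain round-robin over the 10 keys would instead give 99 records per key, so "balanced" here has to mean full blocks of K. If the keys are emitted as IntWritable, the default HashPartitioner should send each of the keys 1 to 10 to a distinct reducer when there are exactly 10 reducers.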
