Hadoop streaming: single file or multiple files per map. Don't split


Problem description



I have a lot of zip files that need to be processed by a C++ library, so I wrote my Hadoop streaming program in C++. The program reads a zip file, unzips it, and processes the extracted data. My problems are:

1. My mapper can't get the content of exactly one file. It usually gets something like 2.4 files or 3.2 files: Hadoop sends several files to my mapper, but at least one of them is partial. Zip files can't be processed like that. Can I get exactly one whole file per map? I don't want to use a list of file names as input and read the files from my program, because I want to keep the advantage of data locality.

2. I can accept the contents of multiple zip files per map, as long as Hadoop doesn't split the zip files. I mean exactly 1, 2, or 3 files, not something like 2.3 files. Actually that would be even better, because my program needs to load a data file of about 800 MB to process the unzipped data. Can we do this?

Solution

You can find the solution here:

http://wiki.apache.org/hadoop/FAQ#How_do_I_get_each_of_a_job.27s_maps_to_work_on_one_complete_input-file_and_not_allow_the_framework_to_split-up_the_files.3F

The easiest way I would suggest is to set mapred.min.split.size to a large value so that your files do not get split.
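For a streaming job, that property can be passed with the generic -D option. A minimal sketch of such an invocation, assuming an old-style Hadoop install; the streaming jar location, the input/output paths, the mapper binary name, and the split-size value (roughly 100 GB here) are all placeholders you would adjust:

# Sketch only: paths and values are placeholders.
# Generic -D options must come before the streaming-specific options.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.min.split.size=107374182400 \
    -D mapred.reduce.tasks=0 \
    -input /user/me/zips \
    -output /user/me/out \
    -mapper ./zip_mapper \
    -file zip_mapper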

If this does not work, then you would need to implement a custom InputFormat, which is not very difficult to do; you can find the steps at: http://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat
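As a rough sketch of that route (not the tutorial's exact code): with the old mapred API that streaming's -inputformat option expects, it can be as small as subclassing TextInputFormat and overriding isSplitable to return false. The package and class names below are made up, and whether TextInputFormat is the right base class depends on how your streaming mapper consumes its input:

// Sketch of a non-splittable input format for the old mapred API.
// Package and class names are placeholders.
package com.example.streaming;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {
    // Returning false tells the framework never to split an input file,
    // so a map task always receives whole files rather than fragments.
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}

You would then compile this into a jar, ship it with the generic -libjars option, and point the streaming job at it with -inputformat com.example.streaming.NonSplittableTextInputFormat (again, the names are placeholders).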
