Hadoop: searching for words from one file in another file


Problem Description




I want to build a Hadoop application that can read words from one file and search for them in another file.

If the word exists, it has to be written to one output file. If the word doesn't exist, it has to be written to another output file.

I tried a few examples in Hadoop. I have two questions:

The two files are approximately 200MB each. Checking every word against the other file might cause an out-of-memory error. Is there an alternative way of doing this?

How do I write data to different files, given that the output of Hadoop's reduce phase is written to only one file? Is it possible to have a filter for the reduce phase that writes data to different output files?

Thank you.

Solution

How I would do it:

1. split the value in 'map' by words, emit (<word>, <source>) (*1)
2. in 'reduce' you'll get: (<word>, <list of sources>)
3. check the source list (it might be long and contain both/all sources)
4. if NOT all sources are in the list, emit (<missingsource>, <word>) for each missing source (see the reducer sketch after this list)
5. job2: job.setNumReduceTasks(<numberofsources>)
6. job2: emit (<missingsource>, <word>) in 'map'
7. job2: in 'reduce', emit (null, <word>) for everything belonging to each <missingsource> (sketched after the (*1) mapper code below)

You'll end up with as many reduce outputs as there are distinct <missingsource> values, each containing the missing words for that document. You could write out the <missingsource> ONCE at the beginning of 'reduce' to mark the files.
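
A minimal sketch of the job-1 reducer (steps 2-4), assuming exactly two input files; the class name and the hard-coded source names are placeholders, and in practice you'd pass the real input paths in via the job Configuration:

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Job-1 reducer: receives (<word>, <list of sources>) and emits
    // (<missingsource>, <word>) for every source that lacks the word.
    public class MissingWordReducer extends Reducer<Text, Text, Text, Text> {

        // Placeholder names; the real values are the input paths the
        // mapper emitted, e.g. passed in via the job Configuration.
        private static final String[] ALL_SOURCES = { "fileA", "fileB" };

        private Text outkey = new Text();

        @Override
        public void reduce(Text word, Iterable<Text> sources, Context context)
                throws IOException, InterruptedException {
            // Step 3: collect the distinct sources containing this word
            // (toString() copies the value, which Hadoop reuses).
            Set<String> seen = new HashSet<String>();
            for (Text src : sources) {
                seen.add(src.toString());
            }
            // Step 4: emit the word once per source it is missing from.
            for (String source : ALL_SOURCES) {
                if (!seen.contains(source)) {
                    outkey.set(source);
                    context.write(outkey, word);
                }
            }
        }
    }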

(*1) How to find out the source in 'map' (Hadoop 0.20 API):

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // The wrapper class is added so the snippet compiles standalone;
    // its name is illustrative.
    public class SourceTaggingMapper extends Mapper<Object, Text, Text, Text> {

        private String localname;
        private Text outkey = new Text();
        private Text outvalue = new Text();

        @Override
        public void setup(Context context) throws InterruptedException, IOException {
            super.setup(context);
            // Remember which input file this record came from.
            localname = ((FileSplit) context.getInputSplit()).getPath().toString();
        }

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Step 1: split the line into words and emit (<word>, <source>).
            for (String word : value.toString().split("\\s+")) {
                if (word.isEmpty()) continue; // skip leading-whitespace artifact
                outkey.set(word);
                outvalue.set(localname);
                context.write(outkey, outvalue);
            }
        }
    }
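
And a sketch of the job-2 reducer (step 7), again with a placeholder class name. It assumes the default TextOutputFormat, which simply skips a null key or value, and it relies on the different <missingsource> keys hashing to different reduce tasks so that each output file holds the words of a single source:

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Job-2 reducer: with one reduce task per source, each output file
    // collects the missing words of (at most) one <missingsource>.
    public class PerSourceReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        public void reduce(Text missingSource, Iterable<Text> words, Context context)
                throws IOException, InterruptedException {
            // Write the source name ONCE at the top to mark the file.
            context.write(missingSource, null);
            // Step 7: emit (null, <word>) for every missing word.
            for (Text word : words) {
                context.write(null, word);
            }
        }
    }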
    
