Multiple Output Files for Hadoop Streaming with Python Mapper


Problem Description

I am looking for a little clarification on the answers to this question here:

Generating Separate Output files in Hadoop Streaming

My use case is as follows:

I have a map-only mapreduce job that takes an input file, does a lot of parsing and munging, and then writes back out. However, certain lines may be in an incorrect format, and if that is the case, I would like to write the original line to a separate file.

It seems that one way to do this would be to prepend the name of the file to the line I am printing and use the multipleOutputFormat parameter. For example, if I originally had:

if line_is_valid(line):
    print name + '\t' + comments

I could instead do:

if line_is_valid(line):
    print valid_file_name + '\t' + name + '\t' + comments
else:
    print err_file_name + '\t' + line

The only problem I have with this solution is that I don't want the file_name to appear as the first column in the text files. I suppose I could then run another job to strip out the first column of each file, but that seems kind of silly. So:

1) Is this the correct way to manage multiple output files with a python mapreduce job?

2) What is the best way to get rid of that initial column?

Solution

You can do something like the following, but it involves a little Java compiling, which shouldn't be a problem if you want your use case handled with Python anyway. As far as I know, it is not directly possible from Python to drop the filename tag from the final output in a single job, as your use case demands. But what's shown below makes it possible with ease!

Here is the Java class that needs to be compiled -

package com.custom;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class CustomMultiOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    /**
     * Use the key as part of the path for the final output file.
     */
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String leaf) {
        return new Path(key.toString(), leaf).toString();
    }

    /**
     * Discard the key in the written records, as per your requirement.
     */
    @Override
    protected Text generateActualKey(Text key, Text value) {
        return null;
    }
}
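A note on how this plugs into streaming: by default, streaming treats everything up to the first tab in each line your mapper prints as the key and the rest as the value. With the tagging scheme above, the key handed to generateFileNameForKeyValue is valid_file_name or err_file_name (which becomes the output subdirectory), and generateActualKey then drops that tag from the written records. Note also that MultipleTextOutputFormat comes from the old org.apache.hadoop.mapred API, which is the API streaming's -outputformat option expects.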

Steps to compile:

  1. Save the text above to a file named exactly CustomMultiOutputFormat.java (no different name)
  2. From the directory where you saved that file, type -

    $JAVA_HOME/bin/javac -cp $(hadoop classpath) -d . CustomMultiOutputFormat.java

  3. Make sure JAVA_HOME is set to /path/to/your/SUNJDK before attempting the above command.

  4. Make your custom.jar file using (type exactly) -

    $JAVA_HOME/bin/jar cvf custom.jar com/custom/CustomMultiOutputFormat.class

  5. Finally, run your job like -

    hadoop jar /path/to/your/hadoop-streaming-*.jar -libjars custom.jar -outputformat com.custom.CustomMultiOutputFormat -file your_script.py -input inputpath -numReduceTasks 0 -output outputpath -mapper your_script.py
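
For completeness, here is a minimal sketch of what your_script.py could look like under this tagging scheme. The line_is_valid check and the name/comments field layout are hypothetical placeholders taken from the question; substitute your real parsing:

#!/usr/bin/env python
import sys

valid_file_name = 'valid_file_name'
err_file_name = 'err_file_name'

def line_is_valid(line):
    # Hypothetical check -- replace with your real validation logic.
    return line.count('\t') == 1

for line in sys.stdin:
    line = line.rstrip('\n')
    if line_is_valid(line):
        name, comments = line.split('\t')
        # The tag becomes the streaming key, i.e. the output subdirectory.
        print valid_file_name + '\t' + name + '\t' + comments
    else:
        print err_file_name + '\t' + line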

After doing these steps you should see two directories inside your outputpath, one named valid_file_name and the other named err_file_name. All records tagged with valid_file_name will go to the valid_file_name directory, and all records tagged with err_file_name will go to the err_file_name directory.
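
Concretely, since generateFileNameForKeyValue joins the tag with the default leaf name (e.g. part-00000 for the first map task), the layout should look something like -

    outputpath/valid_file_name/part-00000
    outputpath/err_file_name/part-00000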

I hope all of this makes sense.
