How to run external program within mapper or reducer giving HDFS files as input and storing output files in HDFS?

Problem description

I have an external program which takes a file as input and produces an output file.

     //for example 
     input file: IN_FILE
     output file: OUT_FILE

    //Run External program 
     ./vx < ${IN_FILE} > ${OUT_FILE}

I want both input and output files in HDFS

I have a cluster with 8 nodes, and I have 8 input files, each with 1 line:

    //1 input file :       1.txt 
           1:0,0,0
    //2 input file :       2.txt 
           2:0,0,128
    //3 input file :       3.txt 
           3:0,128,0
    //4 input file :       4.txt 
           4:0,128,128
    //5 input file :       5.txt 
           5:128,0,0
    //6 input file :       6.txt 
           6:128,0,128
    //7 input file :       7.txt 
           7:128,128,0
    //8 input file :       8.txt 
           8:128,128,128

I am using KeyValueTextInputFormat

               key :file name
               value: initial coordinates

For example, for the 5th file:

              key :5
              value:128,0,0

Each map task generates a huge amount of data according to its initial coordinates.

Now I want to run the external program in each map task and generate an output file.

But I am confused about how to do that with files in HDFS.

I can use zero reducers and create a file in HDFS:

         Configuration conf = new Configuration();
         FileSystem fs = FileSystem.get(conf);
         Path outFile;
         outFile = new Path(INPUT_FILE_NAME);
         FSDataOutputStream out = fs.create(outFile);

         //generating data ........ and writing to HDFS 
          out.writeUTF(lon + ";" + lat + ";" + depth + ";");
          out.close();

I am confused about how to run the external program with an HDFS file without first pulling the file into a local directory with dfs -get.
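
Sticking with plain streams, one way to avoid dfs -get entirely is to pipe through the hadoop fs shell, so both files stay in HDFS and nothing is staged on the local disk. A minimal sketch, with illustrative paths, assuming a Hadoop version whose fs -put accepts "-" to read from stdin:

    # Hypothetical HDFS paths -- adjust to the real layout.
    IN=/data/in/5.txt
    OUT=/data/out/5.out

    # Stream the HDFS input through ./vx and write its stdout straight
    # back to HDFS, with no local copy of either file.
    hadoop fs -cat "${IN}" | ./vx | hadoop fs -put - "${OUT}"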

Without using MR, I am getting the results with a shell script as follows:

#!/bin/bash

if [ $# -lt 2 ]; then
    printf "Usage: %s: <infile> <outfile> \n" $(basename $0) >&2
    exit 1
fi

IN_FILE=/Users/x34/data/$1
OUT_FILE=/Users/x34/data/$2                     

cd "/Users/x34/Projects/externalprogram/model/"

./vx < ${IN_FILE} > ${OUT_FILE}

paste ${IN_FILE} ${OUT_FILE} | awk '{print $1,"\t",$2,"\t",$3,"\t",$4,"\t",$5,"\t",$22,"\t",$23,"\t",$24}' > /Users/x34/data/combined
if [ $? -ne 0 ]; then
    exit 1
fi                      

exit 0

And then I run it with

         ProcessBuilder pb = new ProcessBuilder("SHELL_SCRIPT","in", "out"); 
         Process p = pb.start();
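
Note that pb.start() only launches the script: to keep it from hanging or failing silently, you would normally also drain the child's stdout/stderr (p.getInputStream() / p.getErrorStream()) so it cannot block on a full pipe buffer, and check the exit code with p.waitFor().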

I would much appreciate any ideas on how to use Hadoop Streaming, or any other way, to run the external program. I want both the INPUT and OUTPUT files in HDFS for further processing.

Please help

Solution

You could employ Hadoop Streaming for that:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper myPythonScript.py \
-reducer /bin/wc \
-file myPythonScript.py \
-file myDictionary.txt

See https://hadoop.apache.org/docs/r1.0.4/streaming.pdf for some examples.

Also a nice article: http://princetonits.com/blog/technology/hadoop-mapreduce-streaming-using-bash-script/

Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.

Another example:

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/wc

In the above example, both the mapper and the reducer are executables that read the input from stdin (line by line) and emit the output to stdout. The utility will create a Map/Reduce job, submit the job to an appropriate cluster, and monitor the progress of the job until it completes.

When an executable is specified for mappers, each mapper task will launch the executable as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feeds the lines to the stdin of the process. In the meantime, the mapper collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. If there is no tab character in the line, then the entire line is considered the key and the value is null. However, this can be customized.

When an executable is specified for reducers, each reducer task will launch the executable as a separate process when the reducer is initialized. As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the stdin of the process. In the meantime, the reducer collects the line-oriented outputs from the stdout of the process, converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. However, this can be customized.
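
Putting this together for the vx program from the question, a small wrapper script could act as the streaming mapper, with -file shipping both the wrapper and the vx binary into each task's working directory. This is only a sketch under assumptions the question does not state: vx is taken to accept the coordinate line on stdin and write a single result line to stdout, and the file names and HDFS paths (vx_mapper.sh, /data/coords, /data/vx-out) are purely illustrative.

    #!/bin/bash
    # vx_mapper.sh -- hypothetical streaming mapper wrapping ./vx.
    # Each input line looks like "5:128,0,0"; emit "key<TAB>vx output".
    while IFS= read -r line; do
        key=${line%%:*}       # e.g. "5"
        coords=${line#*:}     # e.g. "128,0,0"
        # Assumes vx reads the coordinates on stdin and prints one line.
        result=$(printf '%s\n' "${coords}" | ./vx)
        printf '%s\t%s\n' "${key}" "${result}"
    done

A map-only submission along the lines of the examples above might then look like:

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -input /data/coords \
        -output /data/vx-out \
        -mapper vx_mapper.sh \
        -file vx_mapper.sh \
        -file vx \
        -numReduceTasks 0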
