How to run external program within mapper or reducer giving HDFS files as input and storing output files in HDFS?


Problem description

I have an external program which takes a file as input and produces an output file:

     # for example
     input file:  IN_FILE
     output file: OUT_FILE

     # run the external program
     ./vx < ${IN_FILE} > ${OUT_FILE}

I want both input and output files in HDFS.

I have a cluster with 8 nodes, and I have 8 input files, each with one line:

    //1 input file :       1.txt 
           1:0,0,0
    //2 input file :       2.txt 
           2:0,0,128
    //3 input file :       3.txt 
           3:0,128,0
    //4 input file :       4.txt 
           4:0,128,128
    //5 input file :       5.txt 
           5:128,0,0
    //6 input file :       6.txt 
           6:128,0,128
    //7 input file :       7.txt 
           7:128,128,0
    //8 input file :       8.txt 
           8:128,128,128

I am using KeyValueTextInputFormat:

               key   : file name
               value : initial coordinates

For example, for the 5th file:

              key :5
              value:128,0,0

Each map task generates a huge amount of data according to its initial coordinates.

Now I want to run the external program in each map task and generate an output file.

But I am confused about how to do that with files in HDFS.

I can use a zero-reducer job and create a file in HDFS:

         Configuration conf = new Configuration();
         FileSystem fs = FileSystem.get(conf);
         Path outFile = new Path(INPUT_FILE_NAME);
         FSDataOutputStream out = fs.create(outFile);

         // generating data ........ and writing to HDFS
         out.writeUTF(lon + ";" + lat + ";" + depth + ";");

I am confused about how to run the external program against an HDFS file without first copying the file into a local directory with `dfs -get`.

Without using MR, I get the results with the following shell script:

#!/bin/bash

if [ $# -lt 2 ]; then
    printf "Usage: %s: <infile> <outfile>\n" "$(basename "$0")" >&2
    exit 1
fi

IN_FILE=/Users/x34/data/$1
OUT_FILE=/Users/x34/data/$2

cd "/Users/x34/Projects/externalprogram/model/"

./vx < ${IN_FILE} > ${OUT_FILE}

paste ${IN_FILE} ${OUT_FILE} | awk '{print $1,"\t",$2,"\t",$3,"\t",$4,"\t",$5,"\t",$22,"\t",$23,"\t",$24}' > /Users/x34/data/combined
if [ $? -ne 0 ]; then
    exit 1
fi

exit 0

Then I run it with:

         ProcessBuilder pb = new ProcessBuilder("SHELL_SCRIPT","in", "out"); 
         Process p = pb.start();
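For what it's worth, the `./vx < IN > OUT` redirection can also be reproduced without a shell by wiring the external process's stdin/stdout directly. A minimal Python sketch of that pattern (the questioner's `vx` binary is not available, so `cat` is used below as a stand-in filter):

```python
import subprocess

def run_external_filter(exe_args, input_text):
    """Run an external filter program, feeding input_text to its stdin
    and returning whatever it writes to stdout (like ./vx < IN > OUT)."""
    result = subprocess.run(
        exe_args,
        input=input_text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the program exits non-zero
    )
    return result.stdout

# stand-in for ./vx: cat simply echoes its stdin back
print(run_external_filter(["cat"], "5:128,0,0\n"), end="")
```

Inside a mapper, the same idea applies: stream the HDFS record to the process's stdin and collect its stdout, so nothing has to be staged on local disk first.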

I would much appreciate any ideas on how to use Hadoop Streaming, or any other way, to run the external program. I want both the INPUT and OUTPUT files in HDFS for further processing.

Please help.

Recommended answer

You could employ Hadoop Streaming for that:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper myPythonScript.py \
    -reducer /bin/wc \
    -file myPythonScript.py \
    -file myDictionary.txt

See https://hadoop.apache.org/docs/r1.0.4/streaming.pdf for some examples.

Another nice article: http://princetonits.com/blog/technology/hadoop-mapreduce-streaming-using-bash-script/

Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.

Another example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/wc

In the above example, both the mapper and the reducer are executables that read the input from stdin (line by line) and emit the output to stdout. The utility will create a Map/Reduce job, submit the job to an appropriate cluster, and monitor the progress of the job until it completes.

When an executable is specified for mappers, each mapper task will launch the executable as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feeds the lines to the stdin of the process. In the meantime, the mapper collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. If there is no tab character in the line, then the entire line is considered the key and the value is null. However, this can be customized, as discussed later.
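The tab-splitting convention described above can be modeled in a few lines of Python (a simplified sketch of what streaming does with each output line, not Hadoop's actual implementation):

```python
def split_streaming_line(line):
    """Split a streaming output line into (key, value) on the first tab.
    If there is no tab, the whole line becomes the key and the value is empty."""
    line = line.rstrip("\n")
    if "\t" in line:
        key, value = line.split("\t", 1)
        return key, value
    return line, ""

print(split_streaming_line("5\t128,0,0"))   # key "5", value "128,0,0"
print(split_streaming_line("just-a-key"))   # no tab: whole line is the key
```

Note the split happens only on the *first* tab, so values may themselves contain tabs.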

When an executable is specified for reducers, each reducer task will launch the executable as a separate process when the reducer is initialized. As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the stdin of the process. In the meantime, the reducer collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. However, this can be customized.
