Calling a mapreduce job from a simple java program


Question


I have been trying to call a mapreduce job from a simple Java program in the same package. I tried to reference the mapreduce jar file in my Java program and call it using the RunJar(String args[]) method, also passing the input and output paths for the mapreduce job, but the program didn't work.


How do I run such a program where I just pass the input, output, and jar paths to its main method? Is it possible to run a mapreduce job (jar) through it? I want to do this because I want to run several mapreduce jobs one after another, where my Java program would call each such job by referring to its jar file. If this is possible, I could also just use a simple servlet to make such calls and use their output files for charting purposes.


import java.util.ArrayList;
import org.apache.hadoop.util.RunJar;

public class callOther {

    public static void main(String args[]) throws Throwable
    {

        ArrayList<String> arg = new ArrayList<String>();

        String output = "/root/Desktp/output";

        // first element is the jar to run, followed by the job's own arguments
        arg.add("/root/NetBeansProjects/wordTool/dist/wordTool.jar");

        arg.add("/root/Desktop/input");
        arg.add(output);

        RunJar.main(arg.toArray(new String[0]));

    }
}

Solution

Oh please don't do it with runJar, the Java API is very good.

See how you can start a job from normal code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// create a configuration
Configuration conf = new Configuration();
// create a new job based on the configuration
Job job = new Job(conf);
// here you have to put your mapper class
job.setMapperClass(Mapper.class);
// here you have to put your reducer class
job.setReducerClass(Reducer.class);
// here you have to set the jar which is containing your
// map/reduce class, so you can use the mapper class
job.setJarByClass(Mapper.class);
// key/value of your reducer output
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
// this sets the format of your input, can also be TextInputFormat
job.setInputFormatClass(SequenceFileInputFormat.class);
// same for the output
job.setOutputFormatClass(TextOutputFormat.class);
// here you can set the path of your input
SequenceFileInputFormat.addInputPath(job, new Path("files/toMap/"));
// this deletes a possibly existing output path to prevent job failures
FileSystem fs = FileSystem.get(conf);
Path out = new Path("files/out/processed/");
fs.delete(out, true);
// finally set the empty out path
TextOutputFormat.setOutputPath(job, out);

// this waits until the job completes and prints debug output to STDOUT or whatever
// has been configured in your log4j properties.
job.waitForCompletion(true);
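Since the goal is to run several jobs one after another, note that `waitForCompletion(true)` returns a boolean success flag, so chaining jobs is just submit, wait, check, repeat. Below is a minimal sketch of that sequencing logic in plain Java; each `Callable<Boolean>` stands in for a fully configured Hadoop `Job`, and `JobChain`/`runAll` are hypothetical names used only for illustration, not Hadoop API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

public class JobChain {
    // Runs each "job" in order and stops at the first failure,
    // mirroring how job.waitForCompletion(true) returns false on failure.
    // Returns the number of jobs that completed successfully.
    static int runAll(List<Callable<Boolean>> jobs) throws Exception {
        int completed = 0;
        for (Callable<Boolean> job : jobs) {
            if (!job.call()) {
                break; // stop the chain so later jobs don't read missing output
            }
            completed++;
        }
        return completed;
    }

    public static void main(String[] args) throws Exception {
        List<Callable<Boolean>> jobs = new ArrayList<>();
        jobs.add(() -> true);  // stand-in for job1.waitForCompletion(true)
        jobs.add(() -> true);  // stand-in for job2.waitForCompletion(true)
        jobs.add(() -> false); // a failing job halts the chain here
        jobs.add(() -> true);  // never reached
        System.out.println(runAll(jobs)); // prints 2
    }
}
```

In a real driver each callable body would be `() -> jobN.waitForCompletion(true)`, typically with job N+1's input path pointing at job N's output path.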

If you are using an external cluster, you have to add the following information to your configuration:

// this should be like defined in your mapred-site.xml
conf.set("mapred.job.tracker", "jobtracker.com:50001"); 
// like defined in hdfs-site.xml
conf.set("fs.default.name", "hdfs://namenode.com:9000");

This should be no problem when hadoop-core.jar is in your application container's classpath. But I think you should put some kind of progress indicator on your web page, because it may take minutes to hours to complete a hadoop job ;)

For YARN (> Hadoop 2)

For YARN, the following configurations need to be set.

// this should be like defined in your yarn-site.xml
conf.set("yarn.resourcemanager.address", "yarn-manager.com:50001"); 

// framework is now "yarn", should be defined like this in mapred-site.xml
conf.set("mapreduce.framework.name", "yarn");

// like defined in hdfs-site.xml
conf.set("fs.default.name", "hdfs://namenode.com:9000");
