Scheduling a Spark Job in Java


Problem Description

I have a Spark job that reads an HBase table, performs some aggregations, and stores the data in MongoDB. Currently this job is run manually using the spark-submit script. I want to schedule it to run at a fixed interval.

How can I achieve this in Java?

Is there a library for this? Or can I do it with a Thread in Java?

Any suggestions appreciated!

Solution

If you still want to use spark-submit, I would rather use crontab (or something similar) to run a bash script.
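
For example, a crontab entry could call a small wrapper script around spark-submit; the schedule, paths, and spark-submit arguments below are only illustrative assumptions:

    # crontab entry: run the Spark job at minute 0 of every hour
    0 * * * * /opt/jobs/run-spark-job.sh >> /var/log/spark-job.log 2>&1

where run-spark-job.sh would be something like:

    #!/bin/bash
    # wrapper around spark-submit; adjust class, master and jar path for your job
    spark-submit \
      --class your.package.main.Class \
      --master local \
      pathToYourSparkApp.jar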

But if you need to run "spark-submit" from Java, you can take a look at the org.apache.spark.launcher package. With this approach you can start the application programmatically with SparkLauncher.

import java.io.IOException;

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

...

     public void startApacheSparkApplication() {
         try {
             SparkAppHandle handler = new SparkLauncher()      // the handle lets you monitor the job (see below)
                 .setAppResource("pathToYourSparkApp.jar")     // path to your packaged Spark application
                 .setMainClass("your.package.main.Class")      // main class of your Spark application
                 .setMaster("local")                           // Spark master; point at your cluster in production
                 .setConf(SparkLauncher.DRIVER_MEMORY, "2g")   // example setting; add whatever Spark conf you need
                 .startApplication();                          // <-- and start spark job app (runs asynchronously)
         } catch (IOException e) {
             throw new RuntimeException("Could not launch the Spark application", e);
         }
     }
...
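
If you also need to know when the launched job finishes or fails (for example, to avoid overlapping runs), the handle returned by startApplication() can be observed with a listener. A small sketch, reusing the same placeholder resource and class names as above:

    import java.io.IOException;

    import org.apache.spark.launcher.SparkAppHandle;
    import org.apache.spark.launcher.SparkLauncher;

    public class MonitoredSparkLaunch {

        public void launchAndMonitor() throws IOException {
            SparkAppHandle handle = new SparkLauncher()
                .setAppResource("pathToYourSparkApp.jar")
                .setMainClass("your.package.main.Class")
                .setMaster("local")
                .startApplication(new SparkAppHandle.Listener() {
                    @Override
                    public void stateChanged(SparkAppHandle h) {
                        // called on transitions such as CONNECTED, RUNNING, FINISHED, FAILED
                        System.out.println("Spark application state: " + h.getState());
                    }

                    @Override
                    public void infoChanged(SparkAppHandle h) {
                        // called when application information (e.g. the app id) changes
                    }
                });
        }
    }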

But your question was about a scheduling library. You could use a simple java.util.Timer with a java.util.TimerTask, but I would prefer the Quartz Job Scheduling Library - it is really popular (as far as I know, Spring uses the Quartz Scheduler too).

From the Spring documentation: "Spring also features integration classes for supporting scheduling with the Timer, part of the JDK since 1.3, and the Quartz Scheduler (http://quartz-scheduler.org)."
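
Before moving on to Quartz, here is a minimal sketch of the plain java.util.Timer approach mentioned above; the launchSparkJob() helper is only a placeholder for the SparkLauncher code shown earlier:

    import java.util.Timer;
    import java.util.TimerTask;

    public class SimpleSparkJobTimer {

        public static void main(String[] args) {
            Timer timer = new Timer("spark-job-timer");
            TimerTask task = new TimerTask() {
                @Override
                public void run() {
                    launchSparkJob(); // placeholder: wrap the SparkLauncher call shown above
                }
            };
            // first run immediately, then repeat every hour (period is in milliseconds)
            timer.scheduleAtFixedRate(task, 0L, 60L * 60L * 1000L);
        }

        private static void launchSparkJob() {
            // e.g. new SparkLauncher()...startApplication(), as in startApacheSparkApplication() above
        }
    }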

With Quartz you can set up cron-based scheduling, and for me it is easier to work with.

Just add the Maven dependency:

<!-- https://mvnrepository.com/artifact/org.quartz-scheduler/quartz -->
<dependency>
    <groupId>org.quartz-scheduler</groupId>
    <artifactId>quartz</artifactId>
    <version>2.2.3</version>
</dependency>

Create a Quartz job that launches the Spark application:

   public class SparkLauncherQuartzJob implements Job {
       @Override
       public void execute(JobExecutionContext context) {
           startApacheSparkApplication(); // launch the Spark application; assumes the method shown above is accessible here
       }
   }

Now create a trigger and schedule the job:

 // trigger that fires every hour, at minute 0
 Trigger trigger = TriggerBuilder.newTrigger()
             .withIdentity("sparkJob1Trigger", "sparkJobsGroup")
             .withSchedule(
                 CronScheduleBuilder.cronSchedule("0 0 * * * ?")) // Quartz cron: sec min hour day-of-month month day-of-week
             .build();

 // job detail wrapping the Quartz job class defined above
 JobDetail sparkQuartzJob = JobBuilder.newJob(SparkLauncherQuartzJob.class)
             .withIdentity("SparkLauncherQuartzJob", "sparkJobsGroup")
             .build();

 // these calls can throw SchedulerException
 Scheduler scheduler = new StdSchedulerFactory().getScheduler();
 scheduler.start();
 scheduler.scheduleJob(sparkQuartzJob, trigger);

Alternatively, if you have a Spring Boot application, you can use Spring's scheduling support to run a method very easily - just add @EnableScheduling to your configuration and then something like this:

@Scheduled(fixedRate = 300000) // fixedRate is in milliseconds: here, every 5 minutes
public void periodicalRunningSparkJob() {
    log.info("Periodic Spark job execution");
    startApacheSparkApplication();
}
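
And for reference, a minimal sketch of the configuration side, assuming a Spring Boot application (the class name is illustrative):

    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.scheduling.annotation.EnableScheduling;

    @SpringBootApplication
    @EnableScheduling // picks up @Scheduled methods such as periodicalRunningSparkJob() above
    public class SparkJobSchedulerApplication {

        public static void main(String[] args) {
            SpringApplication.run(SparkJobSchedulerApplication.class, args);
        }
    }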
