Scheduling a Spark Job in Java
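Problem description

I have a Spark job that reads an HBase table, performs some aggregations, and stores the data in MongoDB. Currently this job is run manually with the spark-submit script. I want to schedule it to run at a fixed interval.
How can I achieve this using Java?
Is there a library for this? Or can I do it with a Thread in Java?
Any suggestions are appreciated!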
Answer

If you still want to use spark-submit, I would rather use crontab or something similar and run a bash script.

But if you need to run "spark-submit" from Java, you can take a look at the package org.apache.spark.launcher. With this approach you can start the application programmatically with SparkLauncher.

Your question, though, was about a scheduling library. You can use a simple Timer, but I prefer Quartz. The Spring documentation says:

Spring also features integration classes for supporting scheduling with the Timer, part of the JDK since 1.3, and the Quartz Scheduler (http://quartz-scheduler.org) ....

With Quartz you can set cron schedules, and for me it is easier to work with Quartz: just add the Maven dependency, create a Quartz job that starts the Spark application, then create a trigger and schedule it.

Alternatively, if you have a Spring Boot application, you can very easily use its scheduling support to run a method periodically.
If spark-submit itself is enough, crontab or something similar running a bash script will do. Starting the application from Java with SparkLauncher looks like this:
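import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;
...
// Launches the Spark application as a child process and returns without waiting for it to finish.
public void startApacheSparkApplication() {
    try {
        SparkAppHandle handle = new SparkLauncher()
            .setAppResource("pathToYourSparkApp.jar")
            .setMainClass("your.package.main.Class")
            .setMaster("local")
            .setConf(...) // placeholder kept from the original: supply your own key and value
            .startApplication(); // <-- and start spark job app
    } catch (IOException e) {
        // startApplication() declares IOException; rethrow unchecked so callers stay simple
        throw new UncheckedIOException(e);
    }
}
...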
A simple Timer with a Date and a java.util.TimerTask from java.util would work for scheduling, but I prefer the Quartz Job Scheduling Library. It is really popular (as far as I know, Spring uses the Quartz Scheduler too).
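For comparison, the plain Timer variant mentioned above can be sketched with nothing but the JDK. This is a minimal sketch, not part of the original answer: the class name TimerSparkScheduler and the method scheduleAtFixedInterval are made up for illustration, and the Runnable passed in stands in for a call to startApacheSparkApplication().

```java
import java.util.Timer;
import java.util.TimerTask;

public class TimerSparkScheduler {

    /**
     * Runs the given job repeatedly on a single background thread.
     * The job is a stand-in for startApacheSparkApplication().
     */
    public static Timer scheduleAtFixedInterval(Runnable job, long initialDelayMs, long periodMs) {
        // Daemon thread: it will not keep the JVM alive on its own.
        // Pass false instead if this timer is the only thing running in the process.
        Timer timer = new Timer("spark-job-timer", true);
        timer.scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                try {
                    job.run();
                } catch (RuntimeException e) {
                    // An exception escaping run() would terminate the timer thread,
                    // so report it and let the next execution proceed.
                    System.err.println("Spark job launch failed: " + e.getMessage());
                }
            }
        }, initialDelayMs, periodMs);
        return timer;
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical placeholder for the real launch code.
        Runnable launchSparkJob = () -> System.out.println("Launching Spark job...");
        Timer timer = scheduleAtFixedInterval(launchSparkJob, 0L, 60L * 60L * 1000L); // hourly
        Thread.sleep(1000L); // give the first execution time to run in this demo
        timer.cancel();
    }
}
```

A Timer only supports a start time (or delay) plus a fixed period. Calendar-style schedules such as "at the top of every hour" are where the cron triggers in Quartz become more convenient.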
Just add the Maven dependency:

<!-- https://mvnrepository.com/artifact/org.quartz-scheduler/quartz -->
<dependency>
    <groupId>org.quartz-scheduler</groupId>
    <artifactId>quartz</artifactId>
    <version>2.2.3</version>
</dependency>
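Create the Spark Quartz job:

import org.quartz.*;
import org.quartz.impl.StdSchedulerFactory;
...
public class SparkLauncherQuartzJob implements Job {
    // Quartz calls execute() each time the trigger fires
    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        startApacheSparkApplication();
    }
    ...
}

Now create a trigger and schedule it:

// trigger runs every hour
Trigger trigger = TriggerBuilder.newTrigger()
    .withIdentity("sparkJob1Trigger", "sparkJobsGroup")
    .withSchedule(
        CronScheduleBuilder.cronSchedule("0 0 * * * ?")) // second 0, minute 0 of every hour
    .build();
JobDetail sparkQuartzJob = JobBuilder.newJob(SparkLauncherQuartzJob.class)
    .withIdentity("SparkLauncherQuartzJob", "sparkJobsGroup")
    .build();
Scheduler scheduler = new StdSchedulerFactory().getScheduler();
scheduler.start();
scheduler.scheduleJob(sparkQuartzJob, trigger);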
With Spring Boot, just add @EnableScheduling to the configuration and write something like this:

@Scheduled(fixedRate = 300000) // milliseconds, i.e. every 5 minutes
public void periodicalRunningSparkJob() {
    log.info("Periodic execution of the Spark job");
    startApacheSparkApplication();
}