Launching Apache Spark SQL jobs from multi-threaded driver


Problem Description

I want to pull data from about 1500 remote Oracle tables with Spark, and I want a multi-threaded application that picks up one table per thread (or maybe 10 tables per thread) and launches a Spark job to read from the respective tables.

From the official Spark site https://spark.apache.org/docs/latest/job-scheduling.html it's clear that this can work:

...cluster managers that Spark runs on provide facilities for scheduling across applications. Second, within each Spark application, multiple "jobs" (Spark actions) may be running concurrently if they were submitted by different threads. This is common if your application is serving requests over the network. Spark includes a fair scheduler to schedule resources within each SparkContext.
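
For reference, the fair scheduler mentioned in that quote is enabled with a single configuration setting, and each thread can opt its jobs into a named pool through a local property. A minimal sketch (the app name and pool name here are made up, not from the question):

```scala
import org.apache.spark.sql.SparkSession

// Switch the in-application scheduler from the default FIFO to FAIR.
val spark = SparkSession.builder()
  .appName("fair-scheduling-demo")          // hypothetical app name
  .config("spark.scheduler.mode", "FAIR")
  .getOrCreate()

// Jobs submitted from the current thread now go to this named pool;
// pool weights and minShare can be defined in a fairscheduler.xml file.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "oracle-extracts")
```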

However, you might have noticed in the SO post Concurrent job Execution in Spark that there was no accepted answer to this similar question, and that the most upvoted answer starts with:

This is not really in the spirit of Spark

  1. Everyone knows this is not "the spirit of Spark"
  2. Who cares what the spirit of Spark is? That doesn't really mean anything

Has anyone gotten something like this to work before? Did you have to do anything special? Just wanted some pointers before I waste a lot of work hours prototyping. I would really appreciate any help on this!

Recommended Answer

The SparkContext is thread-safe, so it's possible to call it from many threads in parallel. (I am doing this in production.)
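
As an illustration of that claim, here is a minimal sketch in which several driver threads each submit an independent JDBC read through one shared SparkSession; the JDBC URL, credentials, table names, and output path are all placeholders, not details from the answer:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("threaded-driver").getOrCreate()

// Placeholder names standing in for the ~1500 remote Oracle tables.
val tables = Seq("SCHEMA1.ORDERS", "SCHEMA1.CUSTOMERS", "SCHEMA1.INVOICES")

val jobs = tables.map { table =>
  Future {
    // Each Future runs on its own thread; the write at the end is the
    // action that actually launches a Spark job for this table.
    spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL") // placeholder
      .option("dbtable", table)
      .option("user", "etl_user")                            // placeholder
      .option("password", "etl_pass")                        // placeholder
      .load()
      .write.mode("overwrite").parquet(s"/landing/$table")
  }
}

// Block the main thread until every per-table job has finished.
jobs.foreach(f => Await.result(f, Duration.Inf))
```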

One thing to be aware of is to limit the number of threads you have running (one way to enforce this is sketched after the list), because:
1. the executor memory is shared between all threads, and you might get OOMs or constantly swap memory in and out of the cache
2. the CPU is limited, so having more tasks than cores won't yield any improvement
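
One way to enforce that limit, sketched under the assumption that roughly four concurrent jobs suit the cluster, is to run the Futures from the previous snippet on a bounded pool instead of the global execution context:

```scala
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext

// At most 4 table-extraction jobs are in flight at any one time; the
// remaining tables queue until a pool thread frees up.
implicit val boundedEc: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))
```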
