How to read and write multiple tables in parallel in Spark?


Problem Description

In my Spark application, I am trying to read multiple tables from an RDBMS, do some data processing, and then write multiple tables to another RDBMS, as follows (in Scala):

val reading1 = sqlContext.load("jdbc", Map("url" -> myurl1, "dbtable" -> mytable1))
val reading2 = sqlContext.load("jdbc", Map("url" -> myurl1, "dbtable" -> mytable2))
val reading3 = sqlContext.load("jdbc", Map("url" -> myurl1, "dbtable" -> mytable3))

// data processing
// ..............

myDF1.write.mode("append").jdbc(myurl2, outtable1, new java.util.Properties)
myDF2.write.mode("append").jdbc(myurl2, outtable2, new java.util.Properties)
myDF3.write.mode("append").jdbc(myurl2, outtable3, new java.util.Properties)

I understand that reading from a single table can be parallelized using partitions. However, the read operations for reading1, reading2, and reading3 appear to run sequentially, and so do the write operations for myDF1, myDF2, and myDF3.
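
For reference, such a partitioned single-table read can be expressed with the same load API as above. This is only a sketch: the partition column "id", the bounds, and the partition count are placeholders that would depend on the actual table.

val reading1Partitioned = sqlContext.load("jdbc", Map(
  "url"             -> myurl1,
  "dbtable"         -> mytable1,
  "partitionColumn" -> "id",       // numeric column to split on (assumed to exist)
  "lowerBound"      -> "1",        // smallest value of the partition column
  "upperBound"      -> "1000000",  // largest value of the partition column
  "numPartitions"   -> "8"         // number of concurrent JDBC partitions
))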

How can I read from multiple tables (mytable1, mytable2, mytable3) in parallel? And also write to multiple tables in parallel (I assume the same logic applies)?

Recommended Answer

You can set the scheduling mode to FAIR; it should then run the tasks in parallel. See https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application. The relevant passage from that page:

Scheduling Within an Application

Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By "job", in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).

By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into "stages" (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.

Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a "round robin" fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.

val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)
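
With FAIR mode enabled, the key point from the quoted documentation is that the jobs must be submitted from separate threads. Below is a minimal sketch of doing that with Scala Futures, reusing the writes from the question; the global execution context and the blocking Await are choices made here for brevity, not part of the original answer.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Each write is a separate Spark action; launching each one from its own
// thread lets the scheduler run the resulting jobs concurrently.
val writes = Seq(
  Future { myDF1.write.mode("append").jdbc(myurl2, outtable1, new java.util.Properties) },
  Future { myDF2.write.mode("append").jdbc(myurl2, outtable2, new java.util.Properties) },
  Future { myDF3.write.mode("append").jdbc(myurl2, outtable3, new java.util.Properties) }
)

// Block until all three writes have finished.
Await.result(Future.sequence(writes), Duration.Inf)

Optionally, each thread can assign its jobs to a named fair-scheduler pool by calling sc.setLocalProperty("spark.scheduler.pool", "somePool") before triggering the action; the property is per-thread, and the pool name here is only a placeholder.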

