How to read and write multiple tables in parallel in Spark?


Question

In my Spark application, I am trying to read multiple tables from an RDBMS, do some data processing, and then write multiple tables to another RDBMS, as follows (in Scala):

val reading1 = sqlContext.load("jdbc", Map("url" -> myurl1, "dbtable" -> mytable1))
val reading2 = sqlContext.load("jdbc", Map("url" -> myurl1, "dbtable" -> mytable2))
val reading3 = sqlContext.load("jdbc", Map("url" -> myurl1, "dbtable" -> mytable3))

// data processing
// ..............

myDF1.write.mode("append").jdbc(myurl2, outtable1, new java.util.Properties)
myDF2.write.mode("append").jdbc(myurl2, outtable2, new java.util.Properties)
myDF3.write.mode("append").jdbc(myurl2, outtable3, new java.util.Properties)

I understand that reading from one table can be parallelized using partitions. However, the read operations for reading1, reading2, and reading3 appear to run sequentially, as do the write operations for myDF1, myDF2, and myDF3.
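For reference, a partitioned read of a single table with the same sqlContext.load("jdbc", ...) API used above looks roughly like the sketch below; the partitioning column, bounds, and partition count are hypothetical placeholders, not values from the original question.

// Sketch of a partitioned single-table JDBC read (Spark 1.x style).
// "id", the bounds and numPartitions are hypothetical and must match your schema.
val partitionedReading = sqlContext.load("jdbc", Map(
  "url"             -> myurl1,
  "dbtable"         -> mytable1,
  "partitionColumn" -> "id",
  "lowerBound"      -> "1",
  "upperBound"      -> "1000000",
  "numPartitions"   -> "8"))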

How can I read from the multiple tables (mytable1, mytable2, mytable3) in parallel? And how can I write to multiple tables in parallel (I assume the same logic applies)?

Recommended Answer

You can set the scheduler mode to FAIR; it should then run the jobs in parallel. See https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application

Scheduling Within an Application

Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By "job", in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).

By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into "stages" (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.

Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a "round robin" fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.

val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)
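FAIR scheduling only helps if the jobs are actually submitted concurrently. Below is a minimal sketch, not part of the original answer, of how the read-process-write pipelines could be submitted from separate threads using Scala Futures; it reuses the sqlContext, URLs, and table names from the question, and everything else (the copyTable helper, the omitted processing step) is an assumption.

// Sketch only: submit each pipeline from its own thread so the FAIR scheduler
// can run the resulting Spark jobs concurrently.
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical helper: read one input table, process it, write one output table.
def copyTable(inTable: String, outTable: String): Future[Unit] = Future {
  // Each Future runs on its own thread, so the write action below is
  // submitted as a separate, concurrent Spark job.
  val df = sqlContext.load("jdbc", Map("url" -> myurl1, "dbtable" -> inTable))
  // ... per-table data processing would go here ...
  df.write.mode("append").jdbc(myurl2, outTable, new java.util.Properties)
}

val jobs = Seq(
  copyTable(mytable1, outtable1),
  copyTable(mytable2, outtable2),
  copyTable(mytable3, outtable3))

// Block until all three pipelines have finished.
Await.result(Future.sequence(jobs), Duration.Inf)

Without FAIR mode the jobs would still be submitted concurrently, but under the default FIFO scheduling a large job at the head of the queue may delay the others, as the quoted documentation notes.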

