Why is my Spark App running in only 1 executor?


Problem description

I'm still fairly new to Spark, but I have been able to create the Spark app I need to reprocess data from our SQL Server using JDBC drivers (we are removing expensive stored procedures). The app loads a few tables from SQL Server via JDBC into dataframes, then I do a few joins, a group, and a filter, and finally reinsert the results via JDBC into a different table. All of this executes just fine on Spark EMR in Amazon Web Services on an m3.xlarge with 2 cores in around a minute.
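For context, a minimal sketch of that kind of pipeline in Scala is shown below. The connection URL, table names, and column names are hypothetical placeholders, not taken from the original app, and sqlContext is assumed to be available as in the answer's snippet further down:

import java.util.Properties
import org.apache.spark.sql.functions.col

// Hypothetical SQL Server connection details
val jdbcUrl = "jdbc:sqlserver://<host>:1433;databaseName=<db>"
val props = new Properties()
props.setProperty("user", "<user>")
props.setProperty("password", "<password>")
props.setProperty("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

// Load a few tables from SQL Server into dataframes
val orders  = sqlContext.read.jdbc(jdbcUrl, "dbo.Orders", props)
val clients = sqlContext.read.jdbc(jdbcUrl, "dbo.Clients", props)

// A few joins, a group, and a filter
val result = orders
  .join(clients, orders("ClientId") === clients("Id"))
  .groupBy(clients("Id"))
  .count()
  .where(col("count") > 10)

// Reinsert the results via JDBC into a different table
result.write.mode("append").jdbc(jdbcUrl, "dbo.OrderCounts", props)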

My questions are the following: 1. Right now I have 1 master and 2 core nodes on the cluster, but every time I launch a new step, the history server shows that only 1 executor is being used: 2 executors are listed, the driver with no usage at all, and an executor with id 1 processing around 1410 tasks. I'm completely unsure how to proceed.

2. Also, this is specific to AWS, but I didn't want to post 2 questions since they are somewhat related: is there any way I can run 2 steps at the same time, meaning 2 spark-submits of this process running concurrently? We run this process many times a day (it processes client data). I know I can launch a new cluster per step, but I want the processing to be fast, and launching a new cluster takes too long. Thanks!

Answer

First question:

I am not sure if this is your case, but something similar happened to us and maybe it can help.

If you are reading from the JDBC source using sqlContext.read.format("jdbc").load() (or similar), the resulting dataframe is not partitioned by default. So, if that is your case, applying transformations to the resulting dataframe without partitioning it first means only one executor will be able to process it. If that is not your case, the following solution will probably not solve your problem.
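As a quick sanity check (not part of the original answer; the table name and connection details below are placeholders), you can look at how many partitions the dataframe has right after the load:

// Plain JDBC read with no partitioning options
val unpartitioned = sqlContext.read.format("jdbc")
  .options(Map(
    "url"      -> "jdbc:sqlserver://<host>:1433;databaseName=<db>",
    "dbtable"  -> "dbo.Orders",
    "user"     -> "<user>",
    "password" -> "<password>"))
  .load()

// Without partitioning options the JDBC reader returns a single partition,
// so all downstream tasks end up on one executor
println(unpartitioned.rdd.getNumPartitions)  // typically prints 1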

So, our solution was to create a numeric column in the data with values from 1 to 32 (our desired number of partitions) and use it as the partitioning column by setting the partitioning options of the JDBC reader (partitionColumn, lowerBound, upperBound, numPartitions):

// Base JDBC connection options (url, dbtable, user, password, driver, ...)
val connectionOptions = Map[String, String] (... <connection options> ...)

// Add the partitioning options: split the read on the numeric column we created,
// producing 32 partitions covering the value range 1..32
val options = connectionOptions ++ Map[String, String] (
    "partitionColumn" -> "column name",
    "lowerBound" -> "1",
    "upperBound" -> "32",
    "numPartitions" -> "32"
)

// The resulting dataframe has 32 partitions, so its tasks can be spread across executors
val df = sqlContext.read.format("jdbc").options(options).load()

So, with this approach, not only could the read itself run in parallel (which really improved performance and avoided OOM errors), but the resulting dataframe was also partitioned and processed in parallel for all subsequent transformations.
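As a follow-up not covered by the original answer: since the dataframe is now partitioned, writing the results back over JDBC is also parallelized, because each partition is written by its own task over its own connection. A sketch, with a hypothetical target table and connection details:

import java.util.Properties

val writeProps = new Properties()
writeProps.setProperty("user", "<user>")
writeProps.setProperty("password", "<password>")
writeProps.setProperty("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

// Each of df's 32 partitions inserts its own rows, so the final write step
// benefits from the same parallelism as the partitioned read
df.write
  .mode("append")
  .jdbc("jdbc:sqlserver://<host>:1433;databaseName=<db>", "dbo.Results", writeProps)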

I hope this helps.

