Optimize/tune settings for a Spark job that uses groupByKey and reduceGroups


Problem description


Hi, I am trying to see if there are any settings, such as executor memory, cores, or shuffle partitions, or anything else we can think of, that might speed up a job which includes union, groupByKey, and reduceGroups operations.

I understand these are expensive operations to perform, and the job currently takes 5 hours to finish.

Example:

  .union(transitive)
  .union(family)
  .groupByKey(_.key)
  .reduceGroups((left, right) =>

The spark-submit command

"Step5_Spark_Command": "command-runner.jar,spark-submit,--class,com.ms.eng.link.modules.linkmod.Links,--name,\\\"Links\\\",--master,yarn,--deploy-mode,client,--executor-memory,32G,--executor-cores,4,--conf,spark.sql.shuffle.partitions=2020,/home/hadoop/linking.jar,jobId=#{myJobId},environment=prod",

The function

val family =
  generateFamilyLinks(references, superNodes.filter(_.linkType == FAMILY))
    .checkpoint(eager = true)

// Union all link datasets, merge records that share a key, then dedup the
// combined references.
direct
  .union(reciprocal)
  .union(transitive)
  .union(family)
  .groupByKey(_.key)
  .reduceGroups((left, right) =>
    left.copy(
      contributingReferences = left.contributingReferences ++ right.contributingReferences,
      linkTypes = left.linkTypes ++ right.linkTypes,
      contexts = left.contexts ++ right.contexts
    )
  )
  .map(group =>
    group._2.copy(
      contributingReferences = ArrayUtil.dedup(group._2.contributingReferences, _.key)
    )
  )
Solution

Looking at your spark-submit command, I can see you're running Spark on YARN. But may I ask why in client mode? In client mode the driver runs on the machine that submits the job rather than inside the cluster, so your application is not leveraging the cluster's resources fully. So use --deploy-mode cluster instead.
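As a sketch, the suggested change applied to the command from the question would look like the following; only the --deploy-mode flag changes, everything else is kept as in the original:

  spark-submit \
    --class com.ms.eng.link.modules.linkmod.Links \
    --name "Links" \
    --master yarn \
    --deploy-mode cluster \
    --executor-memory 32G \
    --executor-cores 4 \
    --conf spark.sql.shuffle.partitions=2020 \
    /home/hadoop/linking.jar jobId=#{myJobId} environment=prod

Note that in cluster mode the driver runs in a YARN container, so its logs show up in the YARN application logs rather than in the console of the submitting machine.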

For hardware provisioning, use this link.

Hope this helps in scaling your app.
