Spark task duration difference

Problem Description

I'm running an application that loads data (.csv) from S3 into DataFrames and then registers those DataFrames as temp tables. After that, I use SparkSQL to join those tables and finally write the result into a DB. The issue that is currently a bottleneck for me is that the tasks are not evenly split, so I get no benefit from parallelization across the multiple nodes inside the cluster. More precisely, this is the distribution of task durations in the problematic stage (screenshot: task duration distribution). Is there a way for me to enforce a more balanced distribution? Maybe by manually writing map/reduce functions? Unfortunately, this stage has 6 more tasks still running (1.7 hours atm), which will show an even greater deviation.
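For concreteness, here is a minimal sketch of the pipeline described above. It is not the asker's actual code: the S3 paths, table names, join columns, and JDBC settings are all hypothetical placeholders, and it assumes Spark 2.x, where createOrReplaceTempView supersedes the older registerTempTable.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("s3-csv-join")
      .getOrCreate()

    // Load the CSVs from S3 (paths are placeholders).
    val orders = spark.read.option("header", "true").csv("s3a://my-bucket/orders/")
    val users  = spark.read.option("header", "true").csv("s3a://my-bucket/users/")

    // Register the DataFrames as temp tables so SparkSQL can join them.
    orders.createOrReplaceTempView("orders")
    users.createOrReplaceTempView("users")

    val joined = spark.sql(
      """SELECT o.*, u.country
        |FROM orders o
        |JOIN users u ON o.user_id = u.id""".stripMargin)

    // Write the joined result to a database over JDBC (connection details are placeholders).
    joined.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/mydb")
      .option("dbtable", "joined_result")
      .option("user", "writer")
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .mode("append")
      .save()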

Recommended Answer

There are two likely possibilities: one is under your control and, unfortunately, one is likely not.

  • Skewed data. Check that the partitions are of relatively similar size, say within a factor of three or four (one way to measure this is shown in the sketch after this list).
  • Inherent variability of Spark task runtimes. I have seen large delays from straggler tasks on Spark Standalone, Yarn, and Mesos without an apparent reason. The symptoms are:
    • extended periods (minutes) where little or no CPU or disk activity occurs on the nodes hosting the straggler tasks
    • no apparent correlation of data size to the stragglers
    • different nodes/workers may experience the delays on subsequent runs of the same job
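
To check the first possibility concretely, you can count the rows in each partition and compare the extremes; if the largest partition is several times the smallest, an explicit repartition on the join key is a common mitigation. This is a sketch under the same hypothetical names as the pipeline above (joined and its user_id column), not part of the original answer.

    import org.apache.spark.sql.functions.spark_partition_id

    // Count rows per partition; a skewed stage shows up as a few huge outliers.
    joined
      .groupBy(spark_partition_id().alias("partition"))
      .count()
      .orderBy("count")
      .show(200, truncate = false)

    // Hash-partitioning on the join key gives a more even layout when many
    // distinct keys are unevenly packed; a single hot key needs salting
    // instead. The partition count of 200 is an arbitrary example.
    val rebalanced = joined.repartition(200, joined("user_id"))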

One thing to check: run hdfs dfsadmin -report and hdfs fsck to see whether HDFS is healthy.
