Spark task duration difference

Problem Description

I'm running an application that loads data (.csv) from S3 into DataFrames and then registers those DataFrames as temp tables. After that, I use SparkSQL to join those tables and finally write the result into a DB. The issue that is currently a bottleneck for me is that the tasks are not evenly split, so I get no benefit from parallelization across the multiple nodes inside the cluster. More precisely, this is the distribution of task durations in the problematic stage (screenshot: task duration distribution). Is there a way for me to enforce a more balanced distribution? Maybe by manually writing map/reduce functions? Unfortunately, this stage has 6 more tasks still running (1.7 hours atm), which will show an even greater deviation.
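For concreteness, here is a minimal sketch of the pipeline described above. It is not the asker's actual code: the S3 paths, table names, join columns, and JDBC settings are all hypothetical placeholders, and it assumes Spark 2.x, where createOrReplaceTempView supersedes the older registerTempTable.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("s3-csv-join")
      .getOrCreate()

    // Load the CSVs from S3 (paths are placeholders).
    val orders = spark.read.option("header", "true").csv("s3a://my-bucket/orders/")
    val users  = spark.read.option("header", "true").csv("s3a://my-bucket/users/")

    // Register the DataFrames as temp tables so SparkSQL can join them.
    orders.createOrReplaceTempView("orders")
    users.createOrReplaceTempView("users")

    val joined = spark.sql(
      """SELECT o.*, u.country
        |FROM orders o
        |JOIN users u ON o.user_id = u.id""".stripMargin)

    // Write the joined result to a database over JDBC (connection details are placeholders).
    joined.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/mydb")
      .option("dbtable", "joined_result")
      .option("user", "writer")
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .mode("append")
      .save()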

Recommended Answer

There are two likely possibilities: one is under your control and, unfortunately, one is likely not.

  • Skewed data. Check that the partitions are of relatively similar size, say within a factor of three or four (one way to measure this is shown in the sketch after this list).
  • Inherent variability of Spark task runtimes. I have seen large delays from straggler tasks on Spark Standalone, Yarn, and Mesos without an apparent reason. The symptoms are:
    • extended periods (minutes) where little or no CPU or disk activity occurs on the nodes hosting the straggler tasks
    • no apparent correlation of data size to the stragglers
    • different nodes/workers may experience the delays on subsequent runs of the same job
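
To check the first possibility concretely, you can count the rows in each partition and compare the extremes; if the largest partition is several times the smallest, an explicit repartition on the join key is a common mitigation. This is a sketch under the same hypothetical names as the pipeline above (joined and its user_id column), not part of the original answer.

    import org.apache.spark.sql.functions.spark_partition_id

    // Count rows per partition; a skewed stage shows up as a few huge outliers.
    joined
      .groupBy(spark_partition_id().alias("partition"))
      .count()
      .orderBy("count")
      .show(200, truncate = false)

    // Hash-partitioning on the join key gives a more even layout when many
    // distinct keys are unevenly packed; a single hot key needs salting
    // instead. The partition count of 200 is an arbitrary example.
    val rebalanced = joined.repartition(200, joined("user_id"))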

One thing to check: run hdfs dfsadmin -report and hdfs fsck to see whether HDFS is healthy.
