What to do with "WARN TaskSetManager: Stage contains a task of very large size"?

Question

I use Spark 1.6.1.

My Spark application reads more than 10000 parquet files stored in S3.

val df = sqlContext.read.option("mergeSchema", "true").parquet(myPaths: _*)

myPaths is an Array[String] that contains the paths of the 10000 parquet files. Each path looks like s3n://bucketname/blahblah.parquet

Spark logs a warning like the one below.

WARN TaskSetManager: Stage 4 contains a task of very large size (108KB). The maximum recommended task size is 100KB.

Spark managed to run and finish the job anyway, but I guess this can slow down the Spark processing job.

Does anybody have a good suggestion for this problem?

Solution

The issue is that your dataset is not evenly distributed across partitions and hence some partitions have more data than others (and so some tasks compute larger results).

By default, Spark SQL assumes 200 partitions for shuffles, as set by the spark.sql.shuffle.partitions property (see Other Configuration Options):

spark.sql.shuffle.partitions (default: 200) Configures the number of partitions to use when shuffling data for joins or aggregations.
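As a rough illustration, the property can be read and changed at runtime through SQLContext in Spark 1.6. The object name, the app name, and the value 400 are placeholders chosen for this sketch, not recommendations, and the code assumes it is launched with spark-submit so that the master is supplied externally.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical object name, used only for this sketch.
object ShufflePartitionsSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: the master URL comes from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-partitions-sketch"))
    val sqlContext = new SQLContext(sc)

    // Read the current setting, falling back to the documented default of 200.
    val before = sqlContext.getConf("spark.sql.shuffle.partitions", "200")
    println(s"spark.sql.shuffle.partitions before: $before")

    // 400 is an arbitrary example value; tune it for your own data volume.
    sqlContext.setConf("spark.sql.shuffle.partitions", "400")
    println("spark.sql.shuffle.partitions after: " +
      sqlContext.getConf("spark.sql.shuffle.partitions"))

    sc.stop()
  }
}
```

Note that this setting only affects stages that shuffle data (joins and aggregations); it does not change how many partitions the initial parquet read produces.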

A solution is to coalesce or repartition your Dataset after you've read parquet files (and before executing an action).
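A minimal sketch of that suggestion, assuming the Spark 1.6 DataFrame API and the read call from the question. The object name, the app name, and targetPartitions are hypothetical, and the parquet paths are taken from the command-line arguments purely for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical object name, used only for this sketch.
object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: the master URL comes from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("repartition-sketch"))
    val sqlContext = new SQLContext(sc)

    // Assumption: parquet paths are passed as program arguments.
    val myPaths: Array[String] = args
    val df = sqlContext.read.option("mergeSchema", "true").parquet(myPaths: _*)
    println(s"partitions after read: ${df.rdd.partitions.length}")

    // Arbitrary example value; pick one that suits your cluster and data size.
    val targetPartitions = 200

    // repartition performs a full shuffle and spreads rows across partitions.
    val repartitioned = df.repartition(targetPartitions)

    // coalesce merges existing partitions without a full shuffle,
    // so it can only reduce the partition count.
    val coalesced = df.coalesce(targetPartitions)

    println(s"after repartition: ${repartitioned.rdd.partitions.length}")
    println(s"after coalesce: ${coalesced.rdd.partitions.length}")

    sc.stop()
  }
}
```

Whether repartition or coalesce is the better fit depends on whether the cost of a full shuffle is acceptable for the job; only one of the two would normally be used before the action.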

Use explain or web UI to review execution plans.
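For example, a plan can be printed from code along these lines. This is a sketch under the same assumptions as the question (Spark 1.6, parquet input); the object name and app name are hypothetical, and the paths are taken from the command-line arguments for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical object name, used only for this sketch.
object ExplainSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: the master URL comes from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("explain-sketch"))
    val sqlContext = new SQLContext(sc)

    // Assumption: parquet paths are passed as program arguments.
    val df = sqlContext.read.option("mergeSchema", "true").parquet(args: _*)

    // extended = true prints the logical plans as well as the physical plan.
    df.explain(true)

    sc.stop()
  }
}
```

While the application is running, the web UI is served by the driver, on port 4040 by default, and shows the stages and tasks of each job.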


The warning is a hint to optimize your query so that the more efficient result fetch is used (see TaskSetManager).

With the warning, TaskScheduler (which runs on the driver) fetches the result values using the less efficient IndirectTaskResult approach (as you can see in the code).
