What to do with "WARN TaskSetManager: Stage contains a task of very large size"?
Problem description
I use Spark 1.6.1.
My Spark application reads more than 10000 Parquet files stored in S3.
val df = sqlContext.read.option("mergeSchema", "true").parquet(myPaths: _*)
myPaths is an Array[String] that contains the paths of the 10000 Parquet files. Each path looks like s3n://bucketname/blahblah.parquet
Spark emits a warning like the one below.
WARN TaskSetManager: Stage 4 contains a task of very large size (108KB). The maximum recommended task size is 100KB.
Spark managed to run and finish the job anyway, but I guess this can slow down the Spark job.
Does anybody have a good suggestion for this problem?
The issue is that your dataset is not evenly distributed across partitions, so some partitions hold more data than others (and so some tasks compute larger results).
By default, Spark SQL uses 200 partitions for shuffles, as set by the spark.sql.shuffle.partitions property (see Other Configuration Options):
spark.sql.shuffle.partitions (default: 200) Configures the number of partitions to use when shuffling data for joins or aggregations.
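The property can be changed at runtime through the SQLContext. The following is a minimal sketch for Spark 1.6; the value 400, the application name, and the local master fallback are assumptions made for illustration, not recommendations.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ShufflePartitionsSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: fall back to a local master when none is supplied by spark-submit.
    val conf = new SparkConf()
      .setAppName("shuffle-partitions-sketch")
      .setIfMissing("spark.master", "local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Illustrative value only; the right number depends on data volume and cluster size.
    sqlContext.setConf("spark.sql.shuffle.partitions", "400")

    // Read the setting back to confirm it took effect.
    println(sqlContext.getConf("spark.sql.shuffle.partitions"))

    sc.stop()
  }
}
```

This setting only affects the number of partitions produced by shuffles (joins and aggregations); it does not change how many partitions the initial Parquet read produces.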
A solution is to coalesce or repartition your Dataset after you've read the Parquet files (and before executing an action).
Use explain or the web UI to review execution plans.
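The repartition and explain steps suggested above could look roughly like the sketch below, written against the Spark 1.6 DataFrame API. The target of 200 partitions and passing the paths as program arguments are assumptions for illustration; whether this removes the warning in a given job needs to be checked in the logs.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: fall back to a local master when none is supplied by spark-submit.
    val conf = new SparkConf()
      .setAppName("repartition-sketch")
      .setIfMissing("spark.master", "local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Hypothetical input: the Parquet paths are taken from the program arguments.
    val myPaths: Array[String] = args
    val df = sqlContext.read.option("mergeSchema", "true").parquet(myPaths: _*)
    println(s"Partitions after read: ${df.rdd.partitions.length}")

    // Assumed target; tune it for the actual data volume and cluster.
    val targetPartitions = 200

    // repartition performs a full shuffle and spreads rows evenly.
    val balanced = df.repartition(targetPartitions)
    // coalesce avoids a shuffle but can only reduce the partition count:
    // val narrowed = df.coalesce(targetPartitions)

    // Print the logical and physical plans before running an action.
    balanced.explain(true)
    println(s"Row count: ${balanced.count()}")

    sc.stop()
  }
}
```

repartition costs a shuffle but evens out the partitions, while coalesce is cheaper and only merges existing partitions, so which one fits depends on how unbalanced the input is.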
The warning is a hint to optimize your query so that the more efficient way of fetching results is used (see TaskSetManager).
When the warning is shown, TaskScheduler (which runs on the driver) fetches the result values using the less efficient IndirectTaskResult approach (as you can see in the code).