Spark MLlib - trainImplicit warning


Question

I keep seeing these warnings when using trainImplicit:

WARN TaskSetManager: Stage 246 contains a task of very large size (208 KB).
The maximum recommended task size is 100 KB.

And then the task size starts to increase. I tried to call repartition on the input RDD, but the warnings are the same.
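For context, the call in question looks roughly like this (a sketch; the rawEvents RDD and the partition count of 200 are illustrative, not from the original post):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// rawEvents is a hypothetical RDD[(Int, Int, Double)] of
// (user, product, interaction count) implicit-feedback triples.
val ratings = rawEvents.map { case (user, product, count) =>
  Rating(user, product, count)
}

// The repartition attempt mentioned above;
// signature: trainImplicit(ratings, rank, iterations, lambda, alpha)
val model = ALS.trainImplicit(ratings.repartition(200), 10, 10, 0.01, 1.0)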

All these warnings come from the ALS iterations, from flatMap and also from aggregate. For instance, here is the origin of the stage where the flatMap shows these warnings (with Spark 1.3.0, but they are also shown in Spark 1.3.1):

org.apache.spark.rdd.RDD.flatMap(RDD.scala:296)
org.apache.spark.ml.recommendation.ALS$.org$apache$spark$ml$recommendation$ALS$$computeFactors(ALS.scala:1065)
org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:530)
org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:527)
scala.collection.immutable.Range.foreach(Range.scala:141)
org.apache.spark.ml.recommendation.ALS$.train(ALS.scala:527)
org.apache.spark.mllib.recommendation.ALS.run(ALS.scala:203)

And from the aggregate:

org.apache.spark.rdd.RDD.aggregate(RDD.scala:968)
org.apache.spark.ml.recommendation.ALS$.computeYtY(ALS.scala:1112)
org.apache.spark.ml.recommendation.ALS$.org$apache$spark$ml$recommendation$ALS$$computeFactors(ALS.scala:1064)
org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:538)
org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:527)
scala.collection.immutable.Range.foreach(Range.scala:141)
org.apache.spark.ml.recommendation.ALS$.train(ALS.scala:527)
org.apache.spark.mllib.recommendation.ALS.run(ALS.scala:203)

Answer

A similar problem was described on the Apache Spark mailing list - http://apache-spark-user-list.1001560.n3.nabble.com/Large-Task-Size-td9539.html

I think you can try to play with the number of partitions (using the repartition() method), depending on how many hosts, how much RAM, and how many CPUs you have.
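Besides repartitioning the input RDD, MLlib's ALS also lets you set the number of blocks explicitly, which controls how users and products are grouped inside the ALS stages themselves. A minimal sketch using the builder API (the block count of 100 and the other hyperparameters are arbitrary examples):

import org.apache.spark.mllib.recommendation.ALS

// setBlocks determines the internal ALS partitioning, which is what
// drives the size of the flatMap/aggregate tasks in the warnings above.
val model = new ALS()
  .setImplicitPrefs(true)
  .setRank(10)
  .setIterations(10)
  .setLambda(0.01)
  .setAlpha(1.0)
  .setBlocks(100)
  .run(ratings)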

Also try to investigate all steps via the Web UI, where you can see the number of stages, memory usage in each stage, and data locality.

Or just ignore these warnings, as long as everything works correctly and fast.

This notification is hard-coded in Spark (scheduler/TaskSetManager.scala):

      if (serializedTask.limit > TaskSetManager.TASK_SIZE_TO_WARN_KB * 1024 &&
          !emittedTaskSizeWarning) {
        emittedTaskSizeWarning = true
        logWarning(s"Stage ${task.stageId} contains a task of very large size " +
          s"(${serializedTask.limit / 1024} KB). The maximum recommended task size is " +
          s"${TaskSetManager.TASK_SIZE_TO_WARN_KB} KB.")
      }

with the threshold constant defined as:

private[spark] object TaskSetManager {
  // The user will be warned if any stages contain a task that has a serialized size greater than
  // this.
  val TASK_SIZE_TO_WARN_KB = 100
} 
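Since TASK_SIZE_TO_WARN_KB is a hard-coded val rather than a configuration property, the threshold itself cannot be changed. If the warning is just noise, one option is to raise the log level for this class at runtime - a sketch, assuming the default log4j backend that Spark 1.3 ships with:

import org.apache.log4j.{Level, Logger}

// The notice is emitted by TaskSetManager's logger, so raising its
// level to ERROR hides the task-size warning (and its other warnings).
Logger.getLogger("org.apache.spark.scheduler.TaskSetManager").setLevel(Level.ERROR)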
