Spark java.lang.OutOfMemoryError: Java Heap space

Problem description

I am getting the above error when I run a model training pipeline with Spark.

val inputData = spark.read
  .option("header", true)
  .option("mode", "DROPMALFORMED")
  .csv(input)
  .repartition(500)
  .toDF("b", "c")
  .withColumn("b", lower(col("b")))
  .withColumn("c", lower(col("c")))
  .toDF("b", "c")
  .na.drop()

inputData has about 25 million rows and is about 2 GB in size. The model building phase happens like so:

val tokenizer = new Tokenizer()
  .setInputCol("c")
  .setOutputCol("tokens")

val cvSpec = new CountVectorizer()
  .setInputCol("tokens")
  .setOutputCol("features")
  .setMinDF(minDF)
  .setVocabSize(vocabSize)

val nb = new NaiveBayes()
  .setLabelCol("bi")
  .setFeaturesCol("features")
  .setPredictionCol("prediction")
  .setSmoothing(smoothing)

new Pipeline().setStages(Array(tokenizer, cvSpec, nb)).fit(inputData)

I am running the above Spark job locally on a machine with 16 GB RAM, using the following command:

spark-submit --class holmes.model.building.ModelBuilder \
  ./holmes-model-building/target/scala-2.11/holmes-model-building_2.11-1.0.0-SNAPSHOT-7d6978.jar \
  --master local[*] \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryoserializer.buffer.max=2000m \
  --conf spark.driver.maxResultSize=2g \
  --conf spark.rpc.message.maxSize=1024 \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=50g \
  --driver-memory=12g

The OOM error is triggered (at the bottom of the stack trace) by org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:706)

Log:

Caused by: java.lang.OutOfMemoryError: Java heap space
    at java.lang.reflect.Array.newInstance(Array.java:75)
    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1897)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
    at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:706)

Any suggestions would be great :)

Answer

Things I would try:

1) Removing spark.memory.offHeap.enabled=true and increasing driver memory to something like 90% of the available memory on the box. You are probably aware of this since you didn't set executor memory, but in local mode the driver and the executors all run in the same process, which is controlled by driver-memory. I haven't tried it, but the offHeap feature sounds like it has limited value. Reference
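
As a rough, untested sketch, the launch command without the off-heap settings and with a larger driver heap might look like the following. The 14g figure is an assumption (roughly 90% of a 16 GB machine), and the flags are placed in front of the application jar, since spark-submit passes anything after the jar to the application as arguments rather than treating it as Spark options:

spark-submit \
  --class holmes.model.building.ModelBuilder \
  --master local[*] \
  --driver-memory 14g \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryoserializer.buffer.max=2000m \
  --conf spark.driver.maxResultSize=2g \
  --conf spark.rpc.message.maxSize=1024 \
  ./holmes-model-building/target/scala-2.11/holmes-model-building_2.11-1.0.0-SNAPSHOT-7d6978.jar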

2) Use an actual cluster instead of local mode. More nodes will obviously give you more RAM.

3a) If you want to stick with local mode, try using fewer cores. You can do this by specifying the number of cores to use in the master setting, e.g. --master local[4] instead of local[*], which uses all of them. Running with fewer threads simultaneously processing data will lead to less data in RAM at any given time.
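
For example, only the master setting changes relative to the command sketched above (4 is just an illustrative core count, and the --conf options are omitted here for brevity):

spark-submit \
  --class holmes.model.building.ModelBuilder \
  --master local[4] \
  --driver-memory 14g \
  ./holmes-model-building/target/scala-2.11/holmes-model-building_2.11-1.0.0-SNAPSHOT-7d6978.jar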

3b) If you move to a cluster, you may also want to tweak the number of executor cores for the same reason as mentioned above. You can do this with the --executor-cores flag.

4) Try more partitions. In your example code you repartitioned to 500 partitions; maybe try 1000 or 2000? More partitions means each partition is smaller, which means less memory pressure.
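
A minimal sketch of that change in the loading code, assuming the same inputData construction as in the question (2000 is just one of the values suggested above to experiment with):

val inputData = spark.read
  .option("header", true)
  .option("mode", "DROPMALFORMED")
  .csv(input)
  .repartition(2000) // was 500: more, smaller partitions mean less memory pressure per task
  .toDF("b", "c")
  .withColumn("b", lower(col("b")))
  .withColumn("c", lower(col("c")))
  .na.drop()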
