PySpark: java.lang.OutOfMemoryError: Java heap space

Problem Description

I have been using PySpark with IPython lately on my server, which has 24 CPUs and 32 GB of RAM. It runs on only one machine. In my process, I want to collect a huge amount of data, as shown in the code below:

train_dataRDD = (train.map(lambda x: getTagsAndText(x))
                 .filter(lambda x: x[-1] != [])
                 .flatMap(lambda (x, text, tags): [(tag, (x, text)) for tag in tags])
                 .groupByKey()
                 .mapValues(list))

When I do

training_data =  train_dataRDD.collectAsMap()

it gives me an OutOfMemoryError: Java heap space. Also, I cannot perform any operations on Spark after this error, as it loses its connection with Java. It gives Py4JNetworkError: Cannot connect to the java server.

It looks like the heap space is small. How can I set it to a bigger limit?

Edit:

Things I tried before running: sc._conf.set('spark.executor.memory','32g').set('spark.driver.memory','32g').set('spark.driver.maxResultsSize','0')

I changed the Spark options as per the documentation here (if you do Ctrl-F and search for spark.executor.extraJavaOptions): http://spark.apache.org/docs/1.2.1/configuration.html

It says that I can avoid OOMs by setting the spark.executor.memory option. I did that, but it does not seem to be working.

Recommended Answer

After trying out loads of configuration parameters, I found that only one needs to be changed to enable more heap space: spark.driver.memory.

sudo vim $SPARK_HOME/conf/spark-defaults.conf
# uncomment the spark.driver.memory line and change it according to your use; I changed it to the value below
spark.driver.memory 15g
# press Esc, then type :wq! and Enter to save and exit vim

Close your existing Spark application and re-run it. You will not encounter this error again. :)
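
Since this setup runs Spark on a single machine and collectAsMap() pulls the entire result into the driver, it is the driver JVM's heap that matters here, which is likely why raising spark.executor.memory alone did not help. Note also that driver memory has to be fixed before the driver JVM starts, so calling sc._conf.set(...) on an already running SparkContext cannot take effect. As an alternative to editing spark-defaults.conf, the same setting can be passed on the command line when launching PySpark. A minimal sketch (the 15g value and the script name are placeholders, not from the original answer):

# pass driver memory at launch time instead of via spark-defaults.conf
pyspark --driver-memory 15g
# or, for a standalone script (my_script.py is a placeholder name):
spark-submit --driver-memory 15g my_script.py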
