PySpark application fails with java.lang.OutOfMemoryError: Java heap space

Question

I'm running Spark via PyCharm and via the PySpark shell, and I'm stuck with this error:

: java.lang.OutOfMemoryError: Java heap space
    at org.apache.spark.api.python.PythonRDD$.readRDDFromFile(PythonRDD.scala:416)
    at org.apache.spark.api.python.PythonRDD.readRDDFromFile(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:748)

My code is:

from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
import time

if __name__ == '__main__':

    print("Started at " + time.strftime("%H:%M:%S"))

    conf = (SparkConf()
            .setAppName("TestRdd") \
            .set('spark.driver.cores', '1') \
            .set('spark.executor.cores', '1') \
            .set('spark.driver.memory', '16G') \
            .set('spark.executor.memory', '9G'))
    sc = SparkContext(conf=conf)

    rdd = sc.parallelize(range(1000000000),100)

    print(rdd.take(10))

    print("Finished at " + time.strftime("%H:%M:%S"))

These are the maximum memory settings I can use on the cluster. I tried to allocate all of the memory to one core to create the RDD, but it seems the application fails before the dataset is even distributed; I assume it fails at the creation step. I also tried various numbers of partitions, from 100 to 10000. I've calculated how much memory it should take: one billion ints is approximately 4.5-4.7 GB in memory, which is less than I have available, but no luck.

How can I optimize my code and get it to run?

Answer

TL;DR Don't use parallelize outside of tests and simple experiments. Because you use Python 2.7, range is not lazy, so you'll materialize the full range of values multiple times:

  • as a Python list right after the call,
  • as its serialized version, which is later written to disk,
  • and as a serialized copy loaded on the JVM.

Using xrange would help, but you shouldn't be using parallelize in the first place (or Python 2 in 2018).
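
For reference, this is roughly what the xrange variant would look like (a minimal sketch, assuming the same Python 2.7 environment as in the question; the app name is arbitrary):

from pyspark import SparkContext

sc = SparkContext(appName="TestRddXrange")

# xrange is lazy (Python 2 only), so the driver never builds the full
# billion-element Python list before handing the data to parallelize.
rdd = sc.parallelize(xrange(1000000000), 100)
print(rdd.take(10))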

If you want to create a series of values, just use SparkContext.range:

range(start, end=None, step=1, numSlices=None)

Create a new RDD of int containing elements from start to end (exclusive), increased by step every element. It can be called the same way as Python's built-in range() function. If called with a single argument, the argument is interpreted as end, and start is set to 0.

So in your case:

rdd = sc.range(1000000000, numSlices=100)
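
Put together with the rest of the question's script, a minimal sketch would look like this (keeping the question's app name and leaving the memory settings at their defaults):

from pyspark import SparkContext

sc = SparkContext(appName="TestRdd")

# The values are generated lazily inside each partition, so nothing close to
# a billion elements ever has to be materialized on the driver.
rdd = sc.range(1000000000, numSlices=100)
print(rdd.take(10))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]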

Or with a DataFrame:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1000000000, numPartitions=100)
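
A short usage sketch: spark.range produces a single column named id, so getting back to plain ints means unwrapping the Row objects (the lambda below is just an illustration):

df.show(10)  # displays the first ten rows of the id column

# Convert back to an RDD of plain ints if the RDD API is needed.
rdd = df.rdd.map(lambda row: row.id)
print(rdd.take(10))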
