Total allocation exceeds 95.00% (960,285,889 bytes) of heap memory - pyspark error


Problem description

I wrote a script in Python 2.7 that uses pyspark to convert CSV to Parquet, among other things. When I ran my script on small data it worked well, but when I ran it on bigger data (250GB) it crashed with the following error: Total allocation exceeds 95.00% (960,285,889 bytes) of heap memory. How can I solve this problem, and what is the reason it's happening? Thanks!

Partial code:

Imported libraries:

import pyspark as ps
from pyspark.sql.types import (StructType, StructField, IntegerType,
                               DoubleType, StringType, TimestampType,
                               LongType, FloatType)
from collections import OrderedDict
from sys import argv

Using pyspark:

schema_table_name = "schema_" + str(get_table_name())
print(schema_table_name)
schema_file = OrderedDict()

schema_list = []
ddl_to_schema(data)
# schema_file maps each column name to a Spark type constructor
for i in schema_file:
    schema_list.append(StructField(i, schema_file[i]()))

schema = StructType(schema_list)
print(schema)

spark = ps.sql.SparkSession.builder.getOrCreate()

# read the CSV with the explicit schema and write it back out as parquet
df = spark.read.option("delimiter", ",") \
    .format("csv").schema(schema).option("header", "false").load(argv[2])
df.write.parquet(argv[3])

# df.limit(1500).write.jdbc(url=url, table=get_table_name(), mode="append", properties=properties)
# df = spark.read.jdbc(url=url, table=get_table_name(), properties=properties)

pq = spark.read.parquet(argv[3])
pq.show()

Just to clarify, schema_table_name is meant to save all the table names (those in the DDL that fit the csv).

The function ddl_to_schema just takes a regular DDL and edits it into a DDL that Parquet can work with.
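The question does not include ddl_to_schema itself, so the following is a purely hypothetical sketch of how schema_file could end up mapping column names to Spark type constructors (which is what the StructField loop above expects). It assumes a simple "CREATE TABLE t (col type, ...)" DDL; the function name, DDL format, and type mapping are assumptions for illustration, not the asker's code.

import re
from collections import OrderedDict
from pyspark.sql.types import (IntegerType, LongType, DoubleType,
                               StringType, TimestampType)

# hypothetical DDL-type -> Spark-type mapping (assumption for illustration)
DDL_TYPE_MAP = {
    "int": IntegerType,
    "bigint": LongType,
    "double": DoubleType,
    "varchar": StringType,
    "timestamp": TimestampType,
}

schema_file = OrderedDict()

def ddl_to_schema_sketch(ddl):
    # grab the column list between the outermost parentheses
    columns = re.search(r"\((.*)\)", ddl, re.S).group(1)
    for col_def in columns.split(","):
        name, col_type = col_def.split()[:2]
        base_type = re.sub(r"\(.*", "", col_type).lower()  # varchar(255) -> varchar
        schema_file[name] = DDL_TYPE_MAP.get(base_type, StringType)

# example:
# ddl_to_schema_sketch("CREATE TABLE t (id int, name varchar(255), ts timestamp)")
# schema_file -> OrderedDict([('id', IntegerType), ('name', StringType), ('ts', TimestampType)])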

Recommended answer

It seems your driver is running out of memory.

By default the driver memory is set to 1GB. Since your program used 95% of it, the application ran out of memory.

You can try to change it until you reach the "sweet spot" for your needs. Below I'm setting it to 2GB:

pyspark --driver-memory 2g

You can play with the executor memory too, although it doesn't seem to be the problem here (the default value for the executor is 4GB).

pyspark --driver-memory 2g --executor-memory 8g
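As an alternative to passing the flags to the pyspark shell, the same settings can go on a spark-submit invocation or, with a caveat, into the SparkSession builder inside the script. The sketch below is illustrative (the script and file names are placeholders); note that spark.driver.memory must be set before the driver JVM starts, so for a script launched with spark-submit the --driver-memory flag (or spark-defaults.conf) is the reliable route, and the builder config only helps when the script itself launches the JVM, e.g. when run with plain python.

# command-line route (reliable):
#   spark-submit --driver-memory 2g --executor-memory 8g my_script.py input.csv output.parquet

# programmatic route (sketch; only affects driver memory if no JVM is running yet):
import pyspark as ps

spark = (ps.sql.SparkSession.builder
         .config("spark.driver.memory", "2g")     # driver heap size
         .config("spark.executor.memory", "8g")   # per-executor heap size
         .getOrCreate())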

The theory is that Spark actions can offload data to the driver, causing it to run out of memory if it is not properly sized. I can't tell for sure in your case, but it seems that the write is what is causing this.
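To make that concrete (this snippet is illustrative, not taken from the asker's script): an action like collect() pulls every row into the driver's heap, while show() only fetches a few rows and write.parquet() keeps the data on the executors, with the driver mainly coordinating tasks. Assuming a DataFrame named df like the one in the question:

# rows = df.collect()       # would pull the whole 250GB result into the 1GB driver heap -> OOM

df.show()                   # fetches only the first 20 rows to the driver: cheap

df.write.parquet(argv[3])   # rows are written by the executors; the driver coordinates
                            # the tasks and tracks per-task metadata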

You can take a look at the theory here (read about the driver program and then check the actions):

https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#actions
