Total allocation exceeds 95.00% (960,285,889 bytes) of heap memory - pyspark error

Views: 426 Published: 2020/10/12 20:58:28 Tags: python csv pyspark heap parquet


Problem description

I wrote a script in Python 2.7 that uses pyspark to convert csv to parquet, among other things. When I ran the script on a small dataset it worked well, but when I ran it on a larger one (250GB) it crashed with the following error: Total allocation exceeds 95.00% (960,285,889 bytes) of heap memory. How can I solve this problem, and why is it happening? Thanks!
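The numbers in the message can be turned into an approximate heap size, which helps when reasoning about how little memory the JVM has compared with the 250GB input. The sketch below is only arithmetic on the two figures quoted in the error; the variable names are illustrative and nothing here is taken from the original script.

```python
# Threshold reported in the error message, in bytes.
reported_bytes = 960285889
# Fraction of the heap that the message says this threshold represents.
fraction = 0.95

# Heap size implied by the message: threshold divided by the fraction.
heap_bytes = reported_bytes / fraction
heap_mib = heap_bytes / (1024.0 * 1024.0)

print("Implied heap size: about %.0f MiB" % heap_mib)
```

The result is a heap of roughly 1 GB. The question does not say whether that limit was set explicitly or left at a default, nor which JVM (driver or executor) reported it.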

Part of the code:
Imported libraries:
import pyspark as ps
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, StringType, TimestampType, LongType, FloatType
from collections import OrderedDict
from sys import argv

Using pyspark:

schema_table_name = "schema_" + str(get_table_name())
print(schema_table_name)
schema_file = OrderedDict()

schema_list = []
ddl_to_schema(data)  # presumably fills schema_file with column name -> type class
for i in schema_file:
    schema_list.append(StructField(i, schema_file[i]()))

schema = StructType(schema_list)
print(schema)

spark = ps.sql.SparkSession.builder.getOrCreate()
# Read the csv (argv[2]) with the explicit schema, then write it out as parquet (argv[3]).
df = (spark.read.option("delimiter", ",").format("csv")
      .schema(schema).option("header", "false").load(argv[2]))
df.write.parquet(argv[3])

# df.limit(1500).write.jdbc(url=url, table=get_table_name(), mode="append", properties=properties)
# df = spark.read.jdbc(url=url, table=get_table_name(), properties=properties)
pq = spark.read.parquet(argv[3])
pq.show()

Just to clarify: schema_table_name is meant to hold the names of all the tables that are in the DDL and match the csv.

The function ddl_to_schema just takes a regular DDL and edits it into a DDL that parquet can work with.
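The body of ddl_to_schema is not shown in the question, so the following is only a guess at its general shape: a minimal sketch that parses a simple CREATE TABLE statement into an OrderedDict of column names and Spark type names. The function name ddl_to_schema_sketch, the SQL_TO_SPARK mapping, the fallback to StringType, and the sample table trips are all hypothetical.

```python
import re
from collections import OrderedDict

# Hypothetical mapping from SQL column types to pyspark.sql.types class names.
# The mapping used by the original script is not shown, so these are assumptions.
SQL_TO_SPARK = {
    "INT": "IntegerType",
    "INTEGER": "IntegerType",
    "BIGINT": "LongType",
    "FLOAT": "FloatType",
    "DOUBLE": "DoubleType",
    "VARCHAR": "StringType",
    "CHAR": "StringType",
    "TEXT": "StringType",
    "TIMESTAMP": "TimestampType",
}


def ddl_to_schema_sketch(ddl):
    """Return an OrderedDict of column name -> Spark type class name."""
    # Keep only the column list between the first "(" and the last ")".
    body = ddl[ddl.index("(") + 1:ddl.rindex(")")]
    schema = OrderedDict()
    # Split on commas that are not inside parentheses, e.g. DECIMAL(10,2).
    for column in re.split(r",(?![^()]*\))", body):
        parts = column.split()
        if len(parts) < 2:
            continue  # skip empty or malformed fragments
        name = parts[0].strip('`"')
        # Drop any length/precision suffix such as "(32)".
        sql_type = re.sub(r"\(.*", "", parts[1]).upper()
        # Assumption: unknown SQL types are read as strings.
        schema[name] = SQL_TO_SPARK.get(sql_type, "StringType")
    return schema


# Hypothetical DDL, for illustration only.
ddl = "CREATE TABLE trips (id INT, fare DOUBLE, city VARCHAR(32), pickup TIMESTAMP)"
schema_file = ddl_to_schema_sketch(ddl)
for column_name, type_name in schema_file.items():
    print(column_name, type_name)
```

The script in the question calls schema_file[i]() for each column, so its dictionary appears to hold type classes rather than names. With pyspark installed, each name from this sketch could be resolved to a class with getattr(pyspark.sql.types, type_name).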
