Creating dictionary from Pyspark dataframe showing OutOfMemoryError: Java heap space

Problem description

I have seen and tried many existing StackOverflow posts about this issue, but none of them work. I suspect my Java heap space is not as large as needed for my dataset, which contains 6.5M rows. My Linux instance has 64 GB of RAM and 4 cores. As per this suggestion I need to fix my code, but I don't think building a dictionary from a PySpark dataframe should be very costly. Please advise me if there is any other way to compute it.

I just want to build a Python dictionary from my PySpark dataframe. This is the content of my PySpark dataframe.

property_sql_df.show() shows:

+--------------+------------+--------------------+--------------------+
|            id|country_code|                name|   hash_of_cc_pn_li|
+--------------+------------+--------------------+--------------------+
|  BOND-9129450|          US|Scotron Home w/Ga...|90cb0946cf4139e12...|
|  BOND-1742850|          US|Sited in the Mead...|d5c301f00e9966483...|
|  BOND-3211356|          US|NEW LISTING - Com...|811fa26e240d726ec...|
|  BOND-7630290|          US|EC277- 9 Bedroom ...|d5c301f00e9966483...|
|  BOND-7175508|          US|East Hampton Retr...|90cb0946cf4139e12...|
+--------------+------------+--------------------+--------------------+

What I want is a dictionary with hash_of_cc_pn_li as the key and a list of ids as the value.

Expected output

{
  "90cb0946cf4139e12": ["BOND-9129450", "BOND-7175508"],
  "d5c301f00e9966483": ["BOND-1742850", "BOND-7630290"]
}

What I have tried so far

%%time
duplicate_property_list = {}
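# collect() pulls every row of the dataframe onto the driver; with 6.5M rows
# this is what exhausts the Java heap below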
for ind in property_sql_df.collect():
    hashed_value = ind.hash_of_cc_pn_li
    property_id = ind.id
    if hashed_value in duplicate_property_list:
        duplicate_property_list[hashed_value].append(property_id)
    else:
        duplicate_property_list[hashed_value] = [property_id]

This is what I am now getting on the console:

java.lang.OutOfMemoryError: Java heap space

and this in the Jupyter notebook output:

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:33097)

Answer

Adding the accepted answer from the linked post for posterity. It solves the problem by leveraging the write.json method and avoiding the collection of a too-large dataset to the driver:

https://stackoverflow.com/a/63111765/12378881
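
For reference, a minimal sketch of that idea, assuming property_sql_df from the question; the variable name and output path below are illustrative, not part of the linked answer:

# Aggregate on the executors instead of collecting 6.5M rows to the driver:
# group by the hash, collect the matching ids, and write the result as JSON files.
import pyspark.sql.functions as F

hash_to_ids = (
    property_sql_df
    .groupBy("hash_of_cc_pn_li")
    .agg(F.collect_list("id").alias("ids"))
)

# Each output line is a JSON object such as
# {"hash_of_cc_pn_li":"90cb0946cf4139e12...","ids":["BOND-9129450","BOND-7175508"]}
hash_to_ids.write.mode("overwrite").json("/tmp/hash_to_ids")

The JSON files can then be read back lazily or loaded into a plain dictionary downstream if one is still needed, without ever materializing all rows on the driver at once.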
