Python Memory Error encountered when replacing NaN values in large Pandas dataframe


Question

I have a very large pandas dataframe: ~300,000 columns and ~17,520 rows. The dataframe is called result_full. I am attempting to replace every occurrence of the string "NaN" with numpy.nan:

result_full.replace(["NaN"], np.nan, inplace = True)

This is where I get a MemoryError. Is there a memory-efficient way to get rid of these strings in my dataframe? I tried result_full.dropna(), but it didn't work because the values are technically the string "NaN", not actual missing values.

Recommended answer

One possible cause is running on a 32-bit machine (or a 32-bit Python build): a 32-bit process can address only about 2 to 4 GB of memory, depending on the operating system. If possible, move to a 64-bit machine to avoid this problem in the future.
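To check whether this applies, the following sketch reports whether the running interpreter is 32-bit or 64-bit and estimates a lower bound on the memory the frame from the question needs. The figure of 8 bytes per cell is an assumption (one float64, or one pointer per cell on a 64-bit build for object columns); the string objects themselves would add to it.

```python
import struct
import sys

# Pointer size in bits: 32 on a 32-bit interpreter, 64 on a 64-bit one.
bits = struct.calcsize("P") * 8
is_64bit = sys.maxsize > 2**32

# Shape quoted in the question.
n_rows, n_cols = 17_520, 300_000
n_cells = n_rows * n_cols

# Assumed lower bound of 8 bytes per cell; string objects cost extra.
bytes_needed = n_cells * 8
gib_needed = bytes_needed / 2**30

print(f"{bits}-bit Python (64-bit: {is_64bit})")
print(f"{n_cells:,} cells need at least about {gib_needed:.1f} GiB")
```

Under that assumption the frame alone needs roughly 39 GiB, so a 64-bit interpreter helps only if the machine also has enough RAM for the data plus any temporary copies made during the replacement.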

Meanwhile, there is a possible workaround. Write the dataframe out to CSV with df.to_csv(). Once that's done, if you look at the pandas documentation for pd.read_csv(), you will notice this parameter:

na_values : scalar, str, list-like, or dict, default None

Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘nan’.

So when the CSV is read back in, read_csv will recognize the string 'NaN' as np.nan, and your problem should be solved.
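A minimal sketch of that round trip, using a tiny stand-in frame and an in-memory buffer instead of a file on disk (both are assumptions made for illustration; with the real result_full you would pass a file path to to_csv and read_csv):

```python
import io

import pandas as pd

# Tiny stand-in for result_full; the real frame is far larger.
df = pd.DataFrame({"a": ["1.5", "NaN", "3.0"], "b": ["NaN", "2", "4"]})

# Write the frame out as CSV text.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# "NaN" is in read_csv's default NA list, so it is parsed as a missing value.
cleaned = pd.read_csv(buffer)

print(cleaned.dtypes)
print(cleaned.isna().sum())
```

Because the "NaN" strings become real missing values during parsing, the columns in this sketch are also read back as numeric rather than object dtype.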

Also, if you are creating this dataframe directly from a CSV in the first place, you can rely on this parameter when reading the file and avoid the memory problem altogether. Hope it helps. Cheers!
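If the source file uses markers other than the defaults, na_values can list them explicitly. A small sketch, where the CSV text and the markers "missing" and "unknown" are hypothetical and chosen only for illustration:

```python
import io

import pandas as pd

# Hypothetical CSV content with custom missing-value markers.
csv_text = "a,b\n1.5,missing\nNaN,2\n3.0,unknown\n"

# The extra markers are added on top of the default NA strings such as "NaN".
df = pd.read_csv(io.StringIO(csv_text), na_values=["missing", "unknown"])

print(df)
print(df.isna().sum())
```

Handling the markers at parse time means no separate replace() pass over the loaded frame is needed.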
