Python MemoryError encountered when replacing NaN values in a large pandas DataFrame
Question
I have a very large pandas dataframe: ~300,000 columns and ~17,520 rows. The dataframe is called result_full. I am attempting to replace all of the "NaN" strings with numpy.nan:
result_full.replace(["NaN"], np.nan, inplace = True)
This is where I get a MemoryError. Is there a memory-efficient way to replace or drop these strings in my dataframe? I tried result_full.dropna(), but it didn't work because the values are technically the string "NaN" rather than real missing values.
Recommended answer
One possible cause is running a 32-bit build of Python. A 32-bit process can typically address only about 2 GB of memory (up to 4 GB in some configurations), no matter how much RAM the machine has. If possible, move to a 64-bit machine with a 64-bit Python to avoid this kind of problem in the future.
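To check whether this applies, the following sketch reports whether the running interpreter is a 32-bit or 64-bit build and gives a rough lower bound for the size of the frame in the question. The 8-bytes-per-cell figure is an assumption (one float64 value or one object pointer on a 64-bit build); actual usage with Python strings in the cells would be higher.

```python
import struct
import sys

# Pointer size in bits: 32 on a 32-bit Python build, 64 on a 64-bit build.
pointer_bits = struct.calcsize("P") * 8
is_64bit = sys.maxsize > 2**32

# Rough lower bound for the frame described in the question,
# ASSUMING 8 bytes per cell (float64 value or 64-bit object pointer).
n_rows, n_cols = 17_520, 300_000
approx_gib = n_rows * n_cols * 8 / 2**30

print(f"Python build: {pointer_bits}-bit (64-bit: {is_64bit})")
print(f"Estimated minimum size of the frame: {approx_gib:.1f} GiB")
```

If the estimate is far above the available RAM, switching to 64-bit Python alone may not be enough, and any operation that copies the whole frame can still fail.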
In the meantime, there is a possible workaround. Write the dataframe out to a CSV file with df.to_csv(). Then, in the pandas documentation for pd.read_csv(), you will notice this parameter:
na_values : scalar, str, list-like, or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'nan'.
So when the file is read back with pd.read_csv(), the string 'NaN' is recognized as np.nan, which should solve your problem.
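A minimal sketch of that round trip is below. The tiny frame and the in-memory buffer are stand-ins chosen for illustration (the real result_full would be written to a file on disk), and na_values=["NaN"] is passed explicitly even though 'NaN' is already in the default list.

```python
import io

import numpy as np
import pandas as pd

# Tiny stand-in for result_full: cells hold strings, including the text "NaN".
result_full = pd.DataFrame({"a": ["1.5", "NaN", "3.0"],
                            "b": ["NaN", "2", "4"]})

# Write the frame out as CSV. An in-memory buffer stands in for a real file.
buffer = io.StringIO()
result_full.to_csv(buffer, index=False)
buffer.seek(0)

# Read it back; the parser turns the text "NaN" into a real np.nan.
cleaned = pd.read_csv(buffer, na_values=["NaN"])

print(cleaned.dtypes)   # columns are parsed as numeric, not object
print(cleaned.isna())   # True where the text "NaN" used to be
```

A side effect is that columns containing only numbers and missing values come back with a numeric dtype, which generally takes less memory than a column of Python strings.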
Also, if you are creating this dataframe directly from a CSV file in the first place, you can pass this parameter when reading it and avoid the memory problem altogether. Hope it helps. Cheers!
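As an illustration of handling missing-value markers at load time, the sketch below reads CSV text directly and uses the dict form of na_values described in the documentation excerpt. The sample data and the extra marker "missing" are hypothetical and only there to show the per-column behaviour.

```python
import io

import pandas as pd

# Hypothetical CSV content; a real workflow would pass a file path instead.
csv_text = "a,b\n1.5,NaN\nNaN,missing\n3.0,4\n"

# "NaN" is recognized by default in every column. The dict form adds
# "missing" as an extra NA marker for column "b" only.
df = pd.read_csv(io.StringIO(csv_text), na_values={"b": ["missing"]})

print(df)
print(df.isna().sum())  # count of missing values per column
```

Because the markers are converted while parsing, no separate replace() pass over the full frame is needed afterwards.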