Maximum size of pandas dataframe


Problem description

I am trying to read in a somewhat large dataset using pandas' read_csv or read_stata functions, but I keep running into MemoryErrors. What is the maximum size of a dataframe? My understanding is that a dataframe should be fine as long as the data fits into memory, which shouldn't be a problem for me. What else might cause the memory error?

For context, I am trying to read the Survey of Consumer Finances 2007, both in ASCII format (using read_csv) and in Stata format (using read_stata). The file is around 200MB as dta and around 1.2GB as ASCII, and opening it in Stata tells me that there are 5,800 variables/columns for 22,000 observations/rows.
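For reference, a minimal sketch of the two reads described above (the file names are placeholders, not the actual SCF file names):

import pandas as pd

# Either of these can raise MemoryError on a ~5,800-column file
# once the process runs out of addressable memory.
df_ascii = pd.read_csv("scf2007_ascii.txt")   # placeholder path; the delimiter may need sep=
df_stata = pd.read_stata("scf2007.dta")       # placeholder path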

Solution


I'm going to post this answer since it was discussed in the comments. I've seen this question come up numerous times without an accepted answer.

A MemoryError is intuitive: you have run out of memory. But sometimes the solution or the debugging of this error is frustrating because you seem to have enough memory, yet the error remains.

1) Check for code errors

This may be a "dumb step", but that's why it's first. Make sure there are no infinite loops or things that will knowingly take a long time (like using something from the os module that searches your entire computer and puts the output in an Excel file).

2) Make your code more efficient

This goes along the lines of step 1. But if something simple is taking a long time, there's usually a module or a better way of doing it that is faster and more memory efficient. That's the beauty of Python and/or open-source languages!
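For the read_csv case in the question, a common memory win is to load only the columns you need and to give pandas narrower dtypes up front. A minimal sketch, with made-up column names and dtypes:

import pandas as pd

# Hypothetical column subset and dtypes - adjust to the real file.
usecols = ["Y1", "X42001", "X8021"]
dtypes = {"Y1": "int32", "X42001": "float32", "X8021": "int8"}

df = pd.read_csv(
    "scf2007_ascii.txt",   # placeholder path
    usecols=usecols,       # skip columns you don't need
    dtype=dtypes,          # avoid defaulting everything to int64/float64/object
)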

3) Check the total memory of the object

The first step is to check the memory of an object. There are a ton of threads on Stack Overflow about this, so you can search them. Popular answers are here and here.

To find the size of an object in bytes, you can always use sys.getsizeof():

import sys

# prints the size of the object in bytes
print(sys.getsizeof(OBJECT_NAME_HERE))

Now the error might happen before anything is created, but if you read the csv in chunks you can see how much memory is being used per chunk.
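A minimal sketch of reading the CSV in chunks and printing the memory used by each one (path and chunk size are placeholders); note that for a DataFrame, chunk.memory_usage(deep=True) is usually more telling than sys.getsizeof:

import pandas as pd

# Placeholder path and chunk size.
for i, chunk in enumerate(pd.read_csv("scf2007_ascii.txt", chunksize=1000)):
    # deep=True also counts memory held by object (string) columns
    mib = chunk.memory_usage(deep=True).sum() / 1024 ** 2
    print(f"chunk {i}: {len(chunk)} rows, {mib:.1f} MiB")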

4) Check the memory while running

Sometimes you have enough memory, but the function you are running consumes a lot of memory at runtime. This causes memory to spike beyond the actual size of the finished object, causing the code/process to error. Checking memory in real time is lengthy, but it can be done. IPython is good for that; check its documentation.

Use the code below to see their documentation straight from a Jupyter notebook:

%mprun?
%memit?

Sample use:

# load the memory_profiler IPython extension first (pip install memory_profiler)
%load_ext memory_profiler
def lol(x):
    return x
%memit lol(500)
# output --- peak memory: 48.31 MiB, increment: 0.00 MiB
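If you also want line-by-line figures from %mprun, the profiled function has to live in a file on disk rather than be defined in the notebook itself. A minimal sketch, in two separate cells (file and function names are made up):

%%file mprun_demo.py
def build_list(n):
    # allocate a list just so there is something measurable
    return [0] * n

from mprun_demo import build_list
%mprun -f build_list build_list(10_000_000)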

If you need help with magic functions, this is a great post.

5) This one may belong first... but check for simple things like the bit version.

As in your case, simply switching the version of Python you were running solved the issue.
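A quick way to check which bit version of Python you are running (a 32-bit process can only address a few GB of memory, which is easy to exhaust when parsing a 1.2GB CSV):

import struct
import sys

print(struct.calcsize("P") * 8)   # pointer size in bits: 32 or 64
print(sys.maxsize > 2**32)        # True on a 64-bit build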

Usually the above steps solve my issues.
