Pandas read_stata() with large .dta files

Question

I am working with a Stata .dta file that is around 3.3 gigabytes, so it is large but not excessively large. I am interested in using IPython and tried to import the .dta file using Pandas, but something wonky is going on. My box has 32 gigabytes of RAM, and attempting to load the .dta file results in all the RAM being used (after ~30 minutes) and my computer stalling out. This doesn't 'feel' right, since I am able to open the file in R using read.dta() from the foreign package with no problem, and working with the file in Stata is fine. The code I am using is:

%time myfile = pd.read_stata(data_dir + 'my_dta_file.dta')

I am using IPython in Enthought's Canopy program. I use '%time' because I am interested in benchmarking this against R's read.dta().

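My questions are:

  1. Is there something I am doing wrong that is causing Pandas to have issues?
  2. Is there a workaround to get the data into a Pandas dataframe?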

Recommended answer

Here is a little function that has been handy for me, using some pandas features that might not have been available when the question was originally posed:

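import sys

import pandas as pd


def load_large_dta(fname, chunksize=100 * 1000):
    """Read a Stata file in chunks; Ctrl-C keeps whatever has loaded so far."""
    chunks = []

    try:
        # Passing chunksize makes read_stata return a reader that yields DataFrames
        with pd.read_stata(fname, chunksize=chunksize) as reader:
            for chunk in reader:
                chunks.append(chunk)
                print('.', end='')  # one dot per chunk as a progress indicator
                sys.stdout.flush()
    except KeyboardInterrupt:
        pass

    # Concatenate once at the end rather than growing a DataFrame chunk by chunk
    df = pd.concat(chunks, ignore_index=True) if chunks else pd.DataFrame()
    print('\nloaded {} rows'.format(len(df)))

    return df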

I loaded an 11 GB Stata file in 100 minutes with this, and it's nice to have something to play with if I get tired of waiting and hit Ctrl-C.

This notebook shows it in action.
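Before pointing a chunked reader at a multi-gigabyte file, it can help to try the same idea on a tiny synthetic one. The following is a minimal sketch under stated assumptions, not the answer author's code: the helper name `read_stata_in_chunks`, the file name `demo.dta`, and the `id`/`value` columns are all hypothetical and exist only for illustration.

```python
import os
import tempfile

import pandas as pd


def read_stata_in_chunks(path, chunksize):
    """Hypothetical helper: read a .dta file chunk by chunk.

    Returns the concatenated DataFrame and the number of chunks read.
    """
    chunks = []
    # With chunksize set, read_stata returns a reader that yields DataFrames
    with pd.read_stata(path, chunksize=chunksize) as reader:
        for chunk in reader:
            chunks.append(chunk)
    return pd.concat(chunks, ignore_index=True), len(chunks)


# Write a small synthetic Stata file (made-up data, for illustration only)
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.dta")
original = pd.DataFrame({"id": range(10), "value": [i * 0.5 for i in range(10)]})
original.to_stata(demo_path, write_index=False)

# Read it back four rows at a time
loaded, n_chunks = read_stata_in_chunks(demo_path, chunksize=4)
print(len(loaded), n_chunks)
```

If each chunk can be filtered or aggregated as it arrives, appending only the reduced result to `chunks` keeps peak memory closer to the size of one chunk than to the size of the whole file.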
