Major memory problems reading in a CSV file using numpy


Question

I grabbed the KDD track1 dataset from Kaggle and decided to load a ~2.5GB 3-column CSV file into memory on my 16GB high-memory EC2 instance:

data = np.loadtxt('rec_log_train.txt')

The Python session ate up all my memory (100%) and then got killed.

I then read the same file using R (via read.table). It used less than 5GB of RAM, which dropped to less than 2GB after I called the garbage collector.

My question is: why did this fail under numpy, and what is the proper way of reading a file into memory? Yes, I can use generators and avoid the problem, but that's not the goal.

Recommended answer

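import re

import numpy as np
import pandas


def load_file(filename, num_cols, delimiter='\t'):
    data = None
    try:
        # Fast path: reuse the binary cache written by a previous call.
        data = np.load(filename + '.npy')
    except IOError:
        # No cache yet: parse the text file once, then write the cache.
        splitter = re.compile(delimiter)

        def items(infile):
            # Yield one field at a time so no per-line lists pile up.
            for line in infile:
                for item in splitter.split(line):
                    yield float(item)

        with open(filename, 'r') as infile:
            # count=-1 consumes the whole iterator into a flat 1-D array.
            data = np.fromiter(items(infile), np.float64, -1)
            data = data.reshape((-1, num_cols))
            # np.save appends '.npy', matching the np.load call above.
            np.save(filename, data)

    return pandas.DataFrame(data)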

This reads in the 2.5GB file and serializes the output matrix. The input file is read "lazily", so no intermediate data structures are built and minimal memory is used. The initial load takes a long time, but each subsequent load (of the serialized file) is fast. Please let me know if you have tips!
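To see the lazy-read idea in isolation, here is a minimal sketch on a tiny in-memory sample. The helper names `iter_fields` and `stream_to_array` and the sample data are made up for illustration, and skipping blank lines is an added assumption that the answer's code does not make.

```python
import io

import numpy as np


def iter_fields(lines, delimiter="\t"):
    """Yield every field of every non-blank line as a float, one at a time."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are skipped rather than parsed
        for field in line.split(delimiter):
            yield float(field)


def stream_to_array(lines, num_cols, delimiter="\t"):
    """Fill a flat float64 array from the generator, then reshape it to 2-D."""
    flat = np.fromiter(iter_fields(lines, delimiter), dtype=np.float64)
    return flat.reshape((-1, num_cols))


# Hypothetical three-column, tab-separated sample standing in for the real file.
sample = io.StringIO("1\t2\t3\n4\t5\t6\n\n7\t8\t9\n")
arr = stream_to_array(sample, num_cols=3)
print(arr.shape)
```

Because `np.fromiter` pulls values straight from the generator, only one line of text needs to be held as Python objects at any moment; the rest of the memory goes to the float64 array itself.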
