Reading a huge .csv file

Question

I'm currently trying to read data from .csv files in Python 2.7, with up to 1 million rows and 200 columns (the files range from 100 MB to 1.6 GB). I can do this (very slowly) for files with under 300,000 rows, but once I go above that I get memory errors. My code looks like this:

    def getdata(filename, criteria):
        data = []
        for criterion in criteria:
            data.append(getstuff(filename, criterion))
        return data

    def getstuff(filename, criterion):
        import csv
        data = []
        with open(filename, "rb") as csvfile:
            datareader = csv.reader(csvfile)
            for row in datareader:
                if row[3] == "column header":
                    data.append(row)
                elif len(data) < 2 and row[3] != criterion:
                    pass
                elif row[3] == criterion:
                    data.append(row)
                else:
                    return data

The reason for the else clause in the getstuff function is that all the rows matching the criterion are listed together in the csv file, so I leave the loop once I'm past them to save time.

My questions are:


  1. How can I manage to get this to work with the bigger files?
  2. Is there any way I can make it faster?

My computer has 8 GB of RAM, runs 64-bit Windows 7, and has a 3.40 GHz processor (not certain what information you need).

Any help would be greatly appreciated.

Answer

You are reading all rows into a list, then processing that list. Don't do that.

Process your rows as you produce them. If you need to filter the data first, use a generator function:

    import csv

    def getstuff(filename, criterion):
        with open(filename, "rb") as csvfile:
            datareader = csv.reader(csvfile)
            count = 0
            for row in datareader:
                if row[3] in ("column header", criterion):
                    yield row
                    count += 1
                elif count < 2:
                    # haven't reached the matching block yet; keep scanning
                    continue
                else:
                    # the matching rows are grouped together, so the first
                    # non-matching row after them means we are done
                    return

I also simplified your filter test; the logic is the same but more concise.

You can now loop over getstuff() directly. Do the same in getdata():

    def getdata(filename, criteria):
        for criterion in criteria:
            # one full pass over the file per criterion
            for row in getstuff(filename, criterion):
                yield row
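
For reference, on Python 3.3 and later the inner loop can be collapsed with yield from; the question targets Python 2.7, where the explicit loop above is the idiomatic equivalent:

    def getdata(filename, criteria):
        for criterion in criteria:
            # Python 3.3+ only; delegates iteration to the inner generator
            yield from getstuff(filename, criterion)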

Now loop directly over getdata() in your code:

    for row in getdata(somefilename, sequence_of_criteria):
        # process row
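
As a concrete sketch (the file names and criteria below are made up for illustration), the filtered rows can be streamed straight into an output file, so the full result set never has to fit in memory:

    import csv

    somefilename = "data.csv"                   # hypothetical input file
    sequence_of_criteria = ["alpha", "beta"]    # hypothetical criteria

    # "wb" because the csv module on Python 2 expects binary-mode files
    with open("filtered.csv", "wb") as outfile:
        writer = csv.writer(outfile)
        for row in getdata(somefilename, sequence_of_criteria):
            writer.writerow(row)  # each row is written out, then discarded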

You now only hold one row in memory at a time, instead of thousands of rows per criterion.

yield makes a function a generator function, which means it won't do any work until you start looping over it.
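
A minimal demonstration of that laziness (the function name is arbitrary):

    def numbers():
        print "generator body started"  # Python 2 print statement
        yield 1
        yield 2

    gen = numbers()   # nothing printed yet: the body has not run
    for n in gen:     # iteration starts the body, printing the message
        print n       # then prints 1 and 2 as they are yielded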
