Reading a huge .csv file
Question
I'm currently trying to read data from .csv files in Python 2.7 with up to 1 million rows and 200 columns (files range from 100 MB to 1.6 GB). I can do this (very slowly) for the files with under 300,000 rows, but once I go above that I get memory errors. My code looks like this:
def getdata(filename, criteria):
    data = []
    for criterion in criteria:
        data.append(getstuff(filename, criterion))
    return data
def getstuff(filename, criterion):
    import csv
    data = []
    with open(filename, "rb") as csvfile:
        datareader = csv.reader(csvfile)
        for row in datareader:
            if row[3] == "column header":
                data.append(row)
            elif len(data) < 2 and row[3] != criterion:
                pass
            elif row[3] == criterion:
                data.append(row)
            else:
                return data
The reason for the else clause in the getstuff function is that all the elements which fit the criterion will be listed together in the csv file, so I leave the loop when I get past them to save time.
My questions are:
How can I manage to get this to work with the bigger files?
Is there any way I can make it faster?
My computer has 8 GB RAM, running 64-bit Windows 7, and the processor is 3.40 GHz (not certain what information you need).
Any help would be greatly appreciated.
Accepted answer
You are reading all rows into a list, then processing that list. Don't do that.
Process your rows as you produce them. If you need to filter the data first, use a generator function:
import csv

def getstuff(filename, criterion):
    with open(filename, "rb") as csvfile:
        datareader = csv.reader(csvfile)
        count = 0
        for row in datareader:
            if row[3] in ("column header", criterion):
                yield row
                count += 1
            elif count < 2:
                continue
            else:
                return
I also simplified your filter test; the logic is the same but more concise.
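The same "skip until the group starts, stop when it ends" idea can also be sketched with itertools, assuming the matching rows are contiguous as the question describes (text-mode open shown here; the original Python 2.7 code opens with "rb"):

```python
import csv
from itertools import dropwhile, takewhile

def getstuff(filename, criterion):
    # Python 2.7 would open the file with "rb"; text mode shown here
    with open(filename) as csvfile:
        datareader = csv.reader(csvfile)
        yield next(datareader)  # pass the header row through
        # skip rows until the first row matching the criterion...
        started = dropwhile(lambda row: row[3] != criterion, datareader)
        # ...then yield rows only while they still match
        for row in takewhile(lambda row: row[3] == criterion, started):
            yield row
```

dropwhile and takewhile only work here because the matching rows sit in one contiguous block; with scattered matches you would need a plain filter over the whole file.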
You can now loop over getstuff() directly. Do the same in getdata():
def getdata(filename, criteria):
    for criterion in criteria:
        for row in getstuff(filename, criterion):
            yield row
Now loop directly over getdata() in your code:
for row in getdata(somefilename, sequence_of_criteria):
    # process row
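Putting the pieces together, here is a minimal end-to-end sketch; the file contents and the criterion "y" are invented for illustration, and column 4 (index 3) is assumed to hold the grouping value as in the question (text-mode open shown; the original Python 2.7 code uses "rb"):

```python
import csv
import os
import tempfile

def getstuff(filename, criterion):
    # Python 2.7 would open the file with "rb"; text mode shown here
    with open(filename) as csvfile:
        datareader = csv.reader(csvfile)
        count = 0
        for row in datareader:
            if row[3] in ("column header", criterion):
                yield row
                count += 1
            elif count < 2:
                continue
            else:
                return

def getdata(filename, criteria):
    for criterion in criteria:
        for row in getstuff(filename, criterion):
            yield row

# Tiny invented file so the sketch runs end to end; the "y" rows
# are contiguous, matching the layout described in the question.
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w") as f:
    f.write("a,b,c,column header\n")
    f.write("1,2,3,x\n")
    f.write("4,5,6,y\n")
    f.write("7,8,9,y\n")
    f.write("10,11,12,z\n")

matches = [row for row in getdata(path, ["y"])]
os.remove(path)
```

matches ends up holding the header row plus the two contiguous "y" rows, and at no point is the whole file held in memory.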
You now only hold one row in memory, instead of thousands of rows per criterion.
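You can see the difference directly: a generator object stays the same size no matter how many items it will produce, while a list pays for every element up front (a small illustration, not part of the original answer):

```python
import sys

million_list = [n for n in range(1000000)]  # one million ints, all in memory
million_gen = (n for n in range(1000000))   # a generator: produces them on demand

print(sys.getsizeof(million_list))  # several megabytes for the list object alone
print(sys.getsizeof(million_gen))   # a few hundred bytes, regardless of length
```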
yield makes a function a generator function, which means it won't do any work until you start looping over it.
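A tiny demonstration of that laziness (the produce function is hypothetical, just to show when the body actually runs):

```python
log = []

def produce():
    log.append("started")  # runs only when iteration begins
    yield 1
    log.append("resumed")  # runs when the next item is requested
    yield 2

gen = produce()    # calling the function does no work yet
assert log == []
first = next(gen)  # now the body runs up to the first yield
assert first == 1 and log == ["started"]
```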