64-bit system, 8 GB of RAM, a bit more than 800 MB of CSV, and reading with Python gives a MemoryError
Question
import csv
import numpy as np
from itertools import islice

f = open("data.csv")
f.seek(0)
f_reader = csv.reader(f)
raw_data = np.array(list(islice(f_reader, 0, 10000000)), dtype=int)
The above is the code I am using to read a CSV file. The CSV file is only about 800 MB and I am using a 64-bit system with 8 GB of RAM. The file contains 100 million lines. However, never mind reading the entire file: even reading just the first 10 million lines gives me a 'MemoryError:' <- this is really the entire error message.
Could someone tell me why? Also, as a side question: how can I start reading from, say, the 20 millionth row? I know I need to use f.seek(some number), but since my data is a CSV file I don't know exactly which number to put into f.seek() so that it starts reading exactly at the 20 millionth row.
Thank you very much.
Answer
> could someone tell me how to read from, say, the 20 millionth row please? I know I need to use f.seek(some number)
No, you can't (and mustn't) use f.seek() in this situation: CSV rows have variable lengths, so there is no byte offset you can compute without scanning the file. Rather, you must read through each of the first 20 million rows somehow.
The Python documentation (itertools recipes) has this recipe:
import collections
from itertools import islice

def consume(iterator, n):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    # Use functions that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)
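As a quick illustration of what the recipe does (a minimal, self-contained sketch using a plain range iterator rather than a CSV reader; the recipe is repeated here so the snippet runs on its own):

```python
import collections
from itertools import islice

def consume(iterator, n):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    if n is None:
        collections.deque(iterator, maxlen=0)
    else:
        next(islice(iterator, n, n), None)

it = iter(range(10))
consume(it, 3)   # discards items 0, 1, 2
print(next(it))  # prints 3
```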
Using that, you would start after the first 20,000,000 rows like this:
# UNTESTED
f = open("data.csv")
f_reader = csv.reader(f)
consume(f_reader, 20000000)
raw_data = np.array(list(islice(f_reader, 0, 10000000)), dtype=int)
Or this might be faster:
# UNTESTED
f = open("data.csv")
consume(f, 20000000)  # skip raw lines before they are ever CSV-parsed
f_reader = csv.reader(f)
raw_data = np.array(list(islice(f_reader, 0, 10000000)), dtype=int)
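As for the MemoryError itself: list(islice(f_reader, 0, 10000000)) materializes ten million Python lists of Python strings before NumPy ever sees them, which costs far more memory than the file's raw bytes suggest. One way to sidestep that intermediate list is np.fromiter over a flattened generator. This is a sketch, not tested against the original data: it assumes every row has the same known number of integer columns (here ncols), and it uses a small in-memory CSV as a stand-in for open("data.csv"):

```python
import csv
import io

import numpy as np

# Stand-in for open("data.csv"): a tiny in-memory CSV with 2 columns.
f = io.StringIO("1,2\n3,4\n5,6\n")
f_reader = csv.reader(f)

ncols = 2  # assumption: a fixed, known column count

# Flatten the rows into one stream of ints; np.fromiter fills the
# array directly, with no intermediate list of lists.
flat = (int(x) for row in f_reader for x in row)
raw_data = np.fromiter(flat, dtype=np.int64).reshape(-1, ncols)
print(raw_data.shape)  # prints (3, 2)
```

The same pattern combines with consume() and islice() above to skip ahead and cap how many rows are read.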