64 bit system, 8 GB of RAM, a bit more than 800 MB of CSV, and reading with Python gives a memory error


Problem description

import csv
from itertools import islice
import numpy as np

f = open("data.csv")
f_reader = csv.reader(f)
raw_data = np.array(list(islice(f_reader, 0, 10000000)), dtype=int)


The above is the code I am using to read a CSV file. The CSV file is only about 800 MB, and I am using a 64 bit system with 8 GB of RAM. The file contains 100 million lines. However, even reading just the first 10 million lines, let alone the entire file, gives me a 'MemoryError:' (that really is the entire error message).


Could someone tell me why? Also, as a side question, could someone tell me how to read from, say, the 20 millionth row? I know I need to use f.seek(some number), but since my data is a CSV file I don't know exactly which number to pass to f.seek() so that it starts reading exactly at the 20 millionth row.

Thank you very much.

Recommended answer


Could someone tell me how to read from, say, the 20 millionth row? I know I need to use f.seek(some number).


No, you can't (and shouldn't) use f.seek() in this situation: CSV rows vary in byte length, so there is no way to compute the byte offset of row 20,000,000 without actually reading the file. Rather, you must read (and discard) each of the first 20 million rows somehow.

The Python itertools documentation has the following recipe:

import collections
from itertools import islice

def consume(iterator, n):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    # Use functions that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)
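As a quick sanity check of what consume does (the recipe is repeated here so the snippet runs on its own):

```python
import collections
from itertools import islice

def consume(iterator, n):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    if n is None:
        collections.deque(iterator, maxlen=0)
    else:
        next(islice(iterator, n, n), None)

it = iter(range(10))
consume(it, 3)           # discards 0, 1, 2
print(next(it))          # -> 3
consume(it, None)        # exhausts the rest
print(next(it, "done"))  # -> done
```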


Using that, you would skip past the first 20,000,000 rows like this:

# UNTESTED
import csv
from itertools import islice
import numpy as np

f = open("data.csv")
f_reader = csv.reader(f)
consume(f_reader, 20000000)  # skip the first 20 million parsed rows
raw_data = np.array(list(islice(f_reader, 0, 10000000)), dtype=int)

Or this may be faster:

# UNTESTED
import csv
from itertools import islice
import numpy as np

f = open("data.csv")
consume(f, 20000000)  # skip 20 million raw lines, avoiding CSV parsing overhead
f_reader = csv.reader(f)
raw_data = np.array(list(islice(f_reader, 0, 10000000)), dtype=int)
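As for why the MemoryError happens in the first place: list(islice(f_reader, ...)) builds millions of Python lists of Python strings before NumPy ever sees the data, which can cost far more RAM than the 800 MB file size suggests. A minimal sketch of a lower-overhead approach, assuming every field is an integer and every row has the same number of columns (load_rows and ncols are illustrative names, not from the original answer):

```python
import csv
from itertools import islice
import numpy as np

def load_rows(path, start, count, ncols):
    """Load `count` CSV rows starting at row `start` into an int array
    without materializing an intermediate list of Python lists."""
    with open(path) as f:
        reader = csv.reader(f)
        rows = islice(reader, start, start + count)
        # flatten the fields into a 1-D stream of ints consumed at C speed
        flat = (int(v) for row in rows for v in row)
        return np.fromiter(flat, dtype=np.int64).reshape(-1, ncols)
```

np.fromiter allocates the values directly into the array's buffer as the generator is consumed, so peak memory stays close to the size of the final array.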

