Pandas.read_csv() MemoryError
Question
I have a 1 GB CSV file with about 10,000,000 (10 million) rows. I need to iterate through the rows to get the max of a few selected rows (based on a condition). The issue is reading the CSV file.
I use the Pandas package for Python. The read_csv() function throws a MemoryError while reading the CSV file. 1) I have tried to split the file into chunks and read them; now the concat() function has a memory issue.
tp = pd.read_csv('capture2.csv', iterator=True, chunksize=10000,
                 dtype={'timestamp': float, 'vdd_io_soc_i': float,
                        'vdd_io_soc_v': float, 'vdd_io_plat_i': float,
                        'vdd_io_plat_v': float, 'vdd_ext_flash_i': float,
                        'vdd_ext_flash_v': float, 'vsys_i vsys_v': float,
                        'vdd_aon_dig_i': float, 'vdd_aon_dig_v': float,
                        'vdd_soc_1v8_i': float, 'vdd_soc_1v8_v': float})
df = pd.concat(tp, ignore_index=True)
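Since the goal is only a maximum over rows matching a condition, the chunk iterator can be reduced on the fly instead of concatenated, so only one chunk plus a running result is ever in memory. A minimal sketch; the column names and the condition are illustrative (borrowed loosely from the dtype dict above), and the in-memory CSV stands in for capture2.csv:

```python
import io
import pandas as pd

# Hypothetical small sample standing in for capture2.csv.
csv_data = io.StringIO(
    "timestamp,vdd_io_soc_i,vdd_io_soc_v\n"
    "0.0,1.5,3.3\n"
    "0.1,2.5,3.2\n"
    "0.2,0.5,3.4\n"
)

# Stream the file in chunks and reduce each chunk immediately,
# instead of building one giant DataFrame with pd.concat().
running_max = None
for chunk in pd.read_csv(csv_data, chunksize=2,
                         dtype={'timestamp': float,
                                'vdd_io_soc_i': float,
                                'vdd_io_soc_v': float}):
    # Example condition: keep only rows where vdd_io_soc_v exceeds 3.25.
    selected = chunk[chunk['vdd_io_soc_v'] > 3.25]
    if not selected.empty:
        chunk_max = selected['vdd_io_soc_i'].max()
        running_max = chunk_max if running_max is None else max(running_max, chunk_max)

print(running_max)  # → 1.5
```

The peak memory here is one chunk (10,000 rows in the question's settings) rather than the full 10 million rows.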
I have used dtype to reduce the memory footprint, but there is still no improvement.
Based on multiple blog posts, I have updated numpy and pandas to the latest versions. Still no luck.
It would be great if anyone has a solution to this issue.
Please note:
I have a 64-bit operating system (Windows 7)
I am running Python 2.7.10 (default, May 23 2015, 09:40:32) [MSC v.1500 32 bit]
I have 4 GB of RAM.
Numpy latest (pip installer says latest version installed)
Pandas latest (pip installer says latest version installed)
Answer
Pandas read_csv() has a low_memory flag.
tp = pd.read_csv('capture2.csv',low_memory=True, ...)
The low_memory flag is only available if you use the C parser:
engine : {'c', 'python'}, optional
Parser engine to use. The C engine is faster, while the Python engine is currently more feature-complete.
You can also use the memory_map flag:
memory_map : boolean, default False
If a file path is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.
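A minimal sketch of memory_map in use. The temporary file written here is a hypothetical stand-in for capture2.csv, since memory_map only takes effect when a real file path (not a buffer) is passed:

```python
import os
import tempfile

import pandas as pd

# Write a tiny stand-in file; memory_map requires a real path on disk.
with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f:
    f.write("timestamp,vdd_io_soc_i\n0.0,1.5\n0.1,2.5\n")
    path = f.name

try:
    # memory_map=True maps the file into memory instead of using
    # buffered reads; it is honored by the C engine for file paths.
    df = pd.read_csv(path, memory_map=True, engine='c')
    print(len(df))  # → 2
finally:
    os.unlink(path)
```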
p.s. use 64-bit Python - see my comment
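The bitness of the running interpreter can be confirmed from its pointer size; a 32-bit build caps the process at roughly 2 GB of address space on Windows regardless of installed RAM, which matters here because the questioner's Python is 32-bit on a 64-bit OS:

```python
import struct
import sys

# Pointer size in bits: 32 for a 32-bit interpreter, 64 for a 64-bit one.
bits = struct.calcsize("P") * 8
print("Python is %d-bit (%s)" % (bits, sys.version.split()[0]))
```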