当使用"pandas.read_hdf()"读取巨大的HDF5文件时,即使我通过指定块大小读取了块,为什么仍然仍然出现MemoryError? [英] When reading huge HDF5 file with "pandas.read_hdf() ", why do I still get MemoryError even though I read in chunks by specifying chunksize?

Problem Description

I use python pandas to read a few large CSV files and store them in an HDF5 file; the resulting HDF5 file is about 10GB. The problem happens when reading it back: even though I try to read it back in chunks, I still get a MemoryError.

import glob, os
import numpy as np
import pandas as pd

hdf = pd.HDFStore('raw_sample_storage2.h5')

os.chdir("C:/RawDataCollection/raw_samples/PLB_Gate")
for filename in glob.glob("RD_*.txt"):
    # dateparse is a user-defined date-parsing function defined elsewhere
    raw_df = pd.read_csv(filename,
                         sep=' ',
                         header=None,
                         names=['time', 'GW_time', 'node_id', 'X', 'Y', 'Z', 'status', 'seq', 'rssi', 'lqi'],
                         dtype={'GW_time': np.uint32, 'node_id': np.uint8, 'X': np.uint16, 'Y': np.uint16, 'Z': np.uint16, 'status': np.uint8, 'seq': np.uint8, 'rssi': np.int8, 'lqi': np.uint8},
                         parse_dates=['time'],
                         date_parser=dateparse,
                         chunksize=50000,
                         skip_blank_lines=True)
    for chunk in raw_df:
        hdf.append('raw_sample_all', chunk, format='table', data_columns=True, index=True,
                   complib='blosc', complevel=9)  # pandas expects complib=/complevel= for compression

Here's how I tried to read it back in chunks:

for df in pd.read_hdf('raw_sample_storage2.h5','raw_sample_all', chunksize=300000):
    print(df.head(1))

Here's the error message I received:

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-7-ef278566a16b> in <module>()
----> 1 for df in pd.read_hdf('raw_sample_storage2.h5','raw_sample_all', chunksize=300000):
      2     print(df.head(1))

C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in read_hdf(path_or_buf, key, **kwargs)
    321         store = HDFStore(path_or_buf, **kwargs)
    322         try:
--> 323             return f(store, True)
    324         except:
    325 

C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in <lambda>(store, auto_close)
    303 
    304     f = lambda store, auto_close: store.select(
--> 305         key, auto_close=auto_close, **kwargs)
    306 
    307     if isinstance(path_or_buf, string_types):

C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
    663                            auto_close=auto_close)
    664 
--> 665         return it.get_result()
    666 
    667     def select_as_coordinates(

C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in get_result(self, coordinates)
   1346                     "can only use an iterator or chunksize on a table")
   1347 
-> 1348             self.coordinates = self.s.read_coordinates(where=self.where)
   1349 
   1350             return self

C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in read_coordinates(self, where, start, stop, **kwargs)
   3545         self.selection = Selection(
   3546             self, where=where, start=start, stop=stop, **kwargs)
-> 3547         coords = self.selection.select_coords()
   3548         if self.selection.filter is not None:
   3549             for field, op, filt in self.selection.filter.format():

C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in select_coords(self)
   4507             return self.coordinates
   4508 
-> 4509         return np.arange(start, stop)
   4510 
   4511 # utilities ###

MemoryError: 

My python environment:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.3.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: x86
processor: x86 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.15.2
nose: 1.3.4
Cython: 0.22
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.0.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.1
pytz: 2015.2
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.6.7
lxml: 3.4.2
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.9
pymysql: None
psycopg2: None

Edit 1:

It took about half an hour for the MemoryError to happen after executing read_hdf(). In the meanwhile I checked taskmgr: there was little CPU activity, and total memory used never exceeded 2.2G. It was about 2.1 GB before I executed the code, so whatever pandas read_hdf() loaded into RAM is less than 100 MB. (I have 4G RAM; my 32-bit Windows system can only use 2.7G of it, and I use the rest for a RAM disk.)

Here is the hdf file info:

In [2]:
hdf = pd.HDFStore('raw_sample_storage2.h5')
hdf

Out[2]:
<class 'pandas.io.pytables.HDFStore'>
File path: C:/RawDataCollection/raw_samples/PLB_Gate/raw_sample_storage2.h5
/raw_sample_all            frame_table  (typ->appendable,nrows->308581091,ncols->10,indexers->[index],dc->[time,GW_time,node_id,X,Y,Z,status,seq,rssi,lqi])

Also, I can read part of the hdf file by specifying 'start' and 'stop' instead of 'chunksize':

%%time
df = pd.read_hdf('raw_sample_storage2.h5','raw_sample_all', start=0,stop=300000)
print df.info()
print(df.head(5))

The execution only took 4 seconds, and the output is:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 300000 entries, 0 to 49999
Data columns (total 10 columns):
time       300000 non-null datetime64[ns]
GW_time    300000 non-null uint32
node_id    300000 non-null uint8
X          300000 non-null uint16
Y          300000 non-null uint16
Z          300000 non-null uint16
status     300000 non-null uint8
seq        300000 non-null uint8
rssi       300000 non-null int8
lqi        300000 non-null uint8
dtypes: datetime64[ns](1), int8(1), uint16(3), uint32(1), uint8(4)
memory usage: 8.9 MB
None
                 time   GW_time  node_id      X      Y      Z  status  seq  \
0 2013-10-22 17:20:58  39821761        3  20010  21716  22668       0   33   
1 2013-10-22 17:20:58  39821824        4  19654  19647  19241       0   33   
2 2013-10-22 17:20:58  39821888        1  16927  21438  22722       0   34   
3 2013-10-22 17:20:58  39821952        2  17420  22882  20440       0   34   
4 2013-10-22 17:20:58  39822017        3  20010  21716  22668       0   34   

   rssi  lqi  
0   -43   49  
1   -72   47  
2   -46   48  
3   -57   46  
4   -42   50  
Wall time: 4.26 s

Noticing that 300000 rows only took 8.9 MB of RAM, I tried to use chunksize together with start and stop:

for df in pd.read_hdf('raw_sample_storage2.h5','raw_sample_all', start=0,stop=300000,chunksize = 3000):
    print df.info()
    print(df.head(5))

The same MemoryError happens.

I don't understand what's happening here. If the internal mechanism somehow ignores chunksize/start/stop and tries to load the whole thing into RAM, how come there's almost no increase in RAM usage (only 100 MB) when the MemoryError happens? And why does the execution take half an hour just to reach an error at the very beginning of the process, with no noticeable CPU usage?

Recommended Answer

So the iterator is built mainly to deal with a where clause. PyTables returns a list of the indices where the clause is True; these are row numbers. In this case there is no where clause, but we still use the indexer, which here is simply np.arange over the list of rows.

300MM rows takes 2.2GB, which is too much for 32-bit Windows (which generally maxes out around 1GB per process). On 64-bit this would be no problem.

In [1]: np.arange(0,300000000).nbytes/(1024*1024*1024.0)
Out[1]: 2.2351741790771484
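
As a back-of-the-envelope check (my own arithmetic, not part of the original answer), the same calculation applied to the asker's table of 308,581,091 rows (per the HDFStore info above) gives roughly the same figure; n_rows below is just that row count copied in by hand:

n_rows = 308581091  # rows in /raw_sample_all, from the HDFStore info above
# size of the int64 coordinate array that would be materialized for the full table
print n_rows * np.dtype(np.int64).itemsize / (1024 ** 3.0)  # ~2.30 GB, more than a 32-bit process can allocate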

So this should be handled by slicing semantics, which would make this take only a trivial amount of memory. Issue opened here.

So I would suggest this. Here the indexer is computed directly and this provides iterator semantics.

In [1]: df = DataFrame(np.random.randn(1000,2),columns=list('AB'))

In [2]: df.to_hdf('test.h5','df',mode='w',format='table',data_columns=True)

In [3]: store = pd.HDFStore('test.h5')

In [4]: nrows = store.get_storer('df').nrows

In [6]: chunksize = 100

In [7]: for i in xrange(nrows//chunksize + 1):
            chunk = store.select('df',
                                 start=i*chunksize,
                                 stop=(i+1)*chunksize)
            # work on the chunk    

In [8]: store.close()
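
If it helps, the same workaround can be wrapped in a small generator so it reads like the chunksize-based loop from the question. This is only a sketch of the idea above, not from the original answer; the helper name iter_hdf_chunks and the chunk size are my own choices:

import pandas as pd

def iter_hdf_chunks(path, key, chunksize=300000):
    """Yield DataFrame chunks by row position, avoiding the full coordinate array."""
    store = pd.HDFStore(path, mode='r')
    try:
        nrows = store.get_storer(key).nrows
        for start in xrange(0, nrows, chunksize):
            yield store.select(key, start=start, stop=start + chunksize)
    finally:
        store.close()

for df in iter_hdf_chunks('raw_sample_storage2.h5', 'raw_sample_all'):
    print df.head(1)

Each select call here only ever uses a start/stop slice, so the huge np.arange over all row numbers is never built.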
