HDFStore: table.select and RAM usage
Question
I am trying to select random rows from an HDFStore table of about 1 GB. RAM usage explodes when I ask for about 50 random rows.
I am using pandas 0.11-dev, Python 2.7, linux64.
In this first case, RAM usage matches the size of a chunk:
with pd.get_store("train.h5",'r') as train:
    for chunk in train.select('train',chunksize=50):
        pass
In this second case, it seems like the whole table is loaded into RAM:
r=random.choice(400000,size=40,replace=False)
train.select('train',pd.Term("index",r))
In this last case, RAM usage matches the equivalent chunk size:
r=random.choice(400000,size=30,replace=False)
train.select('train',pd.Term("index",r))
I am puzzled as to why moving from 30 to 40 random rows causes such a dramatic increase in RAM usage.
Note that the table was indexed when it was created, such that index=range(nrows(table)), using the following code:
def txtfile2hdfstore(infile, storefile, table_name, sep="\t", header=0, chunksize=50000 ):
    max_len, dtypes0 = txtfile2dtypes(infile, sep, header, chunksize)
    with pd.get_store( storefile,'w') as store:
        for i, chunk in enumerate(pd.read_table(infile,header=header,sep=sep,chunksize=chunksize, dtype=dict(dtypes0))):
            chunk.index= range( chunksize*(i), chunksize*(i+1))[:chunk.shape[0]]
            store.append(table_name,chunk, min_itemsize={'values':max_len})
Thanks for any insight.
EDIT, in reply to Zelazny7
Here's the file I used to write Train.csv to train.h5. I wrote it using elements of Zelazny7's code from "How to trouble-shoot HDFStore Exception: cannot find the correct atom type":
import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer

def object_max_len(x):
    if x.dtype != 'object':
        return
    else:
        return len(max(x.fillna(''), key=lambda x: len(str(x))))

def txtfile2dtypes(infile, sep="\t", header=0, chunksize=50000 ):
    max_len = pd.read_table(infile,header=header, sep=sep,nrows=5).apply( object_max_len).max()
    dtypes0 = pd.read_table(infile,header=header, sep=sep,nrows=5).dtypes
    for chunk in pd.read_table(infile,header=header, sep=sep, chunksize=chunksize):
        max_len = max((pd.DataFrame(chunk.apply( object_max_len)).max(),max_len))
        for i,k in enumerate(zip( dtypes0[:], chunk.dtypes)):
            if (k[0] != k[1]) and (k[1] == 'object'):
                dtypes0[i] = k[1]
    #as of pandas-0.11 nan requires a float64 dtype
    dtypes0.values[dtypes0 == np.int64] = np.dtype('float64')
    return max_len, dtypes0

def txtfile2hdfstore(infile, storefile, table_name, sep="\t", header=0, chunksize=50000 ):
    max_len, dtypes0 = txtfile2dtypes(infile, sep, header, chunksize)
    with pd.get_store( storefile,'w') as store:
        for i, chunk in enumerate(pd.read_table(infile,header=header,sep=sep,chunksize=chunksize, dtype=dict(dtypes0))):
            chunk.index= range( chunksize*(i), chunksize*(i+1))[:chunk.shape[0]]
            store.append(table_name,chunk, min_itemsize={'values':max_len})
Applied as:
txtfile2hdfstore('Train.csv','train.h5','train',sep=',')
Recommended answer
This is a known issue, see the reference here: https://github.com/pydata/pandas/pull/2755
Essentially, the query is turned into a numexpr expression for evaluation. There is an issue where I can't pass a lot of or conditions to numexpr (it depends on the total length of the generated expression).
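To make the length dependence concrete, here is a rough sketch of how a membership test on the index can expand into a chain of or conditions. The helper `build_or_expression` and the exact string layout are illustrative assumptions, not the expression pandas actually generates.

```python
def build_or_expression(column, values):
    """Join one equality test per value with '|' (hypothetical layout)."""
    return " | ".join("({0} == {1})".format(column, v) for v in values)


# Each extra value adds one more or condition, so the expression
# keeps growing with the number of rows requested.
thirty = build_or_expression("index", range(30))
forty = build_or_expression("index", range(40))
print(len(thirty), len(forty))
```

The point is only that the expression grows linearly with the number of requested index values, so a long enough list will eventually hit whatever limit the evaluator has.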
So I just limit the expression that we pass to numexpr. If it exceeds a certain number of or conditions, the query is done as a filter rather than as an in-kernel selection. Basically, this means the whole table is read and then reindexed.
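The memory difference between the two strategies can be illustrated with a toy model in plain Python. This is only a sketch under simplifying assumptions: rows are `(index, value)` tuples, and the function names `select_in_kernel` and `select_as_filter` are made up for illustration; neither reflects how PyTables or pandas is implemented.

```python
def select_in_kernel(rows, wanted):
    """Test rows one at a time while scanning; only matches are kept."""
    wanted = set(wanted)
    result = [row for row in rows if row[0] in wanted]
    return result, len(result)          # rows held in memory: matches only


def select_as_filter(rows, wanted):
    """Read the whole table first, then pick the wanted rows from it."""
    wanted = set(wanted)
    whole_table = list(rows)            # every row is materialized here
    result = [row for row in whole_table if row[0] in wanted]
    return result, len(whole_table)     # rows held in memory: full table


def make_table():
    # Stand-in for an on-disk table: rows are produced lazily.
    return ((i, i * i) for i in range(1000))


kernel_rows, kernel_held = select_in_kernel(make_table(), [3, 500])
filter_rows, filter_held = select_as_filter(make_table(), [3, 500])
print(kernel_held, filter_held)
```

Both functions return the same rows; what differs is how many rows have to be held at once, which mirrors the jump in RAM usage described in the question.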
This is on my enhancements list: https://github.com/pydata/pandas/issues/2391 (17).
As a workaround, just split your query into multiple smaller ones and concat the results. This should be much faster and use a constant amount of memory.
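A minimal sketch of that workaround follows. The helper names `batched` and `select_in_batches` are hypothetical, and the default batch size of 30 is simply taken from the question's observation that 30 rows behaved well while 40 did not; it is an assumption, not a documented limit.

```python
def batched(ids, batch_size):
    """Yield consecutive slices of `ids`, each at most `batch_size` long."""
    ids = list(ids)
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


def select_in_batches(select_rows, ids, batch_size=30):
    """Call `select_rows(batch)` once per batch and collect the pieces."""
    return [select_rows(batch) for batch in batched(ids, batch_size)]


# Hypothetical use with the store from the question (not run here):
#   parts = select_in_batches(
#       lambda batch: train.select('train', pd.Term("index", batch)), r)
#   result = pd.concat(parts)
```

Passing the selection as a callable keeps the batching logic independent of the store, so it can be checked without an HDF5 file; `pd.concat` then joins the partial frames into one result.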