HDFStore: table.select and RAM usage

Problem description

I am trying to select random rows from an HDFStore table of about 1 GB. RAM usage explodes when I ask for about 50 random rows.

I am using pandas 0.11-dev, python 2.7, linux64.

In this first case, the RAM usage fits the size of a chunk:

with pd.get_store("train.h5",'r') as train:
    for chunk in train.select('train',chunksize=50):
        pass

In this second case, it seems like the whole table is loaded into RAM:

r=random.choice(400000,size=40,replace=False)
train.select('train',pd.Term("index",r))

In this last case, the RAM usage fits the equivalent chunk size:

r=random.choice(400000,size=30,replace=False)    
train.select('train',pd.Term("index",r))

I am puzzled why moving from 30 to 40 random rows induces such a dramatic increase in RAM usage.

Note that the table was indexed at creation so that index=range(nrows(table)), using the following code:

def txtfile2hdfstore(infile, storefile, table_name, sep="\t", header=0, chunksize=50000 ):
    max_len, dtypes0 = txtfile2dtypes(infile, sep, header, chunksize)

    with pd.get_store( storefile,'w') as store:
        for i, chunk in enumerate(pd.read_table(infile,header=header,sep=sep,chunksize=chunksize, dtype=dict(dtypes0))):
            chunk.index= range( chunksize*(i), chunksize*(i+1))[:chunk.shape[0]]
            store.append(table_name,chunk, min_itemsize={'values':max_len})

Thanks for the insight.

Edit, in reply to Zelazny7

Here's the file I used to write Train.csv to train.h5. I wrote this using elements of Zelazny7's code from How to trouble-shoot HDFStore Exception: cannot find the correct atom type.

import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer


def object_max_len(x):
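    # Length of the longest value (as a string) in an object-dtype column,
    # with NaN treated as ''; returns None for non-object columns.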
    if x.dtype != 'object':
        return
    else:
        return len(max(x.fillna(''), key=lambda x: len(str(x))))

def txtfile2dtypes(infile, sep="\t", header=0, chunksize=50000 ):
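    # Chunked scan of the file: track the longest string value and promote a
    # column's dtype to object when a later chunk disagrees with the first rows;
    # int64 columns are widened to float64 so NaN fits (pandas 0.11).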
    max_len = pd.read_table(infile,header=header, sep=sep,nrows=5).apply( object_max_len).max()
    dtypes0 = pd.read_table(infile,header=header, sep=sep,nrows=5).dtypes

    for chunk in pd.read_table(infile,header=header, sep=sep, chunksize=chunksize):
        max_len = max((pd.DataFrame(chunk.apply( object_max_len)).max(),max_len))
        for i,k in enumerate(zip( dtypes0[:], chunk.dtypes)):
            if (k[0] != k[1]) and (k[1] == 'object'):
                dtypes0[i] = k[1]
    #as of pandas-0.11 nan requires a float64 dtype
    dtypes0.values[dtypes0 == np.int64] = np.dtype('float64')
    return max_len, dtypes0


def txtfile2hdfstore(infile, storefile, table_name, sep="\t", header=0, chunksize=50000 ):
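    # Write the file to the HDFStore chunk by chunk, assigning a sequential
    # integer index across chunks and reserving max_len bytes for string values.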
    max_len, dtypes0 = txtfile2dtypes(infile, sep, header, chunksize)

    with pd.get_store( storefile,'w') as store:
        for i, chunk in enumerate(pd.read_table(infile,header=header,sep=sep,chunksize=chunksize, dtype=dict(dtypes0))):
            chunk.index= range( chunksize*(i), chunksize*(i+1))[:chunk.shape[0]]
            store.append(table_name,chunk, min_itemsize={'values':max_len})

Applied as:

txtfile2hdfstore('Train.csv','train.h5','train',sep=',')

Recommended answer

This is a known issue; see the reference here: https://github.com/pydata/pandas/pull/2755

Essentially the query is turned into a numexpr expression for evaluation. There is an issue where I can't pass a lot of or conditions to numexpr (it's dependent on the total length of the generated expression).

So I just limit the expression that we pass to numexpr. If it exceeds a certain number of or conditions, the query is done as a filter rather than an in-kernel selection. Basically this means the table is read and then reindexed.
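
As a rough illustration of why the number of rows matters (this is only a sketch, not the expression pandas actually builds internally): selecting by a list of index values amounts to one or clause per value, so the generated expression grows with every extra row requested.

import numpy as np

# Hypothetical expansion of an index-list selection into an "or" chain, only to
# show how the expression length grows with the number of rows requested.
r = np.random.choice(400000, size=40, replace=False)
expr = " | ".join("(index == %d)" % i for i in r)
print(len(expr))  # grows roughly linearly with the number of rows requested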

This is on my enhancements list: https://github.com/pydata/pandas/issues/2391 (17).

As a workaround, just split your queries up into multiple ones and concat the results. This should be much faster and use a constant amount of memory.
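
A minimal sketch of that workaround, assuming the same train.h5 / 'train' table and the pandas-0.11-era API used above (pd.get_store, pd.Term); batch_size is just an illustrative cut-off below whatever threshold triggers the filter path:

import numpy as np
import pandas as pd

r = np.random.choice(400000, size=40, replace=False)
batch_size = 30  # keep each query small enough to stay an in-kernel selection

with pd.get_store("train.h5", 'r') as train:
    # One small select per batch of indices, then glue the pieces together.
    pieces = [train.select('train', pd.Term("index", r[i:i + batch_size]))
              for i in range(0, len(r), batch_size)]

result = pd.concat(pieces)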
