Fast reading of specified columns in df using pandas.to_hdf

Question

I have a 2 GB dataframe that is written once and read many times. I would like to keep working with it in pandas, so I was using df.to_hdf and pd.read_hdf with the fixed format, which works pretty well for reading and writing.

However, the df is growing as more columns are added, so I would like to switch to the table format instead, which lets me select only the columns I need when reading the data. I thought this would give me a speed advantage, but from testing that doesn't seem to be the case.

This example:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10000000,9),columns=list('ABCDEFGHI'))
%time df.to_hdf("temp.h5", "temp", format ="fixed", mode="w")
%time df.to_hdf("temp2.h5", "temp2", format="table", mode="w")

shows that the fixed format is slightly faster to write (6.8 s vs 5.9 s on my machine).

Then reading the data (after a short pause to make sure the files have been fully saved):

%time x = pd.read_hdf("temp.h5", "temp")
%time y = pd.read_hdf("temp2.h5", "temp2")
%time z = pd.read_hdf("temp2.h5", "temp2", columns=list("ABC"))

yields:

Wall time: 420 ms (fixed)
Wall time: 557 ms (table)
Wall time: 671 ms (table, specified columns)

I do understand that the fixed format is faster at reading the data, but why is reading the df with specified columns slower than reading the full dataframe? What is the benefit of the table format (with or without specified columns) over the fixed format?

Is there maybe a memory advantage when the df grows even bigger?

Answer

IMO the main advantage of using format='table' in conjunction with data_columns=[list_of_indexed_columns] is the ability to read huge HDF5 files conditionally (see the where="<where clause>" parameter). That way you can filter your data while reading, and process it in chunks to avoid a MemoryError.
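A minimal sketch of that pattern, reusing the file and key names from the question (the choice of indexed columns and the where clause are assumptions for illustration, not from the question):

# Hypothetical sketch: index columns A and B on write, then filter on read.
df.to_hdf("temp2.h5", "temp2", format="table", data_columns=["A", "B"], mode="w")
# Pull only the matching rows, and only columns A-C, into memory.
subset = pd.read_hdf("temp2.h5", "temp2", where="A > 0 & B < 0", columns=list("ABC"))
# Or iterate over the file in chunks to keep memory usage bounded.
for chunk in pd.read_hdf("temp2.h5", "temp2", chunksize=1000000):
    pass  # process each chunk here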

You can also try saving single columns or column groups (those that will usually be read together) in different HDF files, or in the same file under different keys.
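For example, a minimal sketch of the same-file, different-keys variant (the split into an "ABC" group and a "DEFGHI" group is an assumption for illustration):

# Hypothetical sketch: store column groups under separate keys in one HDF5 file,
# so a reader only touches the group it actually needs.
df[list("ABC")].to_hdf("split.h5", "abc", format="table", mode="w")
df[list("DEFGHI")].to_hdf("split.h5", "defghi", format="table", mode="a")
abc = pd.read_hdf("split.h5", "abc")  # loads only the ABC group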

I'd also consider using a "cutting-edge" technology: Feather-Format.

Testing and timing:

import feather

Writing to disk in three formats (HDF5 fixed, HDF5 table, Feather):

df = pd.DataFrame(np.random.randn(10000000, 9), columns=list('ABCDEFGHI'))
df.to_hdf('c:/temp/fixed.h5', 'temp', format='f', mode='w')   # 'f' is shorthand for 'fixed'
df.to_hdf('c:/temp/tab.h5', 'temp', format='t', mode='w')     # 't' is shorthand for 'table'
feather.write_dataframe(df, 'c:/temp/df.feather')

Reading from disk:

In [122]: %timeit pd.read_hdf(r'C:\Temp\fixed.h5', "temp")
1 loop, best of 3: 409 ms per loop

In [123]: %timeit pd.read_hdf(r'C:\Temp\tab.h5', "temp")
1 loop, best of 3: 558 ms per loop

In [124]: %timeit pd.read_hdf(r'C:\Temp\tab.h5', "temp", columns=list('BDF'))
The slowest run took 4.60 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 689 ms per loop

In [125]: %timeit feather.read_dataframe('c:/temp/df.feather')
The slowest run took 6.92 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 644 ms per loop

In [126]: %timeit feather.read_dataframe('c:/temp/df.feather', columns=list('BDF'))
1 loop, best of 3: 218 ms per loop  # WINNER !!!

PS: if you encounter the following error when using feather.write_dataframe(...):

FeatherError: Invalid: no support for strided data yet 

here is a workaround:

df = df.copy()

After that, feather.write_dataframe(df, path) should work properly...
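As a rough sketch of why the copy() helps (the row slicing below is just one assumed way to end up with strided data; copy() materializes a contiguous block that feather can serialize):

# Hypothetical sketch: a row-sliced frame can be backed by strided
# (non-contiguous) data, which this feather version cannot write;
# copy() makes the underlying data contiguous again.
df_view = df.iloc[::2]
feather.write_dataframe(df_view.copy(), 'c:/temp/df_view.feather')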
