How to store a dataframe using Pandas


Question

Right now I'm importing a fairly large CSV as a dataframe every time I run the script. Is there a good solution for keeping that dataframe constantly available in between runs so I don't have to spend all that time waiting for the script to run?

Solution

The easiest way is to pickle it using to_pickle:

df.to_pickle(file_name)  # where to save it, usually as a .pkl

Then you can load it back using:

df = pd.read_pickle(file_name)

Note: before 0.11.1 save and load were the only way to do this (they are now deprecated in favor of to_pickle and read_pickle respectively).
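Putting the two together: since the question is about avoiding the CSV parse on every run, a minimal caching sketch might look like the following (the file names and the parse step here are illustrative, not part of the original answer):

import os
import pandas as pd

CSV_PATH = "big_data.csv"    # hypothetical source file
CACHE_PATH = "big_data.pkl"  # pickled cache written next to it

if os.path.exists(CACHE_PATH):
    df = pd.read_pickle(CACHE_PATH)   # fast path: reuse the cached frame
else:
    df = pd.read_csv(CSV_PATH)        # slow path: parse the CSV once
    df.to_pickle(CACHE_PATH)          # cache it for subsequent runs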


Another popular choice is to use HDF5 (pytables) which offers very fast access times for large datasets:

store = pd.HDFStore('store.h5')  # requires the PyTables package

store['df'] = df   # save it
df = store['df']   # load it back
store.close()      # flush and release the file handle

More advanced strategies are discussed in the cookbook.
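As one taste of what those strategies enable, here is a hedged sketch (file and key names are illustrative): writing in 'table' format lets you query row subsets from disk without loading the whole frame:

import pandas as pd

df = pd.DataFrame({"a": range(10), "b": range(10)})

# 'table' format is slower to write than the default 'fixed',
# but columns listed in data_columns can be queried on disk
df.to_hdf("store.h5", key="df", format="table", data_columns=["a"])

subset = pd.read_hdf("store.h5", key="df", where="a > 5")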


Since 0.13 there's also msgpack, which may be better for interoperability, as a faster alternative to JSON, or if you have python object/text-heavy data (see this question).
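For completeness, a sketch of that API; note that msgpack support was deprecated in pandas 0.25 and removed in 1.0, so this only runs on versions between 0.13 and 1.0:

df.to_msgpack('frame.msg')         # serialize the frame
df = pd.read_msgpack('frame.msg')  # read it back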
