How to store a dataframe using Pandas
Problem description
Right now I'm importing a fairly large CSV as a dataframe every time I run the script. Is there a good solution for keeping that dataframe constantly available in between runs, so I don't have to spend all that time waiting for the script to run?
The easiest way is to pickle it using to_pickle:
df.to_pickle(file_name) # where to save it, usually as a .pkl
Then you can load it back using:
df = pd.read_pickle(file_name)
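As a minimal sketch of that round trip (the frame contents and the temporary file path here are illustrative, standing in for your large CSV import):

```python
import os
import tempfile

import pandas as pd

# A small stand-in for the dataframe you'd normally parse from CSV.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Pickle it once; later runs can reload it instead of re-parsing the CSV.
path = os.path.join(tempfile.gettempdir(), "df.pkl")
df.to_pickle(path)

# Reload and confirm the round trip preserved the data exactly.
df2 = pd.read_pickle(path)
assert df.equals(df2)
```

Reloading a pickle skips CSV parsing and dtype inference entirely, which is where most of the repeated start-up cost goes.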
Note: before 0.11.1, save and load were the only way to do this (they are now deprecated in favor of to_pickle and read_pickle respectively).
Another popular choice is to use HDF5 (pytables) which offers very fast access times for large datasets:
store = pd.HDFStore('store.h5')
store['df'] = df       # save it
df = store['df']       # load it
store.close()
More advanced strategies are discussed in the cookbook.
Since 0.13 there's also msgpack, which may be better for interoperability, as a faster alternative to JSON, or if you have python object/text-heavy data (see this question).