Saving in a file an array or DataFrame together with other information
Question
The statistical software Stata allows short text snippets to be saved within a dataset. This is accomplished using notes and/or characteristics.
This is a feature of great value to me as it allows me to save a variety of information, ranging from reminders and to-do lists to information about how I generated the data, or even what the estimation method for a particular variable was.
I am now trying to come up with a similar functionality in Python 3.6. So far, I have looked online and consulted a number of posts, which however do not exactly address what I want to do.
Some reference posts include:
For a small NumPy array, I have concluded that a combination of the function numpy.savez() and a dictionary can adequately store all relevant information in a single file.
For example:
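import numpy as np

a = np.array([[2, 4], [6, 8], [10, 12]])
d = {"first": 1, "second": "two", "third": 3}

# The dict is wrapped in a 0-d object array and pickled inside the .npz file
np.savez("whatever_name.npz", a=a, d=d)

# Recent NumPy versions refuse to load pickled objects unless allow_pickle=True
data = np.load("whatever_name.npz", allow_pickle=True)
arr = data['a']
dic = data['d'].tolist()  # unwraps the 0-d object array back into the dict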
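A possible variant of the snippet above, offered only as a sketch: if the dictionary holds JSON-compatible values, it can be stored as a JSON string, which NumPy saves as an ordinary 0-d string array, so no pickling is involved when loading. The file name whatever_name_json.npz is an arbitrary choice for this illustration.

```python
import json

import numpy as np

a = np.array([[2, 4], [6, 8], [10, 12]])
d = {"first": 1, "second": "two", "third": 3}

# json.dumps turns the dict into a plain string; savez stores it as a
# 0-d unicode array rather than a pickled object.
np.savez("whatever_name_json.npz", a=a, d=json.dumps(d))

# No allow_pickle needed here, because nothing in the file is pickled.
with np.load("whatever_name_json.npz") as data:
    arr = data["a"]
    dic = json.loads(data["d"].item())  # .item() yields the Python str
```

The trade-off is that only JSON-serializable values (strings, numbers, lists, nested dicts) survive the round trip.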
However, the question remains:
Are there better ways to potentially incorporate other pieces of information in a file containing a NumPy array or a (large) Pandas DataFrame?
I am particularly interested in hearing about the particular pros and cons of any suggestions you may have with examples. The fewer dependencies, the better.
Recommended answer
There are many options. I will discuss only HDF5, because I have experience using this format.
Advantages: Portable (can be read outside of Python), native compression, out-of-memory capabilities, metadata support.
Disadvantages: Reliance on a single low-level C API, possibility of data corruption because everything lives in a single file, deleting data does not reduce the file size automatically.
In my experience, for performance and portability, avoid PyTables / HDFStore to store numeric data. You can instead use the intuitive interface provided by h5py.
Store an array
import h5py
import numpy as np

arr = np.random.randint(0, 10, (1000, 1000))

f = h5py.File('file.h5', 'w', libver='latest')  # use 'latest' for performance
dset = f.create_dataset('array', shape=(1000, 1000), data=arr, chunks=(100, 100),
                        compression='gzip', compression_opts=9)
Compression & chunking
There are many compression choices, e.g. blosc and lzf are good choices for compression and decompression performance respectively. Note gzip is native; other compression filters may not ship by default with your HDF5 installation.
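As a rough illustration of switching filters, the sketch below stores the same kind of array with lzf instead of gzip. The lzf filter is bundled with h5py itself, so it should be available wherever h5py is installed, though files written with it may not be readable by other HDF5 tools that lack the filter. The file name file_lzf.h5 is made up for this example.

```python
import h5py
import numpy as np

arr = np.random.randint(0, 10, (1000, 1000))

with h5py.File("file_lzf.h5", "w") as f:
    # lzf takes no compression_opts; it trades compression ratio for speed
    dset = f.create_dataset("array", data=arr, chunks=(100, 100),
                            compression="lzf")
    stored_filter = dset.compression  # name of the filter applied to the dataset
```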
Chunking is another option which, when aligned with how you read data out-of-memory, can significantly improve performance.
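To make the alignment point concrete, here is a minimal sketch, assuming a dataset written with (100, 100) chunks as in the example above. A slice whose boundaries coincide with a chunk only requires that one chunk to be read and decompressed, whereas a slice of the same size that straddles chunk boundaries touches several. The file name file_chunks.h5 is hypothetical.

```python
import h5py
import numpy as np

arr = np.arange(1_000_000).reshape(1000, 1000)

with h5py.File("file_chunks.h5", "w") as f:
    f.create_dataset("array", data=arr, chunks=(100, 100), compression="gzip")

with h5py.File("file_chunks.h5", "r") as f:
    dset = f["array"]
    chunk_shape = dset.chunks
    # Rows 200-299 and columns 500-599 fall inside exactly one chunk
    block = dset[200:300, 500:600]
    # This slice has the same size but overlaps four neighbouring chunks
    straddling = dset[250:350, 550:650]
```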
Add some attributes
dset.attrs['Description'] = 'Some text snippet'
dset.attrs['RowIndexArray'] = np.arange(1000)
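Reading the attributes back works the same way. The following is a self-contained sketch that writes and then reopens a small file (file_attrs.h5 is an arbitrary name used only here) to show that text and array attributes both survive the round trip.

```python
import h5py
import numpy as np

with h5py.File("file_attrs.h5", "w") as f:
    dset = f.create_dataset("array", data=np.zeros((1000, 10)))
    dset.attrs["Description"] = "Some text snippet"
    dset.attrs["RowIndexArray"] = np.arange(1000)

with h5py.File("file_attrs.h5", "r") as f:
    attrs = f["array"].attrs
    names = sorted(attrs.keys())        # attribute names behave like dict keys
    description = attrs["Description"]  # the stored text
    row_index = attrs["RowIndexArray"]  # returned as a NumPy array
```

Attributes are meant for small pieces of metadata; larger arrays are usually better stored as datasets of their own.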
Store a dictionary
# d is the dictionary to store, e.g. the one defined in the question
for k, v in d.items():
    f.create_dataset('dictgroup/' + str(k), data=v)
Out-of-memory access
dictionary = f['dictgroup']
res = dictionary['my_key']
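Putting the last two steps together, a dictionary can be rebuilt from the group roughly as follows. This is a sketch under the assumption that every value is a scalar number or a string; the decoding step is there because recent h5py versions return stored strings as bytes. The file name file_dict.h5 and the variable restored are chosen only for this illustration.

```python
import h5py

d = {"first": 1, "second": "two", "third": 3}

with h5py.File("file_dict.h5", "w") as f:
    for k, v in d.items():
        f.create_dataset("dictgroup/" + str(k), data=v)

with h5py.File("file_dict.h5", "r") as f:
    restored = {}
    for k, dset in f["dictgroup"].items():
        value = dset[()]  # [()] reads the whole scalar dataset into memory
        if isinstance(value, bytes):
            value = value.decode("utf-8")  # strings may come back as bytes
        elif hasattr(value, "item"):
            value = value.item()           # NumPy scalar to plain Python value
        restored[k] = value
```

Until a dataset is indexed, only a lightweight handle is held, so the values stay on disk.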
There is no substitute for reading the documentation of h5py, which exposes most of the C API, but you should see from the above that there is a significant amount of flexibility.