What are the standard, stable file formats used in Python for Data Science?
Question
I often want to quickly save some Python data, but I would also like to save it in a stable file format in case the data lingers for a long time. So my question is: how should I save my data?
In data science, there are three kinds of data I want to store: arbitrary Python objects, numpy arrays, and Pandas dataframes. What are the stable ways of storing these?
Recommended answer
Arbitrary Python objects can be stored in the .pkl pickle format (functions and classes are pickled by reference to their importable name, not by value). While pickle files have security concerns because loading them can execute arbitrary code, if you can trust the source of a pickle file, it is a stable format.
The pickle serialization format is guaranteed to be backwards compatible across Python releases, provided a compatible pickle protocol is chosen. If your data crosses the Python 2 to Python 3 boundary, the pickling and unpickling code must also deal with the type differences between the two languages.
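As a minimal sketch of the protocol choice described above, the following pins an explicit pickle protocol when writing. The `record` dictionary and the file name are made up for illustration; protocol 4 is used here on the assumption that the file only needs to be read by Python 3.4 or later.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical example data; any picklable object would work here.
record = {"name": "experiment-1", "scores": [0.91, 0.87], "tags": ("a", "b")}

# Write into a temporary directory so the sketch leaves no stray files behind.
path = Path(tempfile.mkdtemp()) / "record.pkl"

# Pin the protocol explicitly so that older interpreters can still read the file.
# Protocol 4 is readable by Python 3.4 and later.
with open(path, "wb") as f:
    pickle.dump(record, f, protocol=4)

# Only unpickle files that come from a trusted source.
with open(path, "rb") as f:
    restored = pickle.load(f)
```

Unlike plain-text formats, pickle preserves Python-specific types, so the tuple in `record` comes back as a tuple.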
Most Python data can also be stored in the JSON format. I haven't used this format much myself, but dawg recommends it. Like the CSV and tab-delimited formats I recommend for Pandas, JSON is a plain-text format that is very stable.
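A short sketch of a JSON round trip, using a made-up `record` dictionary, may help show why the word "most" matters: JSON only has objects, arrays, strings, numbers, booleans, and null, so some Python types change on the way back.

```python
import json

# Hypothetical example data containing a tuple, which JSON cannot represent directly.
record = {"name": "experiment-1", "scores": [0.91, 0.87], "tags": ("a", "b")}

# Serialize to human-readable text; sort_keys keeps the output stable across runs.
text = json.dumps(record, indent=2, sort_keys=True)

# Parse the text back into Python objects.
restored = json.loads(text)

# The tuple was written as a JSON array, so it is restored as a list.
tags_type = type(restored["tags"]).__name__
```

If exact type fidelity matters, pickle is the safer choice; if long-term readability from other languages matters, JSON is.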
Numpy arrays can be stored in the .npy or .npz numpy formats. The npy format is a very simple format that stores a single numpy array. I imagine it would be easy to read this format in any language. The npz format allows the storing of multiple arrays in the same file. Adapted from the docs:
import numpy as np

x = np.arange(10)
np.save('example.npy', x)   # write a single array
y = np.load('example.npy')  # read it back
If the integrity of the file being loaded is not guaranteed, be sure to use allow_pickle=False (the default since NumPy 1.16.3) to avoid arbitrary code execution.
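The multi-array npz case mentioned above could look roughly like this. The array names `x` and `y` and the file name are arbitrary choices for the sketch; `allow_pickle=False` is passed explicitly to make the intent clear even where it is already the default.

```python
import tempfile
from pathlib import Path

import numpy as np

# Write into a temporary directory so the sketch leaves no stray files behind.
path = str(Path(tempfile.mkdtemp()) / "example.npz")

x = np.arange(10)
y = np.linspace(0.0, 1.0, 5)

# Each keyword argument becomes a named array inside the archive.
np.savez(path, x=x, y=y)

# Refuse pickled object arrays when loading; plain numeric arrays load fine.
with np.load(path, allow_pickle=False) as data:
    names = sorted(data.files)
    x_loaded = data["x"]
    y_loaded = data["y"]
```

`np.savez_compressed` takes the same arguments and produces a smaller file at the cost of slower reads and writes.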
Pandas dataframes can be stored in a wide variety of formats, as I wrote in a previous answer. For small datasets, I find plain-text file formats such as CSV and tab-delimited to work well for most purposes. These formats are readable in many languages, and I have had no issues working in a bilingual R and Python environment where both sides read from these files.
Format Type  Data Description      Reader          Writer
text         CSV                   read_csv        to_csv
text         JSON                  read_json       to_json
text         HTML                  read_html       to_html
text         Local clipboard       read_clipboard  to_clipboard
binary       MS Excel              read_excel      to_excel
binary       HDF5 Format           read_hdf        to_hdf
binary       Feather Format        read_feather    to_feather
binary       Parquet Format        read_parquet    to_parquet
binary       Msgpack               read_msgpack    to_msgpack
binary       Stata                 read_stata      to_stata
binary       SAS                   read_sas
binary       Python Pickle Format  read_pickle     to_pickle
SQL          SQL                   read_sql        to_sql
SQL          Google Big Query      read_gbq        to_gbq

Note that read_msgpack and to_msgpack were removed in pandas 1.0, so the Msgpack row applies only to older versions.
When writing CSV and tab-delimited files from pandas, I often use the index=False option to avoid saving the index, which otherwise loads back as an oddly named column (Unnamed: 0) by default.
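The effect of index=False can be seen in a small round trip. This sketch uses a made-up two-column dataframe and writes to an in-memory string rather than a file, which is an assumption made only to keep the example self-contained.

```python
import io

import pandas as pd

# Hypothetical example dataframe with the default integer index.
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# With no path argument, to_csv returns the CSV text as a string.
with_index = pd.read_csv(io.StringIO(df.to_csv()))
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(with_index.columns))     # ['Unnamed: 0', 'a', 'b']
print(list(without_index.columns))  # ['a', 'b']

# The same option applies to tab-delimited output via sep="\t".
tsv_text = df.to_csv(index=False, sep="\t")
```

If the index does carry meaning, an alternative is to keep it and pass index_col=0 to read_csv when loading.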