What are the standard, stable file formats used in Python for Data Science?

Question

I often want to quickly save some Python data, but I also want it in a stable file format in case the data lingers for a long time. So my question is: how should I save my data?

In data science, there are three kinds of data I want to store: arbitrary Python objects, numpy arrays, and Pandas dataframes. What are the stable ways of storing these?

Recommended answer

Arbitrary Python objects can be stored in the .pkl pickle format. Pickle files carry a security risk because loading one can execute arbitrary code, but if you can trust the source of a pickle file, it is a stable format. Note that functions and classes are pickled by reference (module and name) rather than by value, so the code that defines them must still be importable when the file is loaded.

From the pickle page of the Python standard library documentation:

The pickle serialization format is guaranteed to be backwards compatible across Python releases provided a compatible pickle protocol is chosen and pickling and unpickling code deals with Python 2 to Python 3 type differences if your data is crossing that unique breaking change language boundary.
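
As a minimal sketch of what "choosing a compatible protocol" can look like in practice: the example below pins the protocol explicitly when writing. The `record` dictionary and the file name are made up for illustration, and protocol 4 is just one reasonable choice (it is readable by Python 3.4 and later).

```python
import os
import pickle
import tempfile

# Hypothetical example data: any picklable Python object would do.
record = {"name": "run-1", "scores": [0.91, 0.87], "tags": ("a", "b")}

# Write into a temporary directory so the sketch leaves no stray files.
path = os.path.join(tempfile.mkdtemp(), "record.pkl")

# Pin the protocol explicitly so older interpreters can still read the file.
# Protocol 4 is readable by Python 3.4 and later.
with open(path, "wb") as f:
    pickle.dump(record, f, protocol=4)

# Only unpickle files from a source you trust.
with open(path, "rb") as f:
    restored = pickle.load(f)
```

Unlike JSON, pickle preserves Python-specific types, so the tuple under "tags" comes back as a tuple.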

Most Python data can also be stored in the JSON format. I haven't used this format much myself, but dawg recommends it. Like the CSV and tab-delimited formats I recommend for Pandas, JSON is a very stable plain-text format.
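
A small sketch of a JSON round trip with the standard library `json` module, assuming the data consists of dicts, lists, strings, numbers, booleans, and None. The `settings` dictionary and the file name are invented for the example.

```python
import json
import os
import tempfile

# Hypothetical example data built only from JSON-compatible types.
settings = {"learning_rate": 0.01, "layers": [64, 32], "shuffle": True, "note": None}

json_path = os.path.join(tempfile.mkdtemp(), "settings.json")

# indent and sort_keys make the file easier to read and to diff.
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(settings, f, indent=2, sort_keys=True)

with open(json_path, encoding="utf-8") as f:
    loaded = json.load(f)

# The "most" matters: JSON has no tuple type and only string keys,
# so tuples come back as lists and non-string keys come back as strings.
as_json = json.loads(json.dumps({"shape": (3, 4), 1: "one"}))
```

If the exact Python types matter after loading, pickle is the safer choice; if readability from other languages matters more, JSON is.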

Numpy arrays can be stored in the .npy or .npz numpy formats. The .npy format is a very simple format that stores a single numpy array; I imagine it would be easy to read in any language. The .npz format allows multiple arrays to be stored in the same file. Adapted from the docs:

import numpy as np

x = np.arange(10)
np.save('example.npy', x)    # write a single array to disk
y = np.load('example.npy')   # read it back

If the integrity of the file being loaded is not guaranteed, be sure to load it with allow_pickle=False (the default in recent NumPy releases) to avoid arbitrary code execution.
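
For the multi-array case, here is a minimal sketch using `np.savez` and `np.load`. The array names `features` and `labels` and the file name are placeholders chosen for the example, not anything from the numpy docs.

```python
import os
import tempfile

import numpy as np

# Hypothetical example arrays.
features = np.arange(12, dtype=np.float64).reshape(3, 4)
labels = np.array([0, 1, 1])

npz_path = os.path.join(tempfile.mkdtemp(), "dataset.npz")

# Keyword names become the keys used to look the arrays up again.
np.savez(npz_path, features=features, labels=labels)

# allow_pickle=False refuses object arrays, so loading cannot run pickled code.
with np.load(npz_path, allow_pickle=False) as archive:
    names = sorted(archive.files)
    features_back = archive["features"]
    labels_back = archive["labels"]
```

`np.savez_compressed` takes the same arguments if file size matters more than write speed.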

Pandas dataframes can be stored in a variety of formats, as I wrote in a previous answer. For small datasets, I find plain-text formats such as CSV and tab-delimited files work well for most purposes. They are readable in a wide variety of languages, and I have had no issues working in a bilingual R and Python environment where both sides read from these files.

Format Type Data Description     Reader         Writer
text        CSV                  read_csv       to_csv
text        JSON                 read_json      to_json
text        HTML                 read_html      to_html
text        Local clipboard      read_clipboard to_clipboard
binary      MS Excel             read_excel     to_excel
binary      HDF5 Format          read_hdf       to_hdf
binary      Feather Format       read_feather   to_feather
binary      Parquet Format       read_parquet   to_parquet
binary      Msgpack              read_msgpack   to_msgpack   (removed in pandas 1.0)
binary      Stata                read_stata     to_stata
binary      SAS                  read_sas    
binary      Python Pickle Format read_pickle    to_pickle
SQL         SQL                  read_sql       to_sql
SQL         Google Big Query     read_gbq       to_gbq

When writing CSV and tab-delimited files from Pandas, I often pass index=False to avoid saving the index, which otherwise loads back as an oddly named column.
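
To make the index behaviour concrete, here is a short sketch with a made-up three-row dataframe. It writes the same data once with the default settings and once with index=False, then reads both back.

```python
import os
import tempfile

import pandas as pd

# Hypothetical example dataframe.
df = pd.DataFrame({"sample": ["a", "b", "c"], "value": [1.5, 2.0, 3.25]})
tmp = tempfile.mkdtemp()

# Default: the index is written as an unnamed first column...
with_index = os.path.join(tmp, "with_index.csv")
df.to_csv(with_index)
# ...which read_csv then loads as a column called "Unnamed: 0".
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False drops the index; sep="\t" gives a tab-delimited file.
without_index = os.path.join(tmp, "without_index.tsv")
df.to_csv(without_index, sep="\t", index=False)
round_trip = pd.read_csv(without_index, sep="\t")
```

If the index does carry information, the alternative is to keep it and pass index_col=0 to read_csv when loading.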
