Writing a formatted binary file from a Pandas DataFrame


Question

I've seen some ways to read a formatted binary file into Pandas from Python. Namely, I'm using this code, which reads the file with NumPy's fromfile, using a structure given as a dtype.

import numpy as np
import pandas as pd

input_file_name = 'test.hst'

# Layout of the file header (96 bytes in total)
dt_header = np.dtype([('version', 'i4'),
                      ('copyright', 'S64'),
                      ('symbol', 'S12'),
                      ('period', 'i4'),
                      ('digits', 'i4'),
                      ('timesign', 'i4'),
                      ('last_sync', 'i4')])

# Layout of each record that follows the header
dt_records = np.dtype([('ctm', 'i4'),
                       ('open', 'f8'),
                       ('low', 'f8'),
                       ('high', 'f8'),
                       ('close', 'f8'),
                       ('volume', 'f8')])

# The with block closes the file even if reading fails
with open(input_file_name, 'rb') as input_file:
    # np.frombuffer replaces the deprecated np.fromstring for binary data
    header = np.frombuffer(input_file.read(dt_header.itemsize), dtype=dt_header)
    records = np.fromfile(input_file, dtype=dt_records)

df_records = pd.DataFrame(records)
# Now, do some changes in the individual values of df_records
# and then write it back to a binary file

Now, my issue is how to write this back to a new file. I can't find any function in NumPy (or in Pandas) that lets me specify exactly which bytes to write for each field.

Recommended answer

Pandas now offers a wide variety of formats that are more stable than tofile(). tofile() writes raw bytes without recording the dtype or byte order, so it is best for quick file storage where you do not expect the file to be read on a different machine whose endianness (big- or little-endian) may differ.

Format type  Data description      Reader          Writer
text         CSV                   read_csv        to_csv
text         JSON                  read_json       to_json
text         HTML                  read_html       to_html
text         Local clipboard       read_clipboard  to_clipboard
binary       MS Excel              read_excel      to_excel
binary       HDF5 Format           read_hdf        to_hdf
binary       Feather Format        read_feather    to_feather
binary       Parquet Format        read_parquet    to_parquet
binary       Msgpack               read_msgpack    to_msgpack    (removed in pandas 1.0)
binary       Stata                 read_stata      to_stata
binary       SAS                   read_sas        (none)
binary       Python Pickle Format  read_pickle     to_pickle
SQL          SQL                   read_sql        to_sql
SQL          Google Big Query      read_gbq        to_gbq
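
If the goal is still to write back the exact record layout from the question, one possible approach is to pack the DataFrame columns into a NumPy structured array and write its raw bytes. This is only a sketch under assumptions: the names DT_RECORDS, frame_to_bytes and write_hst are made up for illustration, and the explicit little-endian ('<') byte order is an assumption about the target file, not something stated in the question.

```python
import numpy as np
import pandas as pd

# Same field layout as in the question. The '<' prefix (an assumption)
# pins the byte order so the output does not depend on the machine.
DT_RECORDS = np.dtype([('ctm', '<i4'),
                       ('open', '<f8'),
                       ('low', '<f8'),
                       ('high', '<f8'),
                       ('close', '<f8'),
                       ('volume', '<f8')])


def frame_to_bytes(df, dtype=DT_RECORDS):
    """Pack the DataFrame columns named in `dtype` into raw record bytes."""
    out = np.empty(len(df), dtype=dtype)
    for name in dtype.names:
        # Assignment casts each column to the field's exact type and size.
        out[name] = df[name].to_numpy()
    return out.tobytes()


def write_hst(path, header_bytes, df, dtype=DT_RECORDS):
    """Write the unchanged header bytes followed by the packed records."""
    with open(path, 'wb') as f:
        f.write(header_bytes)
        f.write(frame_to_bytes(df, dtype))
```

With the reading code from the question, a call might look like write_hst('new.hst', header.tobytes(), df_records). Each record takes DT_RECORDS.itemsize bytes (44 here), so the field widths in the file are exactly those declared in the dtype.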

For small to medium-sized files, I prefer CSV: properly formatted CSV can store arbitrary string data, is human-readable, and is about as simple as a format can be while achieving those two goals.
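
As a small illustration of the CSV point, the sketch below round-trips a frame whose string column contains commas and quotes. The column names and values are invented for the example, and an in-memory buffer stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical data: the 'note' column holds a comma and embedded quotes.
df = pd.DataFrame({'ctm': [1, 2],
                   'open': [1.5, 2.5],
                   'note': ['a,b', 'say "hi"']})

# Write to an in-memory text buffer instead of a file on disk.
buf = io.StringIO()
df.to_csv(buf, index=False)

# Read the same text back; to_csv quoted the awkward strings for us.
restored = pd.read_csv(io.StringIO(buf.getvalue()))
```

Note that CSV does not store column types, so read_csv infers them again when the file is read.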

At one time I used HDF5, but if I were on Amazon, I would consider using Parquet.

Example using to_hdf:

df.to_hdf('tmp.hdf', key='df', mode='w')
df2 = pd.read_hdf('tmp.hdf', key='df')

I no longer favor the HDF5 format. Because it is fairly complex, it carries serious risks for long-term archival: the specification runs to 150 pages, and there is only one implementation, about 300,000 lines of C.

In contrast, as long as you are working exclusively in Python, the pickle format claims long-term stability:

The pickle serialization format is guaranteed to be backwards compatible across Python releases provided a compatible pickle protocol is chosen and pickling and unpickling code deals with Python 2 to Python 3 type differences if your data is crossing that unique breaking change language boundary.

However, unpickling can execute arbitrary code, so be careful with pickles of unknown origin.
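
A pickle round trip with an explicitly chosen protocol might look like the sketch below. The file name, the sample data and the choice of protocol 4 are assumptions made for illustration; the file is one we create ourselves, in line with the caution above.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({'ctm': [1, 2], 'close': [1.25, 2.5]})

# Write to a scratch directory so the example leaves the working dir alone.
path = os.path.join(tempfile.mkdtemp(), 'records.pkl')

# Naming the protocol makes the compatibility choice explicit.
df.to_pickle(path, protocol=4)

# Only unpickle files you trust: loading a pickle can run arbitrary code.
restored = pd.read_pickle(path)
```

Unlike CSV, the pickle keeps the column dtypes and the index exactly as they were.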
