Save additional attributes in Pandas Dataframe

Problem description

I recall from my MatLab days using structured arrays wherein you could store different data as an attribute of the main structure. Something like:

a = {}
a.A = magic(10)
a.B = magic(50); etc.

where a.A and a.B are completely separate from each other allowing you to store different types within a and operate on them as desired. Pandas allows us to do something similar, but not quite the same.

I am using Pandas and want to store attributes of a dataframe without actually putting it within a dataframe. This can be done via:

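import numpy as np
import pandas as pd

# pd.np has been removed from recent pandas releases, so NumPy is imported directly
a = pd.DataFrame(data=np.random.randint(0, 100, (10, 5)), columns=list('ABCED'))

# now store an attribute of <a>
a.local_tz = 'US/Eastern'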

Now, the local timezone is stored in a, but I cannot save this attribute when I save the dataframe (i.e. after re-loading a there is no a.local_tz). Is there a way to save these attributes?

Currently, I am just making new columns in the dataframe to hold information like timezone, latitude, longitude, etc., but this seems to be a bit of a waste. Further, when I do analysis on the data I run into the problem of having to exclude these extra columns.

################## BEGIN EDIT ##################

Using unutbu's advice, I now store the data in h5 format. As mentioned, loading metadata back in as attributes of the dataframe is risky. However, since I am the creator of these files (and the processing algorithms) I can choose what is stored as metadata and what is not. When processing the data that will go into the h5 files, I choose to store the metadata in a dictionary that is initialized as an attribute of my classes. I made a simple IO class to import the h5 data, and made the metadata as class attributes. Now I can work on my dataframes without risk of losing the metadata.

import pandas as pd
from pytz import utc  # used in h5load to re-localize the 'DATE' index

class IO():
    def __init__(self):
        self.dtfrmt = 'dummy_str'

    def h5load(self,filename,update=False):
        '''h5load loads the stored HDF5 file.  Both the dataframe (actual data) and 
        the associated metadata are stored in the H5file

        NOTE: This does not load "any" H5 
        file, it loads H5 files specifically created to hold dataframe data and 
        metadata.

        When multi-indexed dataframes are stored in the H5 format the date 
        values (previously initialized with timezone information) lose their
        timezone localization.  Therefore, <h5load> re-localizes the 'DATE' 
        index as UTC.

        Parameters
        ----------
        filename : string/path
            path and filename of H5 file to be loaded.  H5 file must have been 
            created using <h5store> below.

        update : boolean True/False
            default: False
            If the selected dataframe is to be updated then it is imported 
            slightly different.  If update==True, the <metadata> attribute is
            returned as a dictionary and <data> is returned as a dataframe 
            (i.e., as a stand-alone dictionary with no attributes, and NOT an 
            instance of the IO() class).  Otherwise, if False, <metadata> is 
            returned as an attribute of the class instance.

        Output
        ------
        data : Pandas dataframe with attributes
            The dataframe contains only the data as collected by the instrument.  
            Any metadata (e.g. timezone, scaling factor, basically anything that
            is constant throughout the file) is stored as an attribute (e.g. lat 
            is stored as <data.lat>).'''

        with pd.HDFStore(filename,'r') as store:
            self.data = store['mydata']
            self.metadata = store.get_storer('mydata').attrs.metadata    # metadata gets stored as attributes, so no need to make <metadata> an attribute of <self>

            # put metadata into <data> dataframe as attributes
            for r in self.metadata:
                setattr(self,r,self.metadata[r])

        # unscale data (<unscale> is a helper that is not shown in this post)
        self.data, self.metadata = unscale(self.data,self.metadata,stringcols=['routine','date'])

        # when pandas stores multi-index dataframes as H5 files the timezone
        # initialization is lost.  Remake index with timezone initialized: only
        # for multi-indexed dataframes
        if isinstance(self.data.index, pd.MultiIndex):
            # list index-level names, and identify 'DATE' level
            namen = self.data.index.names
            date_lev = namen.index('DATE')

            # extract index as list and remake tuples with timezone initialized
            new_index = self.data.index.tolist()
            for r in range(len(new_index)):
                tmp = list( new_index[r] )
                tmp[date_lev] = utc.localize( tmp[date_lev] )

                new_index[r] = tuple(tmp)

            # reset multi-index
            self.data.index = pd.MultiIndex.from_tuples( new_index, names=namen )


        if update:
            return self.metadata, self.data
        else:
            return self





    def h5store(self,data, filename, **kwargs):
        '''h5store stores the dataframe as an HDF5 file.  Both the dataframe 
        (actual data) and the associated metadata are stored in the H5file

        Parameters
        ----------
        data : Pandas dataframe NOT a class instance
            Must be a dataframe, not a class instance (i.e. cannot be an instance 
            named <data> that has an attribute named <data> (e.g. the Pandas 
            data frame is stored in data.data)).  If the dataframe is under
            data.data then the input variable must be data.data.

        filename : string/path
            path and filename of the H5 file to be written.  An existing file 
            at this path is overwritten.

        **kwargs : dictionary
            dictionary containing metadata information.


        Output
        ------
        None: only saves data to file'''

        with pd.HDFStore(filename,'w') as store:
            store.put('mydata',data)
            store.get_storer('mydata').attrs.metadata = kwargs

H5 files are then loaded via data = IO().h5load('filename.h5'). The dataframe is stored under data.data. I retain the metadata dictionary under data.metadata and have created individual metadata attributes (e.g. data.lat created from data.metadata['lat']).

My index time stamps are localized to pytz.utc. However, when a multi-indexed dataframe is stored to h5 the timezone localization is lost (using Pandas 0.15.2), so I correct for this in IO().h5load.
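
The re-localization step can also be written without looping over tuples. The following is only a sketch under assumptions: the two-level index, the 'routine' level values, and the helper name relocalize_date_level are hypothetical, and it presumes the 'DATE' level comes back timezone-naive as described above.

```python
import pandas as pd

# Hypothetical two-level index: a timezone-naive 'DATE' level plus a 'routine' level.
dates = pd.date_range("2015-01-01", periods=3, freq="D")
index = pd.MultiIndex.from_product([dates, ["a", "b"]], names=["DATE", "routine"])
df = pd.DataFrame({"value": range(len(index))}, index=index)


def relocalize_date_level(frame, level="DATE", tz="UTC"):
    """Return a copy of `frame` whose `level` index level is localized to `tz`."""
    pos = frame.index.names.index(level)
    # Localize the unique level values once instead of rebuilding every tuple.
    localized = frame.index.levels[pos].tz_localize(tz)
    out = frame.copy()
    out.index = frame.index.set_levels(localized, level=level)
    return out


fixed = relocalize_date_level(df)
print(fixed.index.get_level_values("DATE").tz)
```

Because only the level values are replaced, the row order and the other index levels stay as they were.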

Recommended answer

There is an open issue regarding the storage of custom metadata in NDFrames. But due to the multitudinous ways pandas functions may return DataFrames, the _metadata attribute is not (yet) preserved in all situations.
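
A small illustration of the problem, offered as a sketch rather than a definitive account: an ad-hoc attribute lives only on the original Python object, so a new DataFrame returned by a pandas method does not carry it. Newer pandas releases (1.0 and later) also provide DataFrame.attrs, which the pandas documentation marks as experimental; whether it fits your case is an assumption to check against your pandas version.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})

# Ad-hoc attribute, as in the question: stored on this object only.
df.local_tz = "US/Eastern"
copied = df.copy()
adhoc_survives = hasattr(copied, "local_tz")

# DataFrame.attrs (pandas >= 1.0, documented as experimental) is a dict
# that pandas carries over to the copy.
df.attrs["local_tz"] = "US/Eastern"
attrs_after_copy = df.copy().attrs

print(adhoc_survives)
print(attrs_after_copy)
```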

For the time being, you'll just have to store the metadata in an auxiliary variable.
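
One way to keep such an auxiliary variable next to the data is a small container object. This is a minimal sketch, not part of pandas: the Dataset class, its apply method, and the lat value are hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass, field

import pandas as pd


@dataclass(eq=False)  # eq=False: comparing DataFrames with == is not a plain bool
class Dataset:
    """Hypothetical container that keeps a frame and its metadata side by side."""
    data: pd.DataFrame
    metadata: dict = field(default_factory=dict)

    def apply(self, func):
        """Run `func` on the frame and carry a copy of the metadata to the result."""
        return Dataset(func(self.data), dict(self.metadata))


ds = Dataset(pd.DataFrame({"A": [1, 2, 3]}), {"local_tz": "US/Eastern", "lat": 40.7})
doubled = ds.apply(lambda frame: frame * 2)
print(doubled.metadata)
```

The frame itself stays free of extra columns, so analysis code can work on ds.data without having to exclude anything.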

There are multiple options for storing DataFrames + metadata to files, depending on what format you wish to use -- pickle, JSON, HDF5 are all possibilities.
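
Of these formats, pickle is probably the shortest to sketch. The example below assumes a temporary file location and a made-up bundle layout with the keys 'data' and 'metadata'; neither is a pandas convention.

```python
import os
import pickle
import tempfile

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, 5.0, 6.0]})
metadata = {"local_tz": "US/Eastern"}

# Hypothetical location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "data.pkl")

# Store the frame and its metadata together in one pickled dictionary.
with open(path, "wb") as fh:
    pickle.dump({"data": df, "metadata": metadata}, fh)

# Load both back in a single step.
with open(path, "rb") as fh:
    bundle = pickle.load(fh)

print(bundle["metadata"])
print(bundle["data"].equals(df))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.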

Here is how you could store and load a DataFrame with metadata using HDF5. The recipe for storing the metadata comes from the Pandas Cookbook.

import numpy as np
import pandas as pd

def h5store(filename, df, **kwargs):
    store = pd.HDFStore(filename)
    store.put('mydata', df)
    store.get_storer('mydata').attrs.metadata = kwargs
    store.close()

def h5load(store):
    data = store['mydata']
    metadata = store.get_storer('mydata').attrs.metadata
    return data, metadata

# np is used directly because pd.np has been removed from recent pandas releases
a = pd.DataFrame(
    data=np.random.randint(0, 100, (10, 5)), columns=list('ABCED'))

filename = '/tmp/data.h5'
metadata = dict(local_tz='US/Eastern')
h5store(filename, a, **metadata)
with pd.HDFStore(filename) as store:
    data, metadata = h5load(store)

print(data)
#     A   B   C   E   D
# 0   9  20  92  43  25
# 1   2  64  54   0  63
# 2  22  42   3  83  81
# 3   3  71  17  64  53
# 4  52  10  41  22  43
# 5  48  85  96  72  88
# 6  10  47   2  10  78
# 7  30  80   3  59  16
# 8  13  52  98  79  65
# 9   6  93  55  40   3


print(metadata)

yields

{'local_tz': 'US/Eastern'}
