Propagate pandas series metadata through joins

Question

I'd like to be able to attach metadata to the series of dataframes (specifically, the original filename), so that after joining two dataframes I can see metadata on where each of the series came from.

I see GitHub issues regarding _metadata (here, here), including some relating to the current _metadata attribute (here), but nothing in the pandas docs.

So far I can modify the _metadata attribute to supposedly allow preservation of metadata, but get an AttributeError after the join.

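import numpy as np
import pandas as pd

# Two small frames of random integers with matching column labels 0, 1, 2
df1 = pd.DataFrame(np.random.randint(0, 4, (6, 3)))
df2 = pd.DataFrame(np.random.randint(0, 4, (6, 3)))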
df1._metadata.append('filename')
df1[df1.columns[0]]._metadata.append('filename')

for c in df1:
    df1[c].filename = 'fname1.csv'
    df2[c].filename = 'fname2.csv'

df1[0]._metadata  # ['name', 'filename']
df1[0].filename  # fname1.csv
df2[0].filename  # fname2.csv
df1[0][:3].filename  # fname1.csv

mgd = pd.merge(df1, df2, on=[0])
mgd['1_x']._metadata  # ['name', 'filename']
mgd['1_x'].filename  # raises AttributeError

Any way to preserve this?

Update: epilogue

As discussed here, __finalize__ cannot keep track of Series that are members of a dataframe, only independent series. So for now I'll keep track of the Series-level metadata by maintaining a dictionary of metadata attached to the dataframes. My code looks like:

def cust_merge(d1, d2):
    "Custom merge function for 2 dicts"
    ...

def finalize_df(self, other, method=None, **kwargs):
    for name in self._metadata:
        if method == 'merge':
            lmeta = getattr(other.left, name, {})
            rmeta = getattr(other.right, name, {})
            newmeta = cust_merge(lmeta, rmeta)
            object.__setattr__(self, name, newmeta)
        else:
            object.__setattr__(self, name, getattr(other, name, None))
    return self

df1.filenames = {c: 'fname1.csv' for c in df1}
df2.filenames = {c: 'fname2.csv' for c in df2}
pd.DataFrame._metadata = ['filenames']
pd.DataFrame.__finalize__ = finalize_df
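
The body of cust_merge is not shown above. As a rough illustration only, one way to combine two per-column filename dictionaries for a merge might look like the sketch below. The helper name merge_filenames, its parameters, and the choice to record both sources for key columns are assumptions made for this example, not the author's code. It mimics the way pd.merge appends the '_x' and '_y' suffixes to overlapping non-key columns, and uses plain dicts so it runs without pandas.

```python
def merge_filenames(left_meta, right_meta, on, suffixes=("_x", "_y")):
    """Hypothetical helper: combine per-column filename dicts for a merge on `on`."""
    keys = set(on)
    # Non-key columns present on both sides are the ones pd.merge renames.
    overlap = (set(left_meta) & set(right_meta)) - keys

    merged = {}
    for col in keys:
        # Assumption: a key column comes from both inputs, so keep both sources.
        merged[col] = (left_meta.get(col), right_meta.get(col))

    for meta, suffix in ((left_meta, suffixes[0]), (right_meta, suffixes[1])):
        for col, fname in meta.items():
            if col in keys:
                continue
            # Overlapping labels get the suffix appended to their string form.
            new_col = f"{col}{suffix}" if col in overlap else col
            merged[new_col] = fname
    return merged


# Same shape as the question: columns 0, 1, 2 in both frames, merged on column 0.
left = {c: "fname1.csv" for c in range(3)}
right = {c: "fname2.csv" for c in range(3)}
merged = merge_filenames(left, right, on=[0])
```

With these inputs, merged maps '1_x' and '2_x' to the left file, '1_y' and '2_y' to the right file, and the key column 0 to a pair of both filenames.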

Recommended answer

I think something like this will work. If not, please file a bug report: while this is supported, it is a bit bleeding edge; in other words, it IS possible that the join methods don't call this all the time. That is a bit untested.

See this issue for a more detailed example/bug fix.

from pandas import DataFrame

DataFrame._metadata = ['name','filename']


def __finalize__(self, other, method=None, **kwargs):
    """
    propagate metadata from other to self

    Parameters
    ----------
    other : the object from which to get the attributes that we are going
        to propagate
    method : optional, a passed method name ; possibly to take different
        types of propagation actions based on this

    """

    ### you need to arbitrate when there are conflicts

    for name in self._metadata:
        object.__setattr__(self, name, getattr(other, name, None))
    return self

# Install the custom finalizer (module level, outside the function body)
DataFrame.__finalize__ = __finalize__

So this replaces the default finalizer for DataFrame with your custom one. Where I have indicated, you need to put some code which can arbitrate between conflicts. This is the reason this is not done by default: e.g. if frame1 has name 'foo' and frame2 has name 'bar', what do you do when the method is __add__? What about another method? Let us know what you do and how it works out.
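
To make the arbitration question concrete, here is one possible policy written as a standalone function. It is only a sketch: the function name arbitrate and the specific rules (keep agreeing values, fall back to whichever side is set, keep both on a merge, drop the value otherwise) are assumptions chosen for illustration, not behaviour that pandas provides.

```python
def arbitrate(left_value, right_value, method=None):
    """Hypothetical conflict policy for one metadata attribute."""
    # No conflict: both sides agree (this also covers both being None).
    if left_value == right_value:
        return left_value
    # Only one side carries a value: keep it.
    if left_value is None:
        return right_value
    if right_value is None:
        return left_value
    # Genuine conflict: for a merge, keep both so provenance is not lost.
    if method == "merge":
        return (left_value, right_value)
    # For any other method (e.g. '__add__'), give up rather than guess.
    return None


same = arbitrate("foo", "foo", method="__add__")
merged = arbitrate("foo", "bar", method="merge")
dropped = arbitrate("foo", "bar", method="__add__")
```

Inside a custom finalizer, a function like this could be called once per name in _metadata, with the two values read from the left and right operands.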

This ONLY replaces the finalizer for DataFrame. You can simply perform the default action if you want, which is to propagate other to self; you can also choose not to set anything except under special cases of method.

This method is meant to be overridden in subclasses, which is why you are monkey-patching here (rather than subclassing, which is overkill most of the time).
