How to handle metadata associated with a pandas dataframe?


Question

What is the best practice for attaching meta information to a dataframe? I know of the following coding practice:

import pandas as pd
df = pd.DataFrame([])
df.currency = 'USD'
df.measure = 'Price'
df.frequency = 'daily'

But as stated in the post "Adding meta-information/metadata to pandas DataFrame", this carries the risk of losing the information when applying functions such as "groupby, pivot, join or loc", as they may return "a new DataFrame without the metadata attached".

Is this still valid, or has the handling of meta information been updated in the meantime? What would be an alternative coding practice?
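The risk described above can be illustrated with a minimal sketch. The frame contents and the variable names `subset`, `has_on_original` and `has_on_subset` are made up for illustration; the point is only that an attribute set on one DataFrame object does not follow along when an operation such as `loc` returns a new object.

```python
import pandas as pd

# Illustrative data, not taken from the question.
df = pd.DataFrame({'price': [90, 85, 70]})

# Attach metadata as a plain Python attribute, as in the question.
df.currency = 'USD'

# .loc with a list of labels returns a new DataFrame object.
subset = df.loc[[0, 1]]

has_on_original = hasattr(df, 'currency')    # attribute lives on this object
has_on_subset = hasattr(subset, 'currency')  # the new object never received it

print(has_on_original, has_on_subset)
```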

I do not think building a separate object is very suitable. Working with a MultiIndex does not convince me either. Let's say I want to divide a dataframe with prices by a dataframe with earnings. Working with MultiIndexes would be very involved.

# define price dataframe
p_index = pd.MultiIndex.from_tuples([['Apple', 'price', 'daily'],['MSFT', 'price', 'daily']])
price = pd.DataFrame([[90, 20], [85, 30], [70, 25]], columns=p_index)

# define earnings dataframe
e_index = pd.MultiIndex.from_tuples(
    [['Apple', 'earnings', 'daily'], ['MSFT', 'earnings', 'daily']])
earnings=pd.DataFrame([[5000, 2000], [5800, 2200], [5100, 3000]], 
                columns=e_index)

price.divide(earnings.values, level=1, axis=0)

In the example above I do not even ensure that the company indices really match. I would probably need to invoke pd.DataFrame.reindex() or similar. From my point of view, this cannot be good coding practice.

Is there a straightforward solution to the problem of handling meta information in that context that I don't see?

Thanks in advance.

Recommended answer

I think that MultiIndexes are the way to go, but used this way:

daily_price_data = pd.DataFrame({'Apple': [90, 85, 30], 'MSFT': [20, 30, 25]})
daily_earnings_data = pd.DataFrame({'Apple': [5000, 58000, 5100], 'MSFT': [2000, 2200, 3000]})
data = pd.concat({'price': daily_price_data, 'earnings': daily_earnings_data}, axis=1)
data


    earnings        price
    Apple   MSFT    Apple   MSFT
0   5000    2000    90      20
1   58000   2200    85      30
2   5100    3000    30      25

Then, to divide:

data['price'] / data['earnings']
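The division above aligns on company labels rather than on column position, which addresses the concern in the question about whether the company indices really match. Here is a minimal sketch, assuming the same figures as above but with the earnings columns deliberately listed in a different order; the `mismatched` check is a hypothetical guard, not part of the original answer.

```python
import pandas as pd

daily_price_data = pd.DataFrame({'Apple': [90, 85, 30], 'MSFT': [20, 30, 25]})
# Columns intentionally in a different order than in the price frame.
daily_earnings_data = pd.DataFrame({'MSFT': [2000, 2200, 3000],
                                    'Apple': [5000, 58000, 5100]})

data = pd.concat({'price': daily_price_data, 'earnings': daily_earnings_data}, axis=1)

# Hypothetical guard: companies present in only one of the two blocks.
mismatched = set(data['price'].columns) ^ set(data['earnings'].columns)

# Selecting a top-level key drops that level, leaving plain company columns,
# so the division matches Apple with Apple and MSFT with MSFT.
ratio = data['price'] / data['earnings']
print(ratio)
```

If a company appeared in only one of the two frames, its column in `ratio` would contain NaN instead of raising an error, so a check like `mismatched` may be worth keeping.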

If your workflow makes more sense with companies listed on the first level of the index, then pandas.DataFrame.xs will be very helpful:

data2 = data.reorder_levels([1,0], axis=1).sort_index(axis=1)
data2.xs('price', axis=1, level=-1) / data2.xs('earnings', axis=1, level=-1)
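A possible variation on the snippet above is to name the column levels, so that the metadata levels can be referred to by name instead of by position. In this sketch the level names 'measure' and 'company' are assumptions chosen for illustration; they do not appear in the original answer.

```python
import pandas as pd

daily_price_data = pd.DataFrame({'Apple': [90, 85, 30], 'MSFT': [20, 30, 25]})
daily_earnings_data = pd.DataFrame({'Apple': [5000, 58000, 5100],
                                    'MSFT': [2000, 2200, 3000]})

# 'measure' and 'company' are illustrative level names.
data = pd.concat(
    {'price': daily_price_data, 'earnings': daily_earnings_data},
    axis=1,
    names=['measure', 'company'],
)

# Put companies on the first level, then select each measure by level name.
data2 = data.reorder_levels(['company', 'measure'], axis=1).sort_index(axis=1)
ratio = (data2.xs('price', axis=1, level='measure')
         / data2.xs('earnings', axis=1, level='measure'))
print(ratio)
```

Named levels make it less likely that a positional index such as `level=-1` silently selects the wrong level if more metadata levels (for example a frequency) are added later.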
