How to handle metadata associated with a pandas DataFrame?
Question
What is the best practice for attaching meta information to a DataFrame? I know of the following coding practice:
import pandas as pd
df = pd.DataFrame([])
df.currency = 'USD'
df.measure = 'Price'
df.frequency = 'daily'
But as stated in the post "Adding meta-information/metadata to pandas DataFrame", this carries the risk of losing the information when applying functions such as "groupby, pivot, join or loc", as they may return "a new DataFrame without the metadata attached".
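The risk can be reproduced with a minimal sketch. The column names and values here are made up for illustration; the point is only that an attribute set directly on the object is not carried over to the new DataFrames that `loc` and `groupby` return.

```python
import pandas as pd

# Hypothetical example data, for illustration only
df = pd.DataFrame({'company': ['Apple', 'Apple', 'MSFT'],
                   'price': [90, 85, 20]})
df.currency = 'USD'  # ad hoc attribute stored on this one object

# Both operations return new DataFrame objects
subset = df.loc[df['price'] > 50]
grouped = df.groupby('company').sum()

original_has_it = hasattr(df, 'currency')
subset_has_it = hasattr(subset, 'currency')
grouped_has_it = hasattr(grouped, 'currency')

print(original_has_it, subset_has_it, grouped_has_it)
```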
Is this still valid, or has metadata handling been updated in the meantime? What would be an alternative coding practice?
I do not think building a separate object is very suitable. Working with a MultiIndex does not convince me either. Let's say I want to divide a DataFrame with prices by a DataFrame with earnings. Working with MultiIndexes would be very involved.
# define price DataFrame
p_index = pd.MultiIndex.from_tuples([['Apple', 'price', 'daily'], ['MSFT', 'price', 'daily']])
price = pd.DataFrame([[90, 20], [85, 30], [70, 25]], columns=p_index)
# define earnings DataFrame
e_index = pd.MultiIndex.from_tuples(
    [['Apple', 'earnings', 'daily'], ['MSFT', 'earnings', 'daily']])
earnings = pd.DataFrame([[5000, 2000], [5800, 2200], [5100, 3000]],
                        columns=e_index)
price.divide(earnings.values, level=1, axis=0)
In the example above I do not even ensure that the company indices really match. I would probably need to call pd.DataFrame.reindex() or similar. In my view, this cannot be good coding practice.
Is there a straightforward solution to the problem of handling meta information in this context that I am not seeing?
Thanks in advance.
Recommended answer
I think MultiIndexes are the way to go, but used this way:
daily_price_data = pd.DataFrame({'Apple': [90, 85, 30], 'MSFT':[20, 30, 25]})
daily_earnings_data = pd.DataFrame({'Apple': [5000, 58000, 5100], 'MSFT':[2000, 2200, 3000]})
data = pd.concat({'price':daily_price_data, 'earnings': daily_earnings_data}, axis=1)
data
  earnings       price
     Apple  MSFT Apple MSFT
0     5000  2000    90   20
1    58000  2200    85   30
2     5100  3000    30   25
Then, to do the division:
data['price'] / data['earnings']
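As a sketch of why this addresses the concern about matching companies: selecting a top-level key returns an ordinary DataFrame whose columns are the company names, and the division then aligns on those labels. The example below reuses the numbers above but deliberately lists the earnings columns in a different order, which is an assumption added here for illustration.

```python
import pandas as pd

daily_price_data = pd.DataFrame({'Apple': [90, 85, 30], 'MSFT': [20, 30, 25]})
# Same earnings numbers as above, but with the columns in a different order
daily_earnings_data = pd.DataFrame({'MSFT': [2000, 2200, 3000],
                                    'Apple': [5000, 58000, 5100]})

data = pd.concat({'price': daily_price_data, 'earnings': daily_earnings_data}, axis=1)

# Division matches columns by company name, not by position
ratio = data['price'] / data['earnings']
print(ratio)
```

Because the alignment is by label, no explicit reindex is needed as long as both blocks use the same company names.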
If your workflow makes more sense with the companies on the first level of the column index, then pandas.DataFrame.xs will be very helpful:
data2 = data.reorder_levels([1,0], axis=1).sort_index(axis=1)
data2.xs('price', axis=1, level=-1) / data2.xs('earnings', axis=1, level=-1)
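A self-contained sketch of the reordered layout, using the same numbers as above, may help to check that both layouts give the same result. The variable names `by_measure`, `same`, and `apple_only` are introduced here only for illustration.

```python
import pandas as pd

daily_price_data = pd.DataFrame({'Apple': [90, 85, 30], 'MSFT': [20, 30, 25]})
daily_earnings_data = pd.DataFrame({'Apple': [5000, 58000, 5100],
                                    'MSFT': [2000, 2200, 3000]})
data = pd.concat({'price': daily_price_data, 'earnings': daily_earnings_data}, axis=1)

# Put the company on the first column level and the measure on the second
data2 = data.reorder_levels([1, 0], axis=1).sort_index(axis=1)

# xs selects one measure across all companies and drops that level
by_measure = data2.xs('price', axis=1, level=-1) / data2.xs('earnings', axis=1, level=-1)

# The result should match the division in the original layout
same = by_measure.equals(data['price'] / data['earnings'])

# With companies on top, all measures for one company are one lookup away
apple_only = data2['Apple']
print(same)
print(apple_only)
```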