pandas 根据存在的列生成数据框 [英] Pandas Generating dataframe based on columns being present

查看:60
本文介绍了 pandas 根据存在的列生成数据框的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个数据框

import pandas as pd

df=pd.DataFrame({'Players': [ 'Sam', 'Greg', 'Steve', 'Sam',
                 'Greg', 'Steve', 'Greg', 'Steve', 'Greg', 'Steve'],
                 'Wins': [10,5,5,20,30,20,6,9,3,10],
                 'Losses': [5,5,5,2,3,2,16,20,3,12],
                 'Type': ['A','B','B','B','A','B','B','A','A','B'],
                 })

如果我想总结一下我可以手动制作另一个数据框:

If I wanted to summarize the I can manually make another dataframe:

p=df.groupby('Players')

summary = pd.DataFrame({'Total Games': p.Players.count(),
                        'Average Wins':p.Wins.mean(),
                        'Greatest Wins':p.Wins.max(),
                        'Unique games':p.Type.nunique()})

让我说我想自动执行此汇总过程以创建数据帧(如果存在列X的话)执行汇总Y会怎样他这样做的最好方法是什么?我尝试使用字典,但我认为我做错了

Lets say I want to automate this summarizing process to create a dataframe if column X is present perform summarization Y what would the best way of doing this? I have attempted using a dictionary but I think I did something wrong

p=df.groupby('Players')
sumdict = {'Total Games': ['Players', p.Players.count()],
            'Average Wins':['Wins', p.Wins.mean()],
            'Greatest Wins':['Wins', p.Wins.max()],
            'Unique games':['Type', p.Type.nunique()],
            'Max Score':['Score', p.Score.max()]}

summary=pd.DataFrame()

for key, value in sumdict.items():
        if value[0] in df.columns:
            data = pd.DataFrame({key: value[1],})
            summary=summary.append(data)
        else:
            continue


推荐答案

熊猫 DataFrame 支持大多数 dict 方法,包括 get (可让您将值替换为空键)。因此,您可以对所有列进行所需的统计,然后获取所需列的值,将空的 Series 替换为缺少的列,然后删除 NaN 列(我使用 Bad Value 来说明缺少的列会发生什么情况):

Pandas DataFrames support most of the dict methods, include get (which allows you to substitute a value for an empty key). So you can do the statistics you want on all columns, then get the values for the column you want, substituting an empty Series for missing columns, then drop NaN columns (I use Bad Value to demonstrate what happens to missing columns):

eser = pd.Series()
count = p.count().max(axis=1)
all_max = p.max()
score_max = all_max.get('Score', eser)
wins_max = all_max.get('Wins', eser)
wins_mean = p.mean().get('Wins', eser)
type_nunique = p.agg(lambda x: x.nunique()).get('Type', eser)

summary = pd.DataFrame({'Total Games': count,
                        'Average Wins': wins_mean,
                        'Greatest Wins': wins_max,
                        'Unique games': type_nunique,
                        'Max Score': score_max})
summary.dropna(axis=1, how='all', inplace=True)

或单线(这涉及到计算 m ax 在所有列上两次,对于少量的值应该不是问题):

Or a one-liner (which involves calculating the max on all columns twice, which shouldn't be a problem for a small number of values):

summary = pd.DataFrame({'Total Games': p.count().max(axis=1),
                        'Average Wins': p.mean().get('Wins', pd.Series()),
                        'Greatest Wins': p.max().get('Wins', pd.Series()),
                        'Unique games': p.agg(lambda x: x.nunique()).get('Type', pd.Series()),
                        'Max Score': p.max().get('Score', pd.Series())}).dropna(axis=1, how='all')

两种方法的结果:

       Average Wins  Greatest Wins  Total Games  Unique games
Greg             11             30            4             2
Sam              15             20            2             2
Steve            11             20            4             2

没有 dropna

       Average Wins  Greatest Wins  Max Score  Total Games  Unique games
Greg             11             30        NaN            4             2
Sam              15             20        NaN            2             2
Steve            11             20        NaN            4             2

如果性能是一个问题,那么上述操作将很慢,因为它们需要在所有列上计算多个统计信息,这意味着要先计算然后丢弃统计信息。一种更快但更丑陋的方法类似于对 dict 使用循环的方法。

If performance is an issue, the above will be slow because they require computing multiple statistics on all columns, which means statistics are being computed and then thrown away. A faster, but uglier, approach is similar to your approach of using a loop over a dict.

实施中的问题是 dict 项没有被懒惰地评估,在创建 dict 时对它们进行评估,这意味着它仍然尝试访问不存在的列。

The problem with your implementation is that the dict items are not lazily evaluated, they are evaluated when the dict is created, which means it still tries to access the non-existent columns.

下面的方法仅在找到该列时才获取项目并应用该函数(对 count 情况,因为任何现有的列都可以使用):

The approach below gets both the item and applies the function only when the column is found (with special handling for the count case, since any existing column will work):

sumdict = {'Total Games': (None, 'count'),
           'Average Wins': ('Wins', 'mean'),
           'Greatest Wins': ('Wins', 'max'),
           'Unique games': ('Type', 'nunique'),
           'Max Score': ('Score', 'max')}

summary = []
for key, (column, op) in sumdict.items():
    if column is None:
        res = p.agg(op).max(axis=1)
    elif column not in df:
        continue
    else: 
        res = p[column].agg(lambda x: getattr(x, op)())
    summary.append(pd.DataFrame({key: res}))
summary = pd.concat(summary, axis=1)

尽管列顺序不同,但结果与上述方法相同。

It gives the same result as my above approaches, although with a different column order.

这篇关于 pandas 根据存在的列生成数据框的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆