pandas 根据存在的列生成数据框 [英] Pandas Generating dataframe based on columns being present
问题描述
我有一个数据框
import pandas as pd
df=pd.DataFrame({'Players': [ 'Sam', 'Greg', 'Steve', 'Sam',
'Greg', 'Steve', 'Greg', 'Steve', 'Greg', 'Steve'],
'Wins': [10,5,5,20,30,20,6,9,3,10],
'Losses': [5,5,5,2,3,2,16,20,3,12],
'Type': ['A','B','B','B','A','B','B','A','A','B'],
})
如果我想总结一下我可以手动制作另一个数据框:
If I wanted to summarize the I can manually make another dataframe:
p=df.groupby('Players')
summary = pd.DataFrame({'Total Games': p.Players.count(),
'Average Wins':p.Wins.mean(),
'Greatest Wins':p.Wins.max(),
'Unique games':p.Type.nunique()})
让我说我想自动执行此汇总过程以创建数据帧(如果存在列X的话)执行汇总Y会怎样他这样做的最好方法是什么?我尝试使用字典,但我认为我做错了
Lets say I want to automate this summarizing process to create a dataframe if column X is present perform summarization Y what would the best way of doing this? I have attempted using a dictionary but I think I did something wrong
p=df.groupby('Players')
sumdict = {'Total Games': ['Players', p.Players.count()],
'Average Wins':['Wins', p.Wins.mean()],
'Greatest Wins':['Wins', p.Wins.max()],
'Unique games':['Type', p.Type.nunique()],
'Max Score':['Score', p.Score.max()]}
summary=pd.DataFrame()
for key, value in sumdict.items():
if value[0] in df.columns:
data = pd.DataFrame({key: value[1],})
summary=summary.append(data)
else:
continue
推荐答案
熊猫 DataFrame
支持大多数 dict
方法,包括 get
(可让您将值替换为空键)。因此,您可以对所有列进行所需的统计,然后获取所需列的值,将空的 Series
替换为缺少的列,然后删除 NaN
列(我使用 Bad Value
来说明缺少的列会发生什么情况):
Pandas DataFrame
s support most of the dict
methods, include get
(which allows you to substitute a value for an empty key). So you can do the statistics you want on all columns, then get the values for the column you want, substituting an empty Series
for missing columns, then drop NaN
columns (I use Bad Value
to demonstrate what happens to missing columns):
eser = pd.Series()
count = p.count().max(axis=1)
all_max = p.max()
score_max = all_max.get('Score', eser)
wins_max = all_max.get('Wins', eser)
wins_mean = p.mean().get('Wins', eser)
type_nunique = p.agg(lambda x: x.nunique()).get('Type', eser)
summary = pd.DataFrame({'Total Games': count,
'Average Wins': wins_mean,
'Greatest Wins': wins_max,
'Unique games': type_nunique,
'Max Score': score_max})
summary.dropna(axis=1, how='all', inplace=True)
或单线(这涉及到计算 m ax
在所有列上两次,对于少量的值应该不是问题):
Or a one-liner (which involves calculating the max
on all columns twice, which shouldn't be a problem for a small number of values):
summary = pd.DataFrame({'Total Games': p.count().max(axis=1),
'Average Wins': p.mean().get('Wins', pd.Series()),
'Greatest Wins': p.max().get('Wins', pd.Series()),
'Unique games': p.agg(lambda x: x.nunique()).get('Type', pd.Series()),
'Max Score': p.max().get('Score', pd.Series())}).dropna(axis=1, how='all')
两种方法的结果:
Average Wins Greatest Wins Total Games Unique games
Greg 11 30 4 2
Sam 15 20 2 2
Steve 11 20 4 2
没有 dropna
:
Average Wins Greatest Wins Max Score Total Games Unique games
Greg 11 30 NaN 4 2
Sam 15 20 NaN 2 2
Steve 11 20 NaN 4 2
如果性能是一个问题,那么上述操作将很慢,因为它们需要在所有列上计算多个统计信息,这意味着要先计算然后丢弃统计信息。一种更快但更丑陋的方法类似于对 dict
使用循环的方法。
If performance is an issue, the above will be slow because they require computing multiple statistics on all columns, which means statistics are being computed and then thrown away. A faster, but uglier, approach is similar to your approach of using a loop over a dict
.
实施中的问题是 dict
项没有被懒惰地评估,在创建 dict
时对它们进行评估,这意味着它仍然尝试访问不存在的列。
The problem with your implementation is that the dict
items are not lazily evaluated, they are evaluated when the dict
is created, which means it still tries to access the non-existent columns.
下面的方法仅在找到该列时才获取项目并应用该函数(对 count $ c $进行了特殊处理c>情况,因为任何现有的列都可以使用):
The approach below gets both the item and applies the function only when the column is found (with special handling for the count
case, since any existing column will work):
sumdict = {'Total Games': (None, 'count'),
'Average Wins': ('Wins', 'mean'),
'Greatest Wins': ('Wins', 'max'),
'Unique games': ('Type', 'nunique'),
'Max Score': ('Score', 'max')}
summary = []
for key, (column, op) in sumdict.items():
if column is None:
res = p.agg(op).max(axis=1)
elif column not in df:
continue
else:
res = p[column].agg(lambda x: getattr(x, op)())
summary.append(pd.DataFrame({key: res}))
summary = pd.concat(summary, axis=1)
尽管列顺序不同,但结果与上述方法相同。
It gives the same result as my above approaches, although with a different column order.
这篇关于 pandas 根据存在的列生成数据框的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!