Naming returned columns in Pandas aggregate function?
I'm having trouble with Pandas' groupby functionality. I've read the documentation, but I can't seem to figure out how to apply aggregate functions to multiple columns and have custom names for the resulting columns.
This comes very close, but the data structure returned has nested column headings:
data.groupby("Country").agg(
    {"column1": {"foo": sum}, "column2": {"mean": np.mean, "std": np.std}})
(i.e., I want to take the mean and std of column2, but return those columns as "mean" and "std")
What am I missing?
This will drop the outermost level from the hierarchical column index:
df = data.groupby(...).agg(...)
df.columns = df.columns.droplevel(0)
If you'd like to keep the outermost level, you can use the ravel() function on the multi-level column to form new labels:
df.columns = ["_".join(x) for x in df.columns.ravel()]
For example:
import pandas as pd
import pandas.rpy.common as com  # shipped with older pandas releases; removed in later versions
import numpy as np
data = com.load_data('Loblolly')
print(data.head())
#     height  age  Seed
# 1     4.51    3   301
# 15   10.89    5   301
# 29   28.72   10   301
# 43   41.74   15   301
# 57   52.70   20   301
df = data.groupby('Seed').agg(
    {'age': ['sum'],
     'height': ['mean', 'std']})
print(df.head())
#       age     height
#       sum        std       mean
# Seed
# 301    78  22.638417  33.246667
# 303    78  23.499706  34.106667
# 305    78  23.927090  35.115000
# 307    78  22.222266  31.328333
# 309    78  23.132574  33.781667
df.columns = df.columns.droplevel(0)
print(df.head())
yields
      sum        std       mean
Seed
301    78  22.638417  33.246667
303    78  23.499706  34.106667
305    78  23.927090  35.115000
307    78  22.222266  31.328333
309    78  23.132574  33.781667
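Since the example above depends on pandas.rpy to load the Loblolly dataset, here is a minimal self-contained sketch of the same droplevel step. The small inline DataFrame is made up for illustration and is not the real Loblolly data; it only assumes a reasonably recent pandas.

```python
import pandas as pd

# Hypothetical stand-in for the Loblolly data: two seeds, three rows each.
data = pd.DataFrame({
    "height": [4.0, 10.0, 28.0, 5.0, 11.0, 29.0],
    "age": [3, 5, 10, 3, 5, 10],
    "Seed": [301, 301, 301, 303, 303, 303],
})

df = data.groupby("Seed").agg({"age": ["sum"], "height": ["mean", "std"]})
# At this point the columns are a two-level MultiIndex:
# ('age', 'sum'), ('height', 'mean'), ('height', 'std')

# Drop the outer level (the source column names), keeping only the function names.
df.columns = df.columns.droplevel(0)
print(df)
```

Note that if two source columns were aggregated with the same function, dropping the outer level would leave duplicate column names, which is one reason to prefer the prefixed labels.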
Alternatively, to keep the first level of the index:
df = data.groupby('Seed').agg(
    {'age': ['sum'],
     'height': ['mean', 'std']})
df.columns = ["_".join(x) for x in df.columns.ravel()]
yields
      age_sum  height_std  height_mean
Seed
301        78   22.638417    33.246667
303        78   23.499706    34.106667
305        78   23.927090    35.115000
307        78   22.222266    31.328333
309        78   23.132574    33.781667
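As a rough, self-contained sketch of the flattening approach, the snippet below joins the two column levels into prefixed names, using to_flat_index() in place of ravel(). It also shows named aggregation, a different technique from the one in this answer, which is available if you can assume pandas 0.25 or later and produces the custom names directly. The inline data is invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in data (not the real Loblolly dataset).
data = pd.DataFrame({
    "height": [4.0, 10.0, 28.0, 5.0, 11.0, 29.0],
    "age": [3, 5, 10, 3, 5, 10],
    "Seed": [301, 301, 301, 303, 303, 303],
})
grouped = data.groupby("Seed")

# Option 1: aggregate, then flatten the two-level columns into
# 'column_function' labels such as 'height_mean'.
flat = grouped.agg({"age": ["sum"], "height": ["mean", "std"]})
flat.columns = ["_".join(col) for col in flat.columns.to_flat_index()]

# Option 2 (assumes pandas >= 0.25): named aggregation, where each keyword
# is the output column name and the value is a (column, function) pair.
named = grouped.agg(
    age_sum=("age", "sum"),
    height_mean=("height", "mean"),
    height_std=("height", "std"),
)

print(flat)
print(named)
```

Both options should give the same three columns here; the second avoids the intermediate multi-level index entirely and lets you pick any names you like.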