数据透视表按索引聚合 [英] Pivot table aggregation by index
问题描述
我有两个数据框,其中包含有关学生成绩和考试成绩的信息.第一个看起来像这样:
I have two dataframes containing information about student grades and test scores. The first looks like this:
ID Test_Score Class1 Class2 Class3
0 001 85 B- A C+
1 002 78 B NaN B+
2 003 93 A B NaN
...
第二个看起来像这样:
0 1
0 Algebra A
0 Calculus_1 B
0 Calculus_2 C-
1 Algebra C+
1 Trig F
1 Trig C
1 Calculus_1 C-
...
每个数据框中的索引指的是同一位学生.因此,两个数据帧中索引为0的学生都是相同的.
The indices in each dataframe refer to the same student. So the student at index 0 is the same in both dataframes.
我想做的是从第二个数据框中创建一个数据透视表,其中的行对应于学生(即索引),列是数学课,值是他们在每个课中取得的最高分(因为学生有可能不止一次上课).然后,我将其连接到第一个数据帧.
What I want to do is create a pivot table from the second dataframe, where the rows correspond to students (i.e. indices), the columns are the math classes, and the values are the highest grade they made in each class (since it's possible for a student to have taken a class more than once). I will then concatenate this to the first dataframe.
我已经写了一个聚合函数,可以找到最高的分数.
I have already written an aggregation function that finds the highest grade.
我尝试以下操作,因为它会失败:
I tried the following knowing it would fail:
p = pd.pivot_table(u, columns=0, values=1, aggfunc=highest)
我得到的是单行,而不是为每个学生返回一行数据,而是包含给定班级中任何学生的最高成绩:
Instead of returning a dataframe with a row for each student, I get a single row that contains the highest grade that any student made in the given class:
Algebra Trig Precalculus Calculus_1 Calculus_2
1 A A+ A+ A A
在我脑海中解决问题的方式,我只需要汇总共享索引的行,就不能确定该怎么做.
The way I've framed the problem in my mind, I just need to aggregate across the rows which share in index, which I am unsure how to do.
我也愿意接受完全不同的方法.
I'm also open to an entirely different approach.
所需的输出:
ID Test_Score Class1 Class2 Class3 Algebra Trig ...
0 001 85 B- A C+ A NaN
1 002 78 B NaN B+ C+ C
2 003 93 A B NaN B B-
...
这是我的最高"功能的代码:
Here is the code for my 'highest' function:
def highest(x):
q = 0
z = None
for g in x:
if qpoints(g) > q:
q = qpoints(g)
z = g
return z
其中qpoints是我已经在其他地方使用的以下函数:
where qpoints is the following function that I was already using elsewhere:
def qpoints(x):
qvalue = {'W': 0,
'F': 0,
'D': 1.0,
'D+': 1.33,
'C-': 1.67,
'C': 2.0,
'C+': 2.33,
'B-': 2.67,
'B': 3.0,
'B+': 3.33,
'A-': 3.84,
'A': 4.0,
'A+': 4.0}
return qvalue[x]
推荐答案
您的问题是,您在pivot_table
时需要保留第二个数据帧的索引值,请参见
You problem is that you need to keep the index values of the second dataframe when you pivot_table
, see this answer for understanding. So if you do:
print (df2.reset_index().pivot_table(index='index', values=[1], columns=[0],
aggfunc= lambda x: sorted(x)[0]))
# I used my own idea of highest function
1
0 Algebra Calculus_1 Calculus_2 Trig
index
0 A B C- NaN
1 C+ C- NaN C
然后您可以 join
,例如:
then you can join
such as:
df_p = df2.reset_index().pivot_table(index='index', values=[1], columns=[0],
aggfunc= lambda x: sorted(x)[0])
df_p.columns = [col[1] for col in df_p.columns]
new_df = df1.join(df_p)
print (new_df)
ID Test_Score Class1 Class2 Class3 Algebra Calculus_1 Calculus_2 Trig
0 001 85 B- A C+ A B C- NaN
1 002 78 B NaN B+ C+ C- NaN C
2 003 93 A B NaN NaN NaN NaN NaN
这篇关于数据透视表按索引聚合的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!