Reformatting dataset for Sankey in Pandas
Question
I have my data in a melted Pandas dataframe (code for the data is below):
student | course | order |
---|---|---|
Jerry | A | 1 |
Jerry | B | 2 |
Jerry | C | NaN |
Jessy | C | 1 |
Jessy | A | 2 |
Jessy | B | 3 |
Raphael | A | 1 |
Raphael | C | 2 |
Raphael | C | 3 |
Raphael | B | 4 |
Sally | A | 1 |
Sally | B | 2 |
Sally | C | NaN |
A Sankey diagram requires a format like this:
course1 | course2 | course3 | course4 | count |
---|---|---|---|---|
A | B | | | 2 |
A | C | C | B | 1 |
C | A | B | | 1 |
I can't wrap my head around how to create a column for each level of order and populate it with the values of course, while also creating the count column that counts the number of students with that same sequence.
If I try df.groupby('order')['course'].count(), it returns the groups as rows, not as the columns I need:
order
1.0 2682
2.0 578
3.0 197
4.0 89
5.0 27
6.0 8
7.0 1
Name: course, dtype: int64
It also doesn't create the sets of sequences that are needed to populate the final table.
Can someone please help me reformat my long table into one with the counts of all the course sequences?
Any help is greatly appreciated.
Toy data:
import numpy as np
import pandas as pd

student = ['Jerry','Jerry','Jerry','Jessy','Jessy','Jessy','Raphael','Raphael','Raphael','Raphael','Sally','Sally','Sally']
course = ['A','B','C','C','A','B','A','C','C','B','A','B','C']
# np.nan rather than np.NaN: the capitalised alias was removed in NumPy 2.0
order = [1,2,np.nan,1,2,3,1,2,3,4,1,2,np.nan]
df = pd.DataFrame({'student':student, 'course':course,'order':order})
Recommended answer
This could probably be done in fewer steps, but I came up with the following flow.
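- Drop the NaN values and add a column of course labels (course1, course2, ...).
- Pivot to wide format by course label.
- Concatenate all the courses into a single string.
- Aggregate by that string.
- Merge the original dataframe with the aggregated one.
- Drop duplicate rows and rename the columns.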
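# Drop rows with no order, then label each step as course1, course2, ...
df.dropna(axis=0, how='any', inplace=True)
df['course_gp'] = df['order'].apply(lambda x: 'course' + str(int(x)))
# Pivot to wide format: one row per student, one column per step
df = df.pivot(index='student', columns='course_gp', values='course')
df.fillna('', inplace=True)
# Concatenate the steps into a single sequence string, e.g. 'ACCB'
df['course_all'] = df['course1'] + df['course2'] + df['course3'] + df['course4']
# Count the students per sequence (every column of dfc holds the same count)
dfc = df.groupby('course_all').count()
# Attach the count to each student's row; the two course1 columns get _x/_y suffixes
df = df.merge(dfc[['course1']], left_on='course_all', right_on='course_all', how='inner' )
# Keep one row per sequence and tidy up the column names
df.drop_duplicates(keep='first', inplace=True)
df.rename({'course1_y':'count','course1_x':'course1'}, axis=1, inplace=True)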
 | course1 | course2 | course3 | course4 | course_all | count |
---|---|---|---|---|---|---|
0 | A | B | | | AB | 2 |
2 | C | A | B | | CAB | 1 |
3 | A | C | C | B | ACCB | 1 |
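As noted above, the flow can probably be shortened. The following is a minimal sketch of one possible shorter variant, not the method used in the answer: it pivots on the integer order, then groups by all the step columns at once and takes the group sizes, so no concatenated string or merge is needed. The names clean, wide and result are illustrative only, and the sketch assumes each student has at most one course per order value (pivot raises an error otherwise).

```python
import numpy as np
import pandas as pd

# Toy data from the question.
df = pd.DataFrame({
    'student': ['Jerry'] * 3 + ['Jessy'] * 3 + ['Raphael'] * 4 + ['Sally'] * 3,
    'course': ['A', 'B', 'C', 'C', 'A', 'B', 'A', 'C', 'C', 'B', 'A', 'B', 'C'],
    'order': [1, 2, np.nan, 1, 2, 3, 1, 2, 3, 4, 1, 2, np.nan],
})

# Drop rows without an order and make the order an integer.
clean = df.dropna(subset=['order']).astype({'order': int})

# One row per student, one column per step (1, 2, 3, ...).
wide = clean.pivot(index='student', columns='order', values='course')

# Rename the integer steps to course1, course2, ... and blank out missing steps.
wide.columns = [f'course{i}' for i in wide.columns]
wide = wide.fillna('')

# Group by the full sequence and count the students in each group.
result = wide.groupby(list(wide.columns)).size().reset_index(name='count')
print(result)
```

Pivoting on the integer order instead of a 'course1'-style label keeps the columns in numeric order even when there are ten or more steps, where string labels would sort course10 before course2.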