Reformatting dataset for Sankey in Pandas

Views: 63 Published: 2021/6/14 18:34:55 Tags: python pandas dataframe pandas-groupby sankey-diagram


Question

I have my data in a melted (long-format) Pandas DataFrame (code for the data is below):

student	course	order
Jerry	A	1
Jerry	B	2
Jerry	C	NaN
Jessy	C	1
Jessy	A	2
Jessy	B	3
Raphael	A	1
Raphael	C	2
Raphael	C	3
Raphael	B	4
Sally	A	1
Sally	B	2
Sally	C	NaN

A Sankey diagram requires a format like this:

course1	course2	course3	course4	count
A	B			2
A	C	C	B	1
C	A	B		1

I can't wrap my head around how to create a column for each level of order, populate those columns with the values of course, and at the same time create a count column that counts the number of students with the same sequence.

If I try df.groupby('order')['course'].count(), it returns the groups as rows, not as the columns I need:

order
1.0    2682
2.0     578
3.0     197
4.0      89
5.0      27
6.0       8
7.0       1
Name: course, dtype: int64

It also doesn't build the course sequences that need to populate the final table.

Can someone please help me reformat my long table into one that holds the counts of every course sequence?

Any help is greatly appreciated.

Toy data:

import numpy as np
import pandas as pd

student = ['Jerry','Jerry','Jerry','Jessy','Jessy','Jessy','Raphael','Raphael','Raphael','Raphael','Sally','Sally','Sally']
course = ['A','B','C','C','A','B','A','C','C','B','A','B','C']
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
order = [1,2,np.nan,1,2,3,1,2,3,4,1,2,np.nan]
df = pd.DataFrame({'student':student, 'course':course,'order':order})

Recommended answer

The number of steps could probably be reduced a little, but I came up with the following flow.

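1. Drop the NaN values and add a column of course position names (course1, course2, ...).
2. Pivot to wide format on those position names.
3. Concatenate all course names into a single string.
4. Aggregate by that combined course string.
5. Merge the original data frame with the aggregated data frame.
6. Drop duplicate rows and rename the columns.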

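# 1. Drop rows with no order and label each position as course1, course2, ...
df.dropna(axis=0, how='any', inplace=True)
df['course_gp'] = df['order'].apply(lambda x: 'course' + str(int(x)))
# 2. Pivot to wide format: one row per student, one column per position
df = df.pivot(index='student', columns='course_gp', values='course')
df.fillna('', inplace=True)
# 3. Concatenate the course names into one sequence string (assumes four positions, as in the toy data)
df['course_all'] = df['course1'] + df['course2'] + df['course3'] + df['course4']
# 4. Count the students per sequence
dfc = df.groupby('course_all').count()
# 5. Merge the counts back onto the wide table
df = df.merge(dfc[['course1']], left_on='course_all', right_on='course_all', how='inner' )
# 6. Drop duplicate sequences and rename the suffixed columns produced by the merge
df.drop_duplicates(keep='first', inplace=True)
df.rename({'course1_y':'count','course1_x':'course1'}, axis=1, inplace=True)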

	course1	course2	course3	course4	course_all	count
0	A	B			AB	2
2	C	A	B		CAB	1
3	A	C	C	B	ACCB	1
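
As the answer notes, the number of steps could be smaller. The following is only a sketch of one shorter variant, not part of the original answer: it builds one tuple of courses per student and counts identical tuples with collections.Counter, so it does not hard-code four course columns. The names clean, sequences, width and wide are illustrative, and the toy data is repeated so the block runs on its own.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Toy data from the question.
df = pd.DataFrame({
    "student": ["Jerry", "Jerry", "Jerry", "Jessy", "Jessy", "Jessy",
                "Raphael", "Raphael", "Raphael", "Raphael",
                "Sally", "Sally", "Sally"],
    "course": ["A", "B", "C", "C", "A", "B", "A", "C", "C", "B", "A", "B", "C"],
    "order": [1, 2, np.nan, 1, 2, 3, 1, 2, 3, 4, 1, 2, np.nan],
})

# Drop rows without an order, then sort so each student's courses are in sequence.
clean = df.dropna(subset=["order"]).sort_values(["student", "order"])

# One tuple of courses per student, e.g. ("A", "B") for Jerry.
sequences = {name: tuple(group["course"]) for name, group in clean.groupby("student")}

# Count how many students share each sequence.
counts = Counter(sequences.values())

# Pad shorter sequences with '' so every row has the same number of course columns.
width = max(len(seq) for seq in counts)
rows = [seq + ("",) * (width - len(seq)) + (n,) for seq, n in counts.most_common()]
columns = [f"course{i + 1}" for i in range(width)] + ["count"]
wide = pd.DataFrame(rows, columns=columns)
print(wide)
```

Keeping each sequence as a tuple, rather than as a concatenated string, avoids ambiguity if course names are ever longer than one character (for example, "AB" + "C" versus "A" + "BC").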
