Merging 2 csv data sets with Python on a common ID column, where one csv has multiple records per unique ID
Problem description
I'm very new to Python. Any support is much appreciated.
I have two csv files that I'm trying to merge on a Student_ID column to create a new csv file.
csv1: every entry has a unique Student_ID
Student_ID Age Course startYear
119 24 Bsc 2014
csv2: has multiple records per Student_ID, because it has a new entry for every subject the student is taking
Student_ID  sub_name  marks  Sub_year_level
119         Botany1   60     2
119         Anatomy   70     2
119         cell bio  75     3
129         Physics1  78     2
129         Math1     60     1
I want to merge the two csv files so that I have all records and columns from csv1, plus newly created columns holding the average mark (which has to be calculated from csv2) for each Sub_year_level per student. The final csv file will therefore have exactly one record per Student_ID.
What I want my new output csv file to look like:
Student_ID Age Course startYear level1_avg_mark level2_avg_mark level3_avg_mark
119 24 Bsc 2014 60 65 70
Recommended answer
You can use pivot_table with join:
Note: the fill_value parameter replaces NaN with 0; remove it if that is not needed. The default aggregation function is mean.
# One row per student, one column per year level; each cell is the mean mark.
df2 = (df2.pivot_table(index='Student_ID',
                       columns='Sub_year_level',
                       values='marks',
                       fill_value=0)
          .rename(columns='level{}_avg_mark'.format))
print (df2)
Sub_year_level level1_avg_mark level2_avg_mark level3_avg_mark
Student_ID
119 0 65 75
129 60 78 0
df = df1.join(df2, on='Student_ID')
print (df)
   Student_ID  Age Course  startYear  level1_avg_mark  level2_avg_mark  level3_avg_mark
0         119   24    Bsc       2014                0               65               75
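For reference, the whole flow from reading the two files to producing the merged csv might look like the sketch below. It is only an illustration under stated assumptions: the sample rows are copied from the question and read from in-memory buffers, so the comma-separated layout and the variable names (avg, merged, csv_text) are assumptions rather than part of the original answer.

```python
import io

import pandas as pd

# Sample rows copied from the question. Real code would pass file paths to
# pd.read_csv instead of these in-memory buffers (comma separator assumed).
csv1 = io.StringIO(
    "Student_ID,Age,Course,startYear\n"
    "119,24,Bsc,2014\n"
)
csv2 = io.StringIO(
    "Student_ID,sub_name,marks,Sub_year_level\n"
    "119,Botany1,60,2\n"
    "119,Anatomy,70,2\n"
    "119,cell bio,75,3\n"
    "129,Physics1,78,2\n"
    "129,Math1,60,1\n"
)
df1 = pd.read_csv(csv1)
df2 = pd.read_csv(csv2)

# One row per student, one column per year level, mean mark in each cell.
avg = (df2.pivot_table(index="Student_ID",
                       columns="Sub_year_level",
                       values="marks",
                       fill_value=0)
          .rename(columns="level{}_avg_mark".format))

# join is a left join by default: every row of df1 is kept, and students
# that appear only in df2 (129 here) are dropped.
merged = df1.join(avg, on="Student_ID")

# With no path argument to_csv returns the csv text; pass a file name
# instead to write the new csv file to disk.
csv_text = merged.to_csv(index=False)
print(csv_text)
```

Because join defaults to a left join, the result keeps exactly the students listed in csv1, which matches the requirement of one record per Student_ID.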
If marks of 0 should be left out of the average, a custom aggregation function is needed:
print (df2)
Student_ID sub_name marks Sub_year_level
0 119 Botany1 0 2
1 119 Botany1 0 2
2 119 Anatomy 72 2
3 119 cell bio 75 3
4 129 Physics1 78 2
5 129 Math1 60 1
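# Average only the non-zero marks in each group.
f = lambda x: x[x != 0].mean()
# The outer parentheses let the method chain continue across lines.
df2 = (df2.pivot_table(index='Student_ID', columns='Sub_year_level', values='marks', aggfunc=f)
          .rename(columns='level{}_avg_mark'.format)
          .reset_index())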
print (df2)
Sub_year_level Student_ID level1_avg_mark level2_avg_mark level3_avg_mark
0 119 NaN 72.0 75.0
1 129 60.0 78.0 NaN
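Since reset_index turns Student_ID back into an ordinary column, the averaged table can then be attached to the first file with merge rather than join. The sketch below is one possible way to do that, not part of the original answer: the df1 row is taken from the question, and the helper name mean_without_zeros and the variables avg and merged are hypothetical.

```python
import pandas as pd

# df1 row taken from the question; df2 rows taken from the printout above.
df1 = pd.DataFrame({"Student_ID": [119], "Age": [24],
                    "Course": ["Bsc"], "startYear": [2014]})
df2 = pd.DataFrame({
    "Student_ID": [119, 119, 119, 119, 129, 129],
    "sub_name": ["Botany1", "Botany1", "Anatomy", "cell bio", "Physics1", "Math1"],
    "marks": [0, 0, 72, 75, 78, 60],
    "Sub_year_level": [2, 2, 2, 3, 2, 1],
})


def mean_without_zeros(marks):
    """Mean of the non-zero marks in one (student, year level) group."""
    return marks[marks != 0].mean()


avg = (df2.pivot_table(index="Student_ID",
                       columns="Sub_year_level",
                       values="marks",
                       aggfunc=mean_without_zeros)
          .rename(columns="level{}_avg_mark".format)
          .reset_index())
avg.columns.name = None  # drop the leftover "Sub_year_level" axis label

# how="left" keeps every student from df1, even one without any marks.
merged = df1.merge(avg, on="Student_ID", how="left")
print(merged)
```

Levels in which a student has no non-zero marks stay NaN here; whether to fill them afterwards (for example with fillna) depends on how missing averages should appear in the output file.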