Merging 2 CSV data sets with Python on a common ID column when one CSV has multiple records per unique ID

Question

I'm very new to Python. Any support is much appreciated.

I have two csv files that I'm trying to merge on a Student_ID column to create a new csv file.

csv1: every entry has a unique Student_ID

Student_ID    Age        Course       startYear
119           24         Bsc          2014

csv2: has multiple records per Student_ID, because it has a new entry for every subject the student is taking

Student_ID            sub_name       marks      Sub_year_level
119                   Botany1        60         2
119                   Anatomy        70         2
119                   cell bio       75         3
129                   Physics1       78         2
129                   Math1          60         1 

I want to merge the two csv files so that I keep all records and columns from csv1 and add new columns holding, for each student, the average mark per Sub_year_level (this has to be calculated from csv2). The final csv file will therefore have one record per unique Student_ID.

What I want my new output csv file to look like:

Student_ID  Age  Course  startYear  level1_avg_mark  level2_avg_mark  level3_avg_mark
119         24   Bsc     2014       60               65              70

Recommended answer

You can use pivot_table with join:

Note: the fill_value parameter replaces NaN with 0; remove it if you do not need that. The default aggregation function is mean.

df2 = df2.pivot_table(index='Student_ID',  \
                      columns='Sub_year_level',  \
                      values='marks', \
                      fill_value=0) \
         .rename(columns='level{}_avg_mark'.format)
print (df2)
Sub_year_level  level1_avg_mark  level2_avg_mark  level3_avg_mark
Student_ID                                                       
119                           0               65               75
129                          60               78                0

df = df1.join(df2, on='Student_ID')
print (df)
   Student_ID  Age Course  startYear  level1_avg_mark  level2_avg_mark  \
0         119   24    Bsc       2014                0               65   

   level3_avg_mark  
0               75  
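As an end-to-end illustration of the pivot_table and join steps above, here is a minimal sketch that also covers reading the input and writing the merged csv. It assumes the files are comma-separated; the inline CSV text and the names `csv1_text`, `csv2_text`, `avg`, `merged` and `result_csv` are hypothetical stand-ins, not part of the original question. fill_value is left out here, so levels a student has no marks for stay as NaN.

```python
import io

import pandas as pd

# Hypothetical inline CSV text standing in for the two files from the question
# (assumed to be comma-separated).
csv1_text = (
    "Student_ID,Age,Course,startYear\n"
    "119,24,Bsc,2014\n"
)
csv2_text = (
    "Student_ID,sub_name,marks,Sub_year_level\n"
    "119,Botany1,60,2\n"
    "119,Anatomy,70,2\n"
    "119,cell bio,75,3\n"
    "129,Physics1,78,2\n"
    "129,Math1,60,1\n"
)

df1 = pd.read_csv(io.StringIO(csv1_text))
df2 = pd.read_csv(io.StringIO(csv2_text))

# One row per student, one column per year level; the default aggfunc is mean.
avg = (
    df2.pivot_table(index="Student_ID", columns="Sub_year_level", values="marks")
       .rename(columns="level{}_avg_mark".format)
)

# Keep every row of df1 and attach the averages by matching Student_ID.
merged = df1.join(avg, on="Student_ID")

# Write the result as CSV text; a file path could be passed to to_csv instead.
buffer = io.StringIO()
merged.to_csv(buffer, index=False)
result_csv = buffer.getvalue()
print(result_csv)
```

Because join keeps only the rows of df1, student 129 (present only in csv2) does not appear in the output.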

If marks of 0 should be left out of the average, a custom aggregation function is needed:

print (df2)
   Student_ID  sub_name  marks  Sub_year_level
0         119   Botany1      0               2
1         119   Botany1      0               2
2         119   Anatomy     72               2
3         119  cell bio     75               3
4         129  Physics1     78               2
5         129     Math1     60               1


# Average only the non-zero marks in each group
f = lambda x: x[x != 0].mean()
df2 = (df2.pivot_table(index='Student_ID', columns='Sub_year_level',
                       values='marks', aggfunc=f)
          .rename(columns='level{}_avg_mark'.format)
          .reset_index())
print (df2)
Sub_year_level  Student_ID  level1_avg_mark  level2_avg_mark  level3_avg_mark
0                      119              NaN             72.0             75.0
1                      129             60.0             78.0              NaN
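Since reset_index() turns Student_ID back into an ordinary column, the result can be combined with csv1 through merge rather than join. The following is only a sketch under that assumption: the DataFrames are rebuilt inline from the sample data shown above, and the names `mean_without_zeros`, `avg` and `merged` are hypothetical.

```python
import pandas as pd

# Sample data rebuilt from the tables shown above.
df1 = pd.DataFrame(
    {"Student_ID": [119], "Age": [24], "Course": ["Bsc"], "startYear": [2014]}
)
df2 = pd.DataFrame(
    {
        "Student_ID": [119, 119, 119, 119, 129, 129],
        "sub_name": ["Botany1", "Botany1", "Anatomy", "cell bio", "Physics1", "Math1"],
        "marks": [0, 0, 72, 75, 78, 60],
        "Sub_year_level": [2, 2, 2, 3, 2, 1],
    }
)


def mean_without_zeros(marks):
    """Mean of the non-zero marks in one (student, year level) group."""
    return marks[marks != 0].mean()


avg = (
    df2.pivot_table(
        index="Student_ID",
        columns="Sub_year_level",
        values="marks",
        aggfunc=mean_without_zeros,
    )
    .rename(columns="level{}_avg_mark".format)
    .reset_index()
)
avg.columns.name = None  # drop the leftover "Sub_year_level" label on the columns

# A left merge keeps every student from df1, even one with no marks at all.
merged = df1.merge(avg, on="Student_ID", how="left")
print(merged)
```

A left merge is chosen so that the output always has one row per record in csv1, which is what the question asks for.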
