Python extracting new dataframe

Problem description

I have a dataframe:

  topic  student level 
    1      a       1     
    1      b       2     
    1      a       3     
    2      a       1     
    2      b       2     
    2      a       3     
    2      b       4     
    3      c       1     
    3      b       2     
    3      c       3     
    3      a       4     
    3      b       5  

It contains a column, level, that specifies who started the topic and who replied to it. Level 1 means that the student started the topic. Level 2 means that the student replied to the student who started the topic. Level 3 means that the student replied to the student at level 2, and so on.
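To make the level semantics concrete, here is a minimal sketch that rebuilds the example table and attaches, for each post, the student being replied to. It assumes that a post at level n replies to the post at level n-1 in the same topic; the column name `replied_to` is made up for illustration.

```python
import pandas as pd

# The example table from the question.
df = pd.DataFrame({
    "topic":   [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3],
    "student": list("abaababcbcab"),
    "level":   [1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5],
})

# Assumption: once rows are ordered by level within each topic, the row above
# (inside the same topic) is the post being replied to.
df = df.sort_values(["topic", "level"]).reset_index(drop=True)

# Hypothetical helper column: NaN for level-1 rows, which reply to nobody.
df["replied_to"] = df.groupby("topic")["student"].shift()
print(df)
```

Shifting inside each topic group keeps the first post of a topic from being paired with the last post of the previous topic.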

I would like to extract a new dataframe that presents the communication between students within each topic. It should contain three columns: "student source", "student destination" and "reply count". Reply count is the number of times that the student destination "directly" replied to the student source.

I should get something like:

   st_source st_dest reply_count
        a        b       4
        a        c       0
        b        a       2
        b        c       1
        c        a       1
        c        b       1

I tried to find the first two columns using this code:

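import numpy as np

idx_cols = ['topic']
std_cols = ['student_x', 'student_y']
# Self-merge on topic to pair every student with every other student in the same topic.
df1 = df.merge(df, on=idx_cols)
df2 = df1.loc[df1.student_x != df1.student_y, idx_cols + std_cols]

# Sort the two names within each row so that (a, b) and (b, a) look identical.
df2.loc[:, std_cols] = np.sort(df2.loc[:, std_cols])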

Does anyone have any suggestions for the third column?

Thank you in advance!

Recommended answer

Assume your data is already sorted by topic and then by level, as in the example, so that each reply sits directly below the post it answers. If not, please sort it first.
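If the rows arrive out of order, a sort along these lines should restore the layout that the row-above comparison relies on. This is a sketch that assumes the intended order is topic first, then level; the shuffled input is invented for illustration.

```python
import pandas as pd

# Hypothetical out-of-order input: the three rows of topic 1, shuffled.
df = pd.DataFrame({
    "topic":   [1, 1, 1],
    "student": ["a", "a", "b"],
    "level":   [3, 1, 2],
})

# Sort by topic, then level, and rebuild a 0..n-1 index so that
# "the row above" refers to the previous post in the thread.
df = df.sort_values(["topic", "level"]).reset_index(drop=True)
print(df)
```

Resetting the index matters when later steps look up the previous row by label rather than by position.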

import pandas as pd

# Pair each row with the student in the row above (the post being replied to).
# Level-1 rows start a topic, so they contribute a count of 0.
# (Uses shift() because df.ix is no longer available in current pandas.)
df_count = pd.DataFrame({'st_source': df.student.shift(),
                         'st_dest': df.student,
                         'reply_count': (df.level > 1).astype(int)})

# Sum the counts for each source-dest pair, then drop rows where source and dest are the same student.
df_count = df_count.groupby(['st_source', 'st_dest']).sum().reset_index()[lambda x: x.st_source != x.st_dest]

print(df_count)
Out[218]: 
  st_source st_dest  reply_count
1         a       b            4
2         b       a            2
3         b       c            1
4         c       a            1
5         c       b            1
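The result above leaves out pairs that never interacted, whereas the table requested in the question also lists a, c with a count of 0. One possible way to add those rows, sketched here with `itertools.permutations` and `reindex`, is to build every ordered pair of distinct students and fill the missing counts with zero. The names `replies`, `all_pairs` and `df_full` are illustrative, not part of the original answer.

```python
from itertools import permutations

import pandas as pd

# The example table from the question, already sorted by topic and level.
df = pd.DataFrame({
    "topic":   [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3],
    "student": list("abaababcbcab"),
    "level":   [1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5],
})

# Within each topic, the row above is assumed to be the post being answered.
# Level-1 rows get NaN as their source and are dropped.
replies = pd.DataFrame({
    "st_source": df.groupby("topic")["student"].shift(),
    "st_dest": df["student"],
}).dropna()
counts = replies.groupby(["st_source", "st_dest"]).size()

# Every ordered pair of distinct students, including pairs with no replies.
students = sorted(df["student"].unique())
all_pairs = pd.MultiIndex.from_tuples(
    list(permutations(students, 2)), names=["st_source", "st_dest"]
)

# Missing pairs get a count of 0; self-replies, if any, fall outside all_pairs.
df_full = counts.reindex(all_pairs, fill_value=0).rename("reply_count").reset_index()
print(df_full)
```

For the example data this reproduces the six rows listed in the question, including the a, c row with a count of 0.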
