根据其分组属性过滤DataFrame [英] Filtering DataFrame based on its groups properties

查看:112
本文介绍了根据其分组属性过滤DataFrame的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

假设我们有问题跟踪日志,我们想找出问题所有者(记录问题时间最多的人)

Let's say we have issue tracker logs and we want to find out issues owners (guys who logged the most time to the issue)

  1. 用户可以多次记录同一问题的时间
  2. 如果2个用户同时登录,则均为所有者

所以我们有一些示例数据:

So we have some sample data:

df = pd.DataFrame([
        [1, 10, 'John'],
        [1, 20, 'John'],
        [1, 30, 'Tom'],
        [1, 10, 'Bob'],
        [2, 25, 'John'],
        [2, 15, 'Bob']], columns = ['IssueKey','TimeSpent','User'])

作为输出,我们想要这样的东西:

As the output we want something like this:

issues_owners = pd.DataFrame([
        [1, 30, 'John'],
        [1, 30, 'Tom'],
        [2, 25, 'John']], columns = ['IssueKey','TimeSpent','User'])

  1. John和Tom都是问题1的所有者,因为他们都花了30分钟的时间.
  2. John实际上在另外2天中处理了问题1
  3. John也是问题2
  4. 的所有者
  5. 鲍勃很懒,没有任何问题:)
  1. Both John and Tom are owners of issue 1, as they both spent 30 minutes on it.
  2. John actually worked on issue 1 on 2 separate days
  3. John is also the owner of the issue 2
  4. Bob is lazy and doesn't own any issues :)

我想到的东西让人很恶心(我是Python的新手):

What I came up with feels quite disgusting (I'm relatively new to Python):

df = df.groupby(['IssueKey', 'User']).sum().reset_index()
maxTimesPerIssue = df.groupby('IssueKey')['TimeSpent'].max().reset_index()
maxTimesPerIssue = dict(zip(maxTimesPerIssue['IssueKey'], maxTimesPerIssue['TimeSpent']))
df['MaxTimePerIssue'] = [maxTimesPerIssue[key] for key in df['IssueKey']]
df = df[df.MaxTimePerIssue == df.TimeSpent]
df = df.drop(columns=['MaxTimePerIssue'])   

我不喜欢自己的Python代码:

What I dislike about my Python code:

  1. maxTimesPerIssue出现在处理df破坏思维过程(或管道)的过程中
  2. maxTimesPerIssue本身的创建有点混乱
  3. 添加MaxTimePerIssue df
  4. 由于使用了许多低级的东西,例如reset_index()list()dict(),列表解析,删除列
  5. ,因此,绝对比C#版本的解释器少.
  1. maxTimesPerIssue appears in the middle of processing the df disrupting the thought process (or pipeline)
  2. The creation of maxTimesPerIssue itself is kind of messy
  3. Adding MaxTimePerIssue the df
  4. It's definitely way less self-explanatory than the C# version, due to using lots of low level stuff like: reset_index(), list(), dict(), list comprehensions, dropping columns

有人可以帮我清理吗?

推荐答案

类似于groupby的内容适用于您的数据:

Something along the lines of a groupby will work for your data:

i = df.groupby(['IssueKey', 'User']).TimeSpent.sum()
j = i.groupby(level=0).transform('max')

i[i == j].reset_index()

   IssueKey  User  TimeSpent
0         1  John         30
1         1   Tom         30
2         2  John         25

这篇关于根据其分组属性过滤DataFrame的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆