根据其分组属性过滤DataFrame [英] Filtering DataFrame based on its groups properties
本文介绍了根据其分组属性过滤DataFrame的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!
问题描述
假设我们有问题跟踪日志,我们想找出问题所有者(记录问题时间最多的人)
Let's say we have issue tracker logs and we want to find out issues owners (guys who logged the most time to the issue)
- 用户可以多次记录同一问题的时间
- 如果2个用户同时登录,则均为所有者
所以我们有一些示例数据:
So we have some sample data:
df = pd.DataFrame([
[1, 10, 'John'],
[1, 20, 'John'],
[1, 30, 'Tom'],
[1, 10, 'Bob'],
[2, 25, 'John'],
[2, 15, 'Bob']], columns = ['IssueKey','TimeSpent','User'])
作为输出,我们想要这样的东西:
As the output we want something like this:
issues_owners = pd.DataFrame([
[1, 30, 'John'],
[1, 30, 'Tom'],
[2, 25, 'John']], columns = ['IssueKey','TimeSpent','User'])
- John和Tom都是问题
1
的所有者,因为他们都花了30分钟的时间. - John实际上在另外2天中处理了问题
1
- John也是问题
2
的所有者
- 鲍勃很懒,没有任何问题:)
- Both John and Tom are owners of issue
1
, as they both spent 30 minutes on it. - John actually worked on issue
1
on 2 separate days - John is also the owner of the issue
2
- Bob is lazy and doesn't own any issues :)
我想到的东西让人很恶心(我是Python的新手):
What I came up with feels quite disgusting (I'm relatively new to Python):
df = df.groupby(['IssueKey', 'User']).sum().reset_index()
maxTimesPerIssue = df.groupby('IssueKey')['TimeSpent'].max().reset_index()
maxTimesPerIssue = dict(zip(maxTimesPerIssue['IssueKey'], maxTimesPerIssue['TimeSpent']))
df['MaxTimePerIssue'] = [maxTimesPerIssue[key] for key in df['IssueKey']]
df = df[df.MaxTimePerIssue == df.TimeSpent]
df = df.drop(columns=['MaxTimePerIssue'])
我不喜欢自己的Python代码:
What I dislike about my Python code:
-
maxTimesPerIssue
出现在处理df
破坏思维过程(或管道)的过程中 -
maxTimesPerIssue
本身的创建有点混乱 - 添加
MaxTimePerIssue
df
- 由于使用了许多低级的东西,例如
reset_index()
,list()
,dict()
,列表解析,删除列 ,因此,绝对比C#版本的解释器少.
maxTimesPerIssue
appears in the middle of processing thedf
disrupting the thought process (or pipeline)- The creation of
maxTimesPerIssue
itself is kind of messy - Adding
MaxTimePerIssue
thedf
- It's definitely way less self-explanatory than the C# version, due to using lots of low level stuff like:
reset_index()
,list()
,dict()
, list comprehensions, dropping columns
有人可以帮我清理吗?
推荐答案
类似于groupby
的内容适用于您的数据:
Something along the lines of a groupby
will work for your data:
i = df.groupby(['IssueKey', 'User']).TimeSpent.sum()
j = i.groupby(level=0).transform('max')
i[i == j].reset_index()
IssueKey User TimeSpent
0 1 John 30
1 1 Tom 30
2 2 John 25
这篇关于根据其分组属性过滤DataFrame的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!
查看全文