groupby to return nth group - NOT row


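Problem description

I am trying to group by two factors in a long (>2M rows) dataset.

Data background

The second factor is effectively a test date: for a given sample (the first group), the sample can be retested. Testing can change the sample, however, so it is useful to be able to selectively pull a batch of test data from the second test (or third, fourth, etc.).

Earlier question and current code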

I asked a question a while ago and got a function, with many thanks to Praveen: Sorting and ranking by dates, on a group in a pandas df. However, it is incredibly slow: running on a 1.2M row dataset, it has so far taken over 20 minutes on a JupyterHub server with 96 GB of RAM, and is still going.

df1 = df.sort_values(['id', 'date'], ascending=[True, False])
df1['date_rank'] = df1.groupby(['id'])['date'].transform(lambda x: list(map(lambda y: dict(map(reversed, dict(enumerate(x.unique())).items()))[y]+1,x)) )
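
For comparison, here is a minimal sketch of how the same date_rank column might be produced without a Python-level lambda, using a dense rank within each id. It assumes the date strings are day-first timestamps (dd/mm/yyyy), which is a guess from the sample data, and it uses a small subset of the sample data from this question. It has not been timed on the full dataset, so treat any speed-up as something to measure rather than a given.

```python
import pandas as pd

# Subset of the question's sample data (ids 1 and 3 only).
df = pd.DataFrame({
    "id": [1, 1, 1, 3, 3, 3],
    "value": [0.01, 0.4, 0.5, 0.1, 0.2, 0.3],
    "date": ["10/01/2017 15:45:00", "10/01/2017 15:45:00", "05/01/2017 15:56:00",
             "06/01/2017 11:02:00", "05/01/2017 09:37:00", "05/01/2017 09:37:00"],
})

# Assumption: the strings are day-first timestamps.
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y %H:%M:%S")

# Dense rank per id: rows sharing a date share a rank, and ranks have no gaps.
# ascending=False gives the latest date rank 1, matching the descending sort above.
df["date_rank"] = (
    df.groupby("id")["date"].rank(method="dense", ascending=False).astype(int)
)

print(df)
```

Parsing the dates first matters because string comparison of dd/mm/yyyy values is not chronological in general.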

Current idea

Instead of labelling the test dates with a number, as was my first idea, it feels like this is close to achievable with cumcount, groupby.nth or get_group, but I can't get there. Anybody have any ideas? The dataset below is still relevant:

import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 7],
    'value': [.01, .4, .5, .7, .77, .1, .2, .3, .11, .21, .4, .01, 3, .5, .8, .9, .1],
    'date': ['10/01/2017 15:45:00', '10/01/2017 15:45:00', '05/01/2017 15:56:00',
             '11/01/2017 15:22:00', '11/01/2017 15:22:00', '06/01/2017 11:02:00',
             '05/01/2017 09:37:00', '05/01/2017 09:37:00', '05/01/2017 09:55:00',
             '05/01/2017 09:55:00', '05/01/2017 10:08:00', '05/01/2017 10:09:00',
             '03/02/2017 08:55:00', '03/02/2017 09:15:00', '03/02/2017 09:31:00',
             '09/01/2017 15:42:00', '19/01/2017 16:34:00']})

I'm trying to generate data like that in the date_rank column, although I'm happy not to generate a column at all; I just want to be able to select the rows containing the second test date data.
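
One possible way to pull the original rows for the n-th test date without keeping a rank column is sketched below. The helper name rows_for_nth_date and the test_no column are made up for illustration, and the day-first date format is an assumption. The idea is to number the distinct (id, date) pairs with cumcount and merge the wanted pairs back onto the full frame.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 7],
    "value": [.01, .4, .5, .7, .77, .1, .2, .3, .11, .21, .4, .01, 3, .5, .8, .9, .1],
    "date": ["10/01/2017 15:45:00", "10/01/2017 15:45:00", "05/01/2017 15:56:00",
             "11/01/2017 15:22:00", "11/01/2017 15:22:00", "06/01/2017 11:02:00",
             "05/01/2017 09:37:00", "05/01/2017 09:37:00", "05/01/2017 09:55:00",
             "05/01/2017 09:55:00", "05/01/2017 10:08:00", "05/01/2017 10:09:00",
             "03/02/2017 08:55:00", "03/02/2017 09:15:00", "03/02/2017 09:31:00",
             "09/01/2017 15:42:00", "19/01/2017 16:34:00"],
})
# Assumption: the strings are day-first timestamps.
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y %H:%M:%S")


def rows_for_nth_date(frame, n):
    """Return the rows of `frame` at the n-th (0-based) distinct date of each id."""
    # One row per (id, date) pair, in chronological order within each id.
    pairs = frame[["id", "date"]].drop_duplicates().sort_values(["id", "date"])
    # Number the distinct dates within each id: 0 = first test, 1 = second, ...
    pairs["test_no"] = pairs.groupby("id").cumcount()
    wanted = pairs.loc[pairs["test_no"] == n, ["id", "date"]]
    # The inner merge keeps only rows whose (id, date) pair was selected;
    # ids with fewer than n + 1 test dates drop out.
    return frame.merge(wanted, on=["id", "date"], how="inner")


second = rows_for_nth_date(df, 1)
print(second)
```

Because the numbering is done on the de-duplicated pairs, its cost scales with the number of distinct (id, date) combinations rather than with the number of rows.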

For example:

df.groupby(['id','date'])['value'].unique()

Returns:

id  date               
1   05/01/2017 15:56:00           [0.5]
    10/01/2017 15:45:00     [0.01, 0.4]
2   11/01/2017 15:22:00     [0.7, 0.77]
3   05/01/2017 09:37:00      [0.2, 0.3]
    06/01/2017 11:02:00           [0.1]
4   05/01/2017 09:55:00    [0.11, 0.21]
5   05/01/2017 10:08:00           [0.4]
    05/01/2017 10:09:00          [0.01]
6   03/02/2017 08:55:00           [3.0]
    03/02/2017 09:15:00           [0.5]
    03/02/2017 09:31:00           [0.8]
7   09/01/2017 15:42:00           [0.9]
    19/01/2017 16:34:00           [0.1]

I would want to be able to select (for example) the second group of a group where available e.g.:

1 - 10/01/2017 15:45:00 [0.01, 0.4]
3 - 06/01/2017 11:02:00 [0.1]
5 - 05/01/2017 10:09:00 [0.01]
6 - 03/02/2017 09:15:00 [0.5]

Any help or thoughts greatly appreciated.

Solution

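You can reset_index on the grouped result from the question, then use GroupBy + nth:

grouped = df.groupby(['id', 'date'])['value'].unique()
res = grouped.reset_index().groupby('id').nth(1)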

print(res)

                   date        value
id                                  
1   10/01/2017 15:45:00  [0.01, 0.4]
3   06/01/2017 11:02:00        [0.1]
5   05/01/2017 10:09:00       [0.01]
6   03/02/2017 09:15:00        [0.5]
7   19/01/2017 16:34:00        [0.1]
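
Two caveats may be worth checking before relying on this. First, as far as I know, newer pandas releases (2.0 and later) treat nth as a filter that keeps the original row index, so the result may not be indexed by id as shown above. Second, the groups are ordered by the date strings, which is only chronological by luck for dd/mm/yyyy text. The sketch below is one way around both, assuming day-first dates: it parses the dates and picks the second row per id with cumcount instead of nth.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 7],
    "value": [.01, .4, .5, .7, .77, .1, .2, .3, .11, .21, .4, .01, 3, .5, .8, .9, .1],
    "date": ["10/01/2017 15:45:00", "10/01/2017 15:45:00", "05/01/2017 15:56:00",
             "11/01/2017 15:22:00", "11/01/2017 15:22:00", "06/01/2017 11:02:00",
             "05/01/2017 09:37:00", "05/01/2017 09:37:00", "05/01/2017 09:55:00",
             "05/01/2017 09:55:00", "05/01/2017 10:08:00", "05/01/2017 10:09:00",
             "03/02/2017 08:55:00", "03/02/2017 09:15:00", "03/02/2017 09:31:00",
             "09/01/2017 15:42:00", "19/01/2017 16:34:00"],
})
# Assumption: the strings are day-first timestamps.
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y %H:%M:%S")

# One row per (id, date) with the unique values collected into arrays.
# groupby sorts its keys, so dates are chronological within each id.
agg = df.groupby(["id", "date"])["value"].unique().reset_index()

# cumcount gives the 0-based position of each date within its id;
# position 1 is the second test date. ids with a single date drop out.
second = agg[agg.groupby("id").cumcount() == 1].set_index("id")

print(second)
```

Changing the 1 to 2 would select the third test date, and so on.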
