R group_by() + rleid() equivalent in Python

Question

I have the following data frame in Python:

import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict({'measurement_id': np.repeat([1, 2], [6, 6]),
                             'min': np.concatenate([np.repeat([1, 2, 3], [2, 2, 2]),
                                                    np.repeat([1, 2, 3], [2, 2, 2])]),
                             'obj': list('AB' * 6),
                             'var': [1, 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1]})

First, within each group defined by obj, I'd like to assign an id to each unique run of the measurement_id and var columns. Whenever the value of either column changes, a new run starts and should be assigned a new id. The result should be:

df['rleid_output'] = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 3]

Then, for each run defined by rleid_output (within each obj), I'd like to check how many minutes (the min column) the run lasted, which gives the expected_output column:

df['expected_output'] = [2, 2, 2, 2, 1, 1, 2, 3, 2, 3, 1, 3]

In R, I'd proceed as follows:

library(dplyr)

df <- data.frame(measurement_id = rep(1:2, each = 6),
           min = rep(rep(1:3, each = 2), 2),
           object = rep(LETTERS[1:2], 6),
           var = c(1, 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1))
df %>%
  group_by(object) %>%
  mutate(rleid = data.table::rleid(measurement_id, var)) %>%
  group_by(object, rleid) %>%
  mutate(expected_output = last(min) - first(min) + 1)

So the main thing I need is an equivalent of R's data.table::rleid that works with a pandas pd.DataFrame.groupby clause. Any ideas how to solve this?

Edit: a new, updated example of the data frame:

df = pd.DataFrame.from_dict({'measurement_id': np.repeat([1, 2], [6, 6]),
                         'min': np.concatenate([np.repeat([1, 2, 3], [2, 2, 2]), 
                                                np.repeat([1, 2, 3], [2, 2, 2])]),
                         'obj': list('AB' * 6),
                         'var': [1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 1, 1]})
df['rleid_output'] = [1, 1, 2, 1, 3, 2, 4, 3, 4, 3, 5, 3]
df['expected_output'] = [1, 2, 1, 2, 1, 1, 2, 3, 2, 3, 1, 3]

Recommended answer

Updated answer

The issue is that, within each group of measurement_id, obj, var, the values in the min column have to be consecutive. We can check this by grouping on measurement_id, obj, var and testing whether the difference between successive min values equals 1 (the first row of each group counts as 1). Summing the rows that pass the check within each group gives the duration stored in expected_output:

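# True where a row continues the previous minute within its group.
# transform (rather than apply) keeps the result aligned with df's index,
# which apply does not guarantee on recent pandas versions.
df['grouper'] = (df.groupby(['measurement_id', 'obj', 'var'])['min']
                 .transform(lambda x: x.diff().fillna(1).eq(1))
                )

# Count the consecutive minutes per group and broadcast the count to every row.
df['expected_output'] = (
    df.groupby(['measurement_id', 'obj', 'var'])['grouper'].transform('sum').astype(int)
)

df = df.drop(columns='grouper')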

    measurement_id  min obj  var  expected_output
0                1    1   A    1                1
1                1    1   B    2                2
2                1    2   A    2                1
3                1    2   B    2                2
4                1    3   A    1                1
5                1    3   B    1                1
6                2    1   A    2                2
7                2    1   B    1                3
8                2    2   A    2                2
9                2    2   B    1                3
10               2    3   A    1                1
11               2    3   B    1                3


Old answer, following OP's logic

We can achieve this by using GroupBy.diff to get your rleid_output, which is basically an identifier that marks each change of var within each measurement_id and obj group.

After that, use GroupBy.nunique to count the minutes:

rleid_output = df.groupby(['measurement_id', 'obj'])['var'].diff().abs().bfill()
df['expected_output'] = (df.groupby(['measurement_id', 'obj', rleid_output])['min']
                         .transform('nunique'))

    measurement_id  min obj  var  expected_output
0                1    1   A    1                2
1                1    1   B    2                2
2                1    2   A    1                2
3                1    2   B    2                2
4                1    3   A    2                1
5                1    3   B    1                1
6                2    1   A    2                2
7                2    1   B    1                3
8                2    2   A    2                2
9                2    2   B    1                3
10               2    3   A    1                1
11               2    3   B    1                3
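
For a more literal port of the R pipeline, the following is a minimal sketch rather than a definitive implementation. It assumes that a run id can be built by comparing each row with the previous row of the same group and taking a cumulative sum of the changes. The helper name rleid_by_group and the output columns rleid and duration are made up for this illustration; they are not part of pandas or of the answers above. The data is the updated example from the question.

```python
import numpy as np
import pandas as pd


def rleid_by_group(df, group_cols, run_cols):
    """Run ids over run_cols, restarted within each group of group_cols.

    Hypothetical helper that mimics data.table::rleid under dplyr::group_by.
    """
    # Previous row's values within the same group (NaN on each group's first row).
    prev = df.groupby(group_cols)[run_cols].shift()
    # A new run starts when any run column differs from the previous row.
    # The first row of a group differs from NaN, so every group starts at id 1.
    changed = df[run_cols].ne(prev).any(axis=1)
    # Cumulative count of run starts, computed separately per group.
    return changed.astype(int).groupby([df[c] for c in group_cols]).cumsum()


df = pd.DataFrame({
    'measurement_id': np.repeat([1, 2], 6),
    'min': np.tile(np.repeat([1, 2, 3], 2), 2),
    'obj': list('AB' * 6),
    'var': [1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 1, 1],
})

# Step 1: group_by(object) + rleid(measurement_id, var)
df['rleid'] = rleid_by_group(df, ['obj'], ['measurement_id', 'var'])

# Step 2: group_by(object, rleid) + last(min) - first(min) + 1
runs = df.groupby(['obj', 'rleid'])['min']
df['duration'] = runs.transform('last') - runs.transform('first') + 1

print(df)
```

On this data the rleid and duration columns should match the rleid_output and expected_output lists given in the question. Note that a missing value in a run column never compares equal to the previous row, so it would always start a new run in this sketch.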
