How can I drop consecutive duplicate rows in one column based on condition/grouping from another column?


Question

I have a large dataframe (approx. 10k rows) whose first few rows look like the following, which I'll call df_a:

logtime             | zone  | value   
01/01/2017 06:05:00 | 0     | 14.5
01/01/2017 06:05:00 | 1     | 14.5
01/01/2017 06:05:00 | 2     | 17.0
01/01/2017 06:25:00 | 0     | 14.5
01/01/2017 06:25:00 | 1     | 14.5
01/01/2017 06:25:00 | 2     | 10.0
01/01/2017 06:50:00 | 0     | 10.0
01/01/2017 06:50:00 | 1     | 10.0
01/01/2017 06:50:00 | 2     | 10.0
01/01/2017 07:50:00 | 0     | 14.5
01/01/2017 07:50:00 | 1     | 14.5
01/01/2017 07:50:00 | 2     | 14.5
etc.

I am looking to drop consecutive duplicates within each zone, so that I am left only with information about how each zone's value changes. For example, if zone 1 is at 14.5 for two consecutive logtimes, the repeated row is removed, and the next row kept for zone 1 is the one where the value changes to 10.0. The result should be a dataframe like:

logtime             | zone  | value   
01/01/2017 06:05:00 | 0     | 14.5
01/01/2017 06:05:00 | 1     | 14.5
01/01/2017 06:05:00 | 2     | 17.0
01/01/2017 06:25:00 | 2     | 10.0
01/01/2017 06:50:00 | 0     | 10.0
01/01/2017 06:50:00 | 1     | 10.0
01/01/2017 07:50:00 | 0     | 14.5
01/01/2017 07:50:00 | 1     | 14.5
01/01/2017 07:50:00 | 2     | 14.5
etc.

My understanding is that drop_duplicates keeps only unique values, whether or not the repeats are consecutive, so it doesn't work for my aim.

I also tried an approach using .loc and shift:

removeduplicates = df.loc[ (df.logtime != df.logtime.shift(1)) | (df.zone != df.zone.shift(1)) | (df.value != df.value.shift(1))]

However, this neither raises an error nor produces the desired output. Thanks!

Recommended answer

You can create a Boolean mask that is True wherever the diff between successive values within each zone group is not equal to 0. The first row of each group has a NaN diff, which also compares as not equal to 0, so it is kept:

print (df[df.groupby(['zone']).value.diff().ne(0)])
                logtime  zone  value
0  01/01/2017 06:05:00      0   14.5
1  01/01/2017 06:05:00      1   14.5
2  01/01/2017 06:05:00      2   17.0
5  01/01/2017 06:25:00      2   10.0
6  01/01/2017 06:50:00      0   10.0
7  01/01/2017 06:50:00      1   10.0
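
To make the difference between the question's attempt and the grouped mask concrete, here is a minimal, self-contained sketch. It rebuilds the twelve sample rows from the question; the names global_mask, prev_value, grouped_mask and result are illustrative choices, not part of the original post. It also uses a grouped shift rather than diff, which is a swapped-in variant of the same idea and should, as far as I can tell, also work when the value column is not numeric.

```python
import pandas as pd

# Sample data copied from the question (the twelve rows shown for df_a).
df = pd.DataFrame({
    "logtime": ["01/01/2017 06:05:00"] * 3 + ["01/01/2017 06:25:00"] * 3
             + ["01/01/2017 06:50:00"] * 3 + ["01/01/2017 07:50:00"] * 3,
    "zone": [0, 1, 2] * 4,
    "value": [14.5, 14.5, 17.0, 14.5, 14.5, 10.0,
              10.0, 10.0, 10.0, 14.5, 14.5, 14.5],
})

# Ungrouped comparison, as in the question's attempt: each row is compared
# with the row directly above it, which belongs to a different zone,
# so the mask is True for every row and nothing is dropped.
global_mask = (df["zone"] != df["zone"].shift()) | (df["value"] != df["value"].shift())

# Grouped comparison: each row is compared with the previous row of the
# same zone. The first row of each zone is compared with NaN and is kept.
prev_value = df.groupby("zone")["value"].shift()
grouped_mask = df["value"] != prev_value

result = df[grouped_mask]
print(result)
```

On these twelve rows the ungrouped mask keeps everything, while the grouped mask keeps the rows with index 0, 1, 2, 5, 6, 7, 9, 10 and 11, which matches the desired output in the question. The grouped shift mask and the diff().ne(0) mask from the answer select the same rows here.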
