How to keep the first two duplicates in a pandas dataframe?


Question

I have a question about finding duplicates in a dataframe and removing them based on a specific column. Here is what I am trying to accomplish:

Is it possible to remove duplicates but keep the first 2?

Here is an example of my current dataframe, called df. The notes in parentheses at the end of some rows show what I want to happen to each row.

Note: If 'Roll' = 1, I want to look at the Date column and see if there is a second row with the same Date. If there is, keep those two rows and delete any others with that Date.

       Date    Open    High     Low   Close  Roll  Dupes
1  19780106  236.00  237.50  234.50  235.50     0    NaN
2  19780113  235.50  239.00  235.00  238.25     0    NaN
3  19780120  238.00  239.00  234.50  237.00     0    NaN
4  19780127  237.00  238.50  235.50  236.00     1    NaN (KEEP)  
5  19780203  236.00  236.00  232.25  233.50     0    NaN (KEEP)
6  19780127  237.00  238.50  235.50  236.00     0    NaN (KEEP)
7  19780203  236.00  236.00  232.25  233.50     0    NaN (DELETE)
8  19780127  237.00  238.50  235.50  236.00     0    NaN (DELETE)
9  19780203  236.00  236.00  232.25  233.50     0    NaN (DELETE)

This is what I currently use to remove the dupes, but it removes all of them (obviously), keeping only one row per Date:

df = df.drop_duplicates('Date')
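
For reference, here is a minimal sketch of what that call does. It uses a cut-down, illustrative version of the data above (only the Date and Roll columns, and only the repeated dates), so the frame itself is an assumption rather than the asker's real df.

```python
import pandas as pd

# Illustrative subset of the sample data: only the repeated dates,
# and only the two columns that matter for de-duplication.
df = pd.DataFrame({
    "Date": [19780127, 19780203, 19780127, 19780203, 19780127],
    "Roll": [1, 0, 0, 0, 0],
})

# keep="first" is the default, so exactly one row per Date survives.
deduped = df.drop_duplicates(subset="Date")
print(deduped)
```

Every Date ends up with a single row, which is why the second 19780127 row is lost.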

I forgot to mention something: the only duplicates I want to keep are those where column 'Roll' = 1. In that case, keep that row and the next one that matches it on column 'Date'.

Recommended answer

Assuming Roll can only take the values 0 and 1, if you do

df.groupby(['Date', 'Roll'], as_index=False).first() 

you will get two rows for each date that has a row with Roll = 1 as well as rows with Roll = 0, and only one row for each date that has only Roll = 0, which I think is what you want.
I passed as_index=False so that the group keys don't end up in the index, as discussed in your comment.
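
As a rough check, the approach can be run against the sample data from the question. The sketch below rebuilds that data by hand with only the Date, Close and Roll columns (an assumption made to keep it short) and applies the same groupby call.

```python
import pandas as pd

# Sample rows from the question, reduced to three columns for brevity.
df = pd.DataFrame({
    "Date": [19780106, 19780113, 19780120, 19780127, 19780203,
             19780127, 19780203, 19780127, 19780203],
    "Close": [235.50, 238.25, 237.00, 236.00, 233.50,
              236.00, 233.50, 236.00, 233.50],
    "Roll": [0, 0, 0, 1, 0, 0, 0, 0, 0],
})

# One row per (Date, Roll) pair: a date with both Roll = 1 and Roll = 0
# rows keeps two rows, a date with only Roll = 0 rows keeps one.
result = df.groupby(["Date", "Roll"], as_index=False).first()
print(result)
```

Two caveats apply. groupby sorts by the group keys by default, so the original row order and index are not preserved (passing sort=False keeps groups in order of first appearance). Also, first() takes the first non-null value of each column within a group, which can differ from the first row if some columns contain NaN.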
