How to keep the first two duplicates in a pandas dataframe?
Question
I have a question about finding duplicates in a dataframe and removing them based on a specific column. Here is what I am trying to accomplish:
Is it possible to remove duplicates but keep the first two?
Here is an example of my current dataframe, called df. Take a look at the bracketed notes I have placed next to the rows to give you an idea.
Note: If 'Roll' = 1, then I want to look at the Date column and see if there is a second duplicate Date in that column. Keep those two rows and delete any others.
   Date      Open    High    Low     Close   Roll  Dupes
1  19780106  236.00  237.50  234.50  235.50  0     NaN
2  19780113  235.50  239.00  235.00  238.25  0     NaN
3  19780120  238.00  239.00  234.50  237.00  0     NaN
4  19780127  237.00  238.50  235.50  236.00  1     NaN  (KEEP)
5  19780203  236.00  236.00  232.25  233.50  0     NaN  (KEEP)
6  19780127  237.00  238.50  235.50  236.00  0     NaN  (KEEP)
7  19780203  236.00  236.00  232.25  233.50  0     NaN  (DELETE)
8  19780127  237.00  238.50  235.50  236.00  0     NaN  (DELETE)
9  19780203  236.00  236.00  232.25  233.50  0     NaN  (DELETE)
This is what currently removes the dupes, but it removes all of them (obviously):
df = df.drop_duplicates('Date')
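For reference, here is a minimal sketch of why this call drops too much. It uses a cut-down copy of the frame above (only the Date and Roll columns, an assumption made for brevity). With its default keep='first', drop_duplicates retains exactly one row per distinct Date, so the second 19780127 row is lost as well.

```python
import pandas as pd

# Cut-down copy of the example frame (Date and Roll only, for brevity)
df = pd.DataFrame(
    {
        "Date": [19780106, 19780113, 19780120, 19780127, 19780203,
                 19780127, 19780203, 19780127, 19780203],
        "Roll": [0, 0, 0, 1, 0, 0, 0, 0, 0],
    },
    index=range(1, 10),
)

# Default keep='first': only the first row of each distinct Date survives
deduped = df.drop_duplicates(subset="Date")
print(deduped)  # rows 1 to 5 remain; row 6, which should be kept, is gone
```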
I forgot to mention something: the only duplicate I want to keep is when column 'Roll' = 1. If it is, then keep that row and the next one that matches based on column 'Date'.
Recommended answer
Assuming Roll can only take the values 0 and 1, if you do
df.groupby(['Date', 'Roll'], as_index=False).first()
you will get two rows for dates for which one of the rows had Roll = 1, and only one row for dates which have only Roll = 0, which I think is what you want. Passing as_index=False keeps the group keys from ending up in the index, as discussed in your comment.
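A small runnable sketch of this approach, using a cut-down copy of the question's frame (Date, Close and Roll only, an assumption made for brevity). The second half is not part of the answer: it is a hypothetical order-preserving variant that assumes a date containing a Roll = 1 row should keep its first two rows in original order, and every other date only its first.

```python
import pandas as pd

# Cut-down copy of the question's frame (Date, Close and Roll only)
df = pd.DataFrame(
    {
        "Date": [19780106, 19780113, 19780120, 19780127, 19780203,
                 19780127, 19780203, 19780127, 19780203],
        "Close": [235.50, 238.25, 237.00, 236.00, 233.50,
                  236.00, 233.50, 236.00, 233.50],
        "Roll": [0, 0, 0, 1, 0, 0, 0, 0, 0],
    },
    index=range(1, 10),
)

# The answer's approach: one row per (Date, Roll) pair.
# The result is sorted by the group keys and gets a fresh integer index.
result = df.groupby(["Date", "Roll"], as_index=False).first()
print(result)  # 6 rows; 19780127 appears twice (Roll 0 and Roll 1)

# Hypothetical variant (not from the answer) that keeps the original
# index and row order: a date keeps 1 row, or 2 if any of its rows
# has Roll == 1.
has_roll = df.groupby("Date")["Roll"].transform("max")  # 1 if date has a Roll row
rank_in_date = df.groupby("Date").cumcount()            # 0, 1, 2, ... within each Date
kept = df[rank_in_date <= has_roll]
print(kept.index.tolist())  # [1, 2, 3, 4, 5, 6]
```

Two things to keep in mind with the groupby version: the original index is not preserved, and first() takes the first non-null value of each column within a group, so on data with missing values a result row can combine values from different source rows.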