How to find duplicates in a pandas DataFrame
Question
Suppose I have the following Series in pandas:
>>> p
0 0.0
1 0.0
2 0.0
3 0.3
4 0.3
5 0.3
6 0.3
7 0.3
8 1.0
9 1.0
10 1.0
11 0.2
12 0.2
13 0.3
14 0.3
15 0.3
I need to identify each run of consecutive duplicates by its first and last index. In the example above, the first run of 0.3 (index 3 to 7) must be identified separately from the last run of 0.3 (index 13 to 15).
Using Series.duplicated is insufficient because:
* keep='first' marks the first instance of each value as False, but leaves index 13 as True because it is not the first appearance of 0.3.
* The same goes for keep='last'.
* keep=False just marks all of the entries as True.
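The limitation described above can be reproduced with a small sketch. It assumes the Series from the question is rebuilt by hand under the name `p`; the variable names `first`, `last` and `both` are chosen here for illustration only.

```python
import pandas as pd

# Rebuild the example Series from the question (default integer index 0..15).
p = pd.Series([0.0] * 3 + [0.3] * 5 + [1.0] * 3 + [0.2] * 2 + [0.3] * 3)

# duplicated() looks at the whole Series, not at consecutive runs.
first = p.duplicated(keep='first')
last = p.duplicated(keep='last')
both = p.duplicated(keep=False)

# Index 13 starts a new run of 0.3, yet it is flagged as a duplicate
# because 0.3 already appeared at index 3.
print(first.loc[13])

# Every value in this example occurs more than once, so keep=False
# flags all rows and carries no information about runs.
print(both.all())
```

None of the three masks separates the run at 3 to 7 from the run at 13 to 15, which is why a comparison with the neighbouring row is needed instead.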
Thanks!
Recommended answer
I believe you need a trick: compare the values with their shifted copy for inequality using ne, take the cumsum so that each consecutive run gets its own label, and finally use drop_duplicates to get the first and last index of each run:
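# df is assumed to hold the series from the question in column 'a'
# label each run of consecutive equal values: 1, 1, 1, 2, 2, ...
s = df['a'].ne(df['a'].shift()).cumsum()
# first and last index of each run
a = s.drop_duplicates().index
b = s.drop_duplicates(keep='last').index
# use a new name so the original df is not overwritten
out = pd.DataFrame({'first': a, 'last': b})
print(out)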
first last
0 0 2
1 3 7
2 8 10
3 11 12
4 13 15
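To make the individual steps visible, here is a self-contained sketch of the same approach. It assumes the question's data sits in a column named `a`; the intermediate names `shifted`, `changed`, `run_id` and `runs` are introduced here for illustration.

```python
import pandas as pd

# Example data from the question, placed in column 'a'.
df = pd.DataFrame({'a': [0.0] * 3 + [0.3] * 5 + [1.0] * 3 + [0.2] * 2 + [0.3] * 3})

shifted = df['a'].shift()      # value of the previous row; NaN for the first row
changed = df['a'].ne(shifted)  # True wherever a new run starts
run_id = changed.cumsum()      # running count of run starts: one label per run

# The first and last occurrence of each label give the run boundaries.
runs = pd.DataFrame({
    'first': run_id.drop_duplicates().index,
    'last': run_id.drop_duplicates(keep='last').index,
})
print(runs)
```

One caveat: because NaN compares as not equal to NaN, consecutive missing values would each be treated as a separate run with this comparison.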
If you also want the duplicated value in a new column, change the solution slightly to use duplicated:
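s = df['a'].ne(df['a'].shift()).cumsum()
# rows where a run label appears for the first time carry the run's value
a = df.loc[~s.duplicated(), 'a']
b = s.drop_duplicates(keep='last')
# use a new name so the original df is not overwritten
out = pd.DataFrame({'first': a.index, 'last': b.index, 'val': a})
print(out)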
first last val
0 0 2 0.0
3 3 7 0.3
8 8 10 1.0
11 11 12 0.2
13 13 15 0.3
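If you need the run labels as a new column on the original data (the column holds a run identifier, not a count, despite its name):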
df['count'] = df['a'].ne(df['a'].shift()).cumsum()
print (df)
a count
0 0.0 1
1 0.0 1
2 0.0 1
3 0.3 2
4 0.3 2
5 0.3 2
6 0.3 2
7 0.3 2
8 1.0 3
9 1.0 3
10 1.0 3
11 0.2 4
12 0.2 4
13 0.3 5
14 0.3 5
15 0.3 5
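Once each row carries a run label, the runs can also be summarized with groupby. This is a different technique from the drop_duplicates approach in the answer, shown only as a sketch; the column name `run` and the output column names are assumptions made here, and named aggregation requires pandas 0.25 or later.

```python
import pandas as pd

# Example data from the question, placed in column 'a'.
df = pd.DataFrame({'a': [0.0] * 3 + [0.3] * 5 + [1.0] * 3 + [0.2] * 2 + [0.3] * 3})

# Label each run of consecutive equal values.
df['run'] = df['a'].ne(df['a'].shift()).cumsum()

# reset_index() exposes the row labels as a column named 'index',
# so they can be aggregated alongside the values.
summary = (
    df.reset_index()
      .groupby('run')
      .agg(first=('index', 'first'),
           last=('index', 'last'),
           val=('a', 'first'),
           length=('a', 'size'))
)
print(summary)
```

This gives the boundaries, the repeated value and the run length in one pass, at the cost of a groupby instead of two drop_duplicates calls.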