Pandas: How to filter dataframe for duplicate items that occur at least n times in a dataframe
Question
I have a Pandas DataFrame that contains duplicate entries; some items are listed twice or three times. I would like to filter it so that it only shows items that are listed at least n times:
- The DataFrame contains 3 columns: ['colA', 'colB', 'colC']. Only 'colB' should be considered when determining whether an item is listed multiple times.
- Note: this is not drop_duplicates(). It's the opposite: I would like to drop items that appear in the dataframe fewer than n times.
- The end result should list each item only once.
Recommended answer
You can use value_counts to get the count of each item, build a boolean mask from those counts, and then pass the index of the matching counts to isin to test membership:
In [3]:
import pandas as pd
df = pd.DataFrame({'a':[0,0,0,1,2,2,3,3,3,3,3,3,4,4,4]})
df
Out[3]:
a
0 0
1 0
2 0
3 1
4 2
5 2
6 3
7 3
8 3
9 3
10 3
11 3
12 4
13 4
14 4
In [8]:
df[df['a'].isin(df['a'].value_counts()[df['a'].value_counts()>2].index)]
Out[8]:
a
0 0
1 0
2 0
6 3
7 3
8 3
9 3
10 3
11 3
12 4
13 4
14 4
So breaking the above down:
In [9]:
df['a'].value_counts() > 2
Out[9]:
3 True
4 True
0 True
2 False
1 False
Name: a, dtype: bool
In [10]:
# construct a boolean mask
df['a'].value_counts()[df['a'].value_counts()>2]
Out[10]:
3 6
4 3
0 3
Name: a, dtype: int64
In [11]:
# we're interested in the index here, pass this to isin
df['a'].value_counts()[df['a'].value_counts()>2].index
Out[11]:
Int64Index([3, 4, 0], dtype='int64')
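The steps above hard-code the threshold as > 2 and call value_counts twice. A minimal sketch that computes the counts once and takes the threshold as a parameter, assuming "at least n times" means a count >= n. The helper name filter_min_count is hypothetical and not part of pandas:

```python
import pandas as pd

def filter_min_count(df, column, n):
    """Keep rows whose value in `column` occurs at least n times.

    Hypothetical helper for illustration; not a pandas API.
    """
    counts = df[column].value_counts()      # occurrences of each distinct value
    frequent = counts[counts >= n].index    # values that meet the threshold
    return df[df[column].isin(frequent)]    # rows whose value is in that set

df = pd.DataFrame({'a': [0, 0, 0, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4]})
result = filter_min_count(df, 'a', 3)
```

With n=3 this selects the same rows as the > 2 comparison used in the session above.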
Edit
As user @JonClements suggested, a simpler and faster method would be to groupby on the column of interest and filter it:
In [4]:
df.groupby('a').filter(lambda x: len(x) > 2)
Out[4]:
a
0 0
1 0
2 0
6 3
7 3
8 3
9 3
10 3
11 3
12 4
13 4
14 4
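A possible variation on the groupby approach, not from the original answer: transform('size') broadcasts each group's size back onto its rows, which gives a boolean mask without a Python lambda per group. Whether it is actually faster on a given dataset is worth measuring rather than assuming:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4]})

# Size of the group each row belongs to, aligned with df's index
sizes = df.groupby('a')['a'].transform('size')

# Keep rows whose group has more than 2 members
result = df[sizes > 2]
```

The result keeps the original row order and index, the same as groupby(...).filter(...).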
Edit 2
To get just a single entry for each repeated item, call drop_duplicates and pass the parameter subset='a':
In [2]:
df.groupby('a').filter(lambda x: len(x) > 2).drop_duplicates(subset='a')
Out[2]:
a
0 0
6 3
12 4
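Applied to the three-column frame described in the question, the same chain might look like the sketch below. The row values are made up for illustration (only the column names come from the question), and "at least n times" is taken to mean a count >= n:

```python
import pandas as pd

# Made-up data; only the column names colA, colB, colC come from the question
df = pd.DataFrame({
    'colA': [1, 2, 3, 4, 5, 6],
    'colB': ['x', 'y', 'x', 'z', 'x', 'y'],
    'colC': [10, 20, 30, 40, 50, 60],
})

n = 2  # assumed threshold for this example

result = (
    df.groupby('colB')
      .filter(lambda g: len(g) >= n)      # keep groups with at least n rows
      .drop_duplicates(subset='colB')     # then keep the first row of each
)
```

Here 'z' occurs only once and is dropped, while 'x' and 'y' each appear once in the result. drop_duplicates keeps the first occurrence by default, so colA and colC come from the first row of each group.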