Drop rows corresponding to groups smaller than specified size
Question
I have a DataFrame of answers for 100 question_ids and 50 user_ids. Each row represents a single question answered by a specific user. The table looks something like this.
user_id | question_id | timetaken | answer_1 | answer_2 |
1015 | 1 | 30 | A | C |
1015 | 2 | 45 | B | B |
1016 | 1 | 15 | A | A |
1016 | 2 | 55 | A | D |
I am trying to filter out the users who did not complete the test. My idea was to count how many times each user appears in the table: if user_id 1015 appears in the user_id column 100 times, I know they completed all 100 questions. Unfortunately, I cannot filter on question_id, because the questions are served in random order, so a user could answer only 5 questions and one of them could have question_id = 100.
I thought this was my solution, but I couldn't work out how to count the occurrences of user_id.
Recommended answer
Use groupby and filter, which is succinct and intended for exactly this purpose.
# keep only the users who have at least 100 answer rows
df1 = df.groupby('user_id').filter(lambda x: len(x) >= 100)
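To see the groupby/filter pattern run end to end, here is a minimal sketch on a toy frame modeled on the question's table. The extra user 1017, the REQUIRED constant, and the threshold of 2 (standing in for the real 100) are assumptions made for illustration only.

```python
import pandas as pd

# Toy data modeled on the question's table; user 1017 is a made-up incomplete user.
df = pd.DataFrame({
    "user_id": [1015, 1015, 1016, 1016, 1017],
    "question_id": [1, 2, 1, 2, 2],
    "timetaken": [30, 45, 15, 55, 20],
})

REQUIRED = 2  # assumed stand-in for the 100 questions in the real test

# filter() calls the lambda once per group and keeps the rows of every
# group for which it returns True.
complete = df.groupby("user_id").filter(lambda g: len(g) >= REQUIRED)

# Users that survived the filter.
kept_users = sorted(complete["user_id"].unique().tolist())
print(kept_users)
```

Rows keep their original index and order, so the result can be used as a drop-in replacement for the unfiltered frame.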
For better performance, use np.unique and map:
import numpy as np
# map each user_id to its row count, then keep users with at least 100 rows
m = dict(zip(*np.unique(df.user_id, return_counts=True)))
df[df['user_id'].map(m) >= 100]
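The counting approach can be sketched on the same kind of toy data. The sample rows, the REQUIRED constant, and the threshold of 2 are assumptions for illustration. The transform("size") variant is an alternative that is not part of the original answer, included only to show that a pure-pandas spelling selects the same rows.

```python
import numpy as np
import pandas as pd

# Toy data; user 1017 is a made-up user with a single answer.
df = pd.DataFrame({
    "user_id": [1015, 1015, 1016, 1016, 1017],
    "question_id": [1, 2, 1, 2, 2],
})

REQUIRED = 2  # assumed stand-in for the 100 questions in the real test

# np.unique with return_counts=True returns (unique values, counts per value).
users, counts = np.unique(df["user_id"], return_counts=True)
count_by_user = dict(zip(users, counts))

# map() looks up each row's user_id in the dict, giving a per-row count.
via_map = df[df["user_id"].map(count_by_user) >= REQUIRED]

# Alternative: transform("size") broadcasts each group's size back to its rows.
group_sizes = df.groupby("user_id")["user_id"].transform("size")
via_transform = df[group_sizes >= REQUIRED]
```

Both variants build a boolean mask aligned with the original rows instead of calling a Python lambda per group, which is where the speedup over filter would be expected to come from on frames with many groups.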