根据行数切片 pandas 数据帧 [英] Slicing Pandas Dataframe according to number of lines
问题描述
假设我有一个这样的数据框df:
组Id_In_Group SomeQuantity
1 1 10
1 2 20
2 1 7
3 1 16
3 2 22
3 3 5
3 4 12
3 5 28
4 1 1
4 2 18
4 3 14
4 4 7
5 1 36
我想只选择组中至少有4个对象的行(所以在那里至少有4行具有相同的组数字),并且对于第4个对象的SomeQuantity,当通过升序SomeQuantity在组中排序时,大于20(例如)。
在给定的Dataframe中,例如,只会返回第3组,因为i t有4(> = 4)个成员,其第4个SomeQuantity(排序后)是22(> = 20),因此应该构建数据框:
组Id_In_Group SomeQuantity
3 1 16
3 2 22
3 3 5
3 4 12
3 5 28
(有或没有按SomeQuantity排序,无论如何)。
有人可以帮我吗? :
使用 map
, value_counts
, groupby
,过滤器
:
(df [df.Group.map(df.Group.value_counts()。ge(4))]
.groupby('Group')
.filter(lambda x:np.any(x ['SomeQuantity']。sort_values()。iloc [3]> = 20)))
步骤细目:
执行 value_counts
来计算组列中存在的不同元素的总计数。
>>> df.Group.value_counts()
3 5
4 4
1 2
5 1
2 1
名称:Group,dtype :int64
使用 map
字典(其中索引成为键并且序列元素成为值)将这些结果映射回原始的 DF
>>> df.Group.map(df.Group.value_counts())
0 2
1 2
2 1
3 5
4 5
5 5
6 5
7 5
8 4
9 4
10 4
11 4
12 1
名称:组,dtype:int64
然后,我们检查值为4以上的元素这是我们的阈值限制,仅从整个 DF
中的子集。
>>> df [df.Group.map(df.Group.value_counts()。ge(4))]
组Id_In_Group SomeQuantity
3 3 1 16
4 3 2 22
5 3 3 5
6 3 4 12
7 3 5 28
8 4 1 1
9 4 2 28
10 4 3 14
11 4 4 7
为了使用 groupby.filter
操作,我们必须确保我们返回一个对应于每个分组键的单个布尔值,当我们执行排序过程并将第四个元素与20的阈值进行比较。
np.any
返回与我们的过滤器匹配的所有这些可能性。
>>> df [df.Group.map(df.Group.value_counts()。ge(4))] \
.groupby('Group')。apply(lambda x:x ['SomeQuantity']。sort_values ).iloc [3])
组
3 22
4 18
dtype:int64
从这些,我们比较第四个元素 .iloc [3]
,因为它是基于0的索引,并返回所有这样有利的比赛。
I suppose this is something rather simple, but I can't find how to make this. I've been searching tutorials and stackoverflow.
Suppose I have a dataframe df loking like this :
Group Id_In_Group SomeQuantity
1 1 10
1 2 20
2 1 7
3 1 16
3 2 22
3 3 5
3 4 12
3 5 28
4 1 1
4 2 18
4 3 14
4 4 7
5 1 36
I would like to select only the lines having at least 4 objects in the group (so there are at least 4 rows having the same "group" number) and for which SomeQuantity for the 4th object, when sorted in the group by ascending SomeQuantity, is greater than 20 (for example).
In the given Dataframe, for example, it would only return the 3rd group, since it has 4 (>=4) members and its 4th SomeQuantity (after sorting) is 22 (>=20), so it should construct the dataframe :
Group Id_In_Group SomeQuantity
3 1 16
3 2 22
3 3 5
3 4 12
3 5 28
(being or not sorted by SomeQuantity, whatever).
Could somebody be kind enough to help me? :)
A slightly different approach using map
, value_counts
, groupby
, filter
:
(df[df.Group.map(df.Group.value_counts().ge(4))]
.groupby('Group')
.filter(lambda x: np.any(x['SomeQuantity'].sort_values().iloc[3] >= 20)))
Breakdown of steps:
Perform value_counts
to compute the total counts of distinct elements present in Group column.
>>> df.Group.value_counts()
3 5
4 4
1 2
5 1
2 1
Name: Group, dtype: int64
Use map
which functions like a dictionary (wherein the index becomes the keys and the series elements become the values) to map these results back to the original DF
>>> df.Group.map(df.Group.value_counts())
0 2
1 2
2 1
3 5
4 5
5 5
6 5
7 5
8 4
9 4
10 4
11 4
12 1
Name: Group, dtype: int64
Then, we check for the elements having a value of 4 or more which is our threshold limit and take only those subset from the entire DF
.
>>> df[df.Group.map(df.Group.value_counts().ge(4))]
Group Id_In_Group SomeQuantity
3 3 1 16
4 3 2 22
5 3 3 5
6 3 4 12
7 3 5 28
8 4 1 1
9 4 2 28
10 4 3 14
11 4 4 7
Inorder to use groupby.filter
operation on this, we must make sure that we return a single boolean value corresponding to each grouped key when we perform the sorting process and compare the fourth element to the threshold which is 20.
np.any
returns all such possiblities matching our filter.
>>> df[df.Group.map(df.Group.value_counts().ge(4))] \
.groupby('Group').apply(lambda x: x['SomeQuantity'].sort_values().iloc[3])
Group
3 22
4 18
dtype: int64
From these, we compare the fourth element .iloc[3]
as it is 0-based indexed and return all such favourable matches.
这篇关于根据行数切片 pandas 数据帧的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!