根据行数切片 pandas 数据帧 [英] Slicing Pandas Dataframe according to number of lines

查看:68
本文介绍了根据行数切片 pandas 数据帧的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想这是一件很简单的事,但是我找不到如何做到这一点。我一直在搜索教程和stackoverflow。



假设我有一个这样的数据框df:

 组Id_In_Group SomeQuantity 
1 1 10
1 2 20
2 1 7
3 1 16
3 2 22
3 3 5
3 4 12
3 5 28
4 1 1
4 2 18
4 3 14
4 4 7
5 1 36

我想只选择组中至少有4个对象的行(所以在那里至少有4行具有相同的组数字),并且对于第4个对象的SomeQuantity,当通过升序SomeQuantity在组中排序时,大于20(例如)。



在给定的Dataframe中,例如,只会返回第3组,因为i t有4(> = 4)个成员,其第4个SomeQuantity(排序后)是22(> = 20),因此应该构建数据框:

 组Id_In_Group SomeQuantity 
3 1 16
3 2 22
3 3 5
3 4 12
3 5 28

(有或没有按SomeQuantity排序,无论如何)。



有人可以帮我吗? :

解决方案

使用 map value_counts groupby 过滤器

 (df [df.Group.map(df.Group.value_counts()。ge(4))] 
.groupby('Group')
.filter(lambda x:np.any(x ['SomeQuantity']。sort_values()。iloc [3]> = 20)))






步骤细目:



执行 value_counts 来计算列中存在的不同元素的总计数。

 >>> df.Group.value_counts()

3 5
4 4
1 2
5 1
2 1
名称:Group,dtype :int64

使用 map 字典(其中索引成为键并且序列元素成为值)将这些结果映射回原始的 DF

 >>> df.Group.map(df.Group.value_counts())

0 2
1 2
2 1
3 5
4 5
5 5
6 5
7 5
8 4
9 4
10 4
11 4
12 1
名称:组,dtype:int64

然后,我们检查值为4以上的元素这是我们的阈值限制,仅从整个 DF 中的子集。

 >>> df [df.Group.map(df.Group.value_counts()。ge(4))] 

组Id_In_Group SomeQuantity
3 3 1 16
4 3 2 22
5 3 3 5
6 3 4 12
7 3 5 28
8 4 1 1
9 4 2 28
10 4 3 14
11 4 4 7

为了使用 groupby.filter 操作,我们必须确保我们返回一个对应于每个分组键的单个布尔值,当我们执行排序过程并将第四个元素与20的阈值进行比较。
np.any 返回与我们的过滤器匹配的所有这些可能性。

 >>> df [df.Group.map(df.Group.value_counts()。ge(4))] \ 
.groupby('Group')。apply(lambda x:x ['SomeQuantity']。sort_values ).iloc [3])


3 22
4 18
dtype:int64

从这些,我们比较第四个元素 .iloc [3] ,因为它是基于0的索引,并返回所有这样有利的比赛。


I suppose this is something rather simple, but I can't find how to make this. I've been searching tutorials and stackoverflow.

Suppose I have a dataframe df loking like this :

Group   Id_In_Group   SomeQuantity
1        1              10
1        2              20
2        1               7
3        1              16
3        2              22
3        3               5
3        4              12
3        5              28
4        1               1
4        2              18
4        3              14
4        4               7
5        1              36

I would like to select only the lines having at least 4 objects in the group (so there are at least 4 rows having the same "group" number) and for which SomeQuantity for the 4th object, when sorted in the group by ascending SomeQuantity, is greater than 20 (for example).

In the given Dataframe, for example, it would only return the 3rd group, since it has 4 (>=4) members and its 4th SomeQuantity (after sorting) is 22 (>=20), so it should construct the dataframe :

Group   Id_In_Group   SomeQuantity
3        1              16
3        2              22
3        3               5
3        4              12
3        5              28

(being or not sorted by SomeQuantity, whatever).

Could somebody be kind enough to help me? :)

解决方案

A slightly different approach using map, value_counts, groupby , filter:

(df[df.Group.map(df.Group.value_counts().ge(4))]
   .groupby('Group')
   .filter(lambda x: np.any(x['SomeQuantity'].sort_values().iloc[3] >= 20)))


Breakdown of steps:

Perform value_counts to compute the total counts of distinct elements present in Group column.

>>> df.Group.value_counts()

3    5
4    4
1    2
5    1
2    1
Name: Group, dtype: int64

Use map which functions like a dictionary (wherein the index becomes the keys and the series elements become the values) to map these results back to the original DF

>>> df.Group.map(df.Group.value_counts())

0     2
1     2
2     1
3     5
4     5
5     5
6     5
7     5
8     4
9     4
10    4
11    4
12    1
Name: Group, dtype: int64

Then, we check for the elements having a value of 4 or more which is our threshold limit and take only those subset from the entire DF.

>>> df[df.Group.map(df.Group.value_counts().ge(4))]   

    Group  Id_In_Group  SomeQuantity
3       3            1            16
4       3            2            22
5       3            3             5
6       3            4            12
7       3            5            28
8       4            1             1
9       4            2            28
10      4            3            14
11      4            4             7

Inorder to use groupby.filter operation on this, we must make sure that we return a single boolean value corresponding to each grouped key when we perform the sorting process and compare the fourth element to the threshold which is 20. np.any returns all such possiblities matching our filter.

>>> df[df.Group.map(df.Group.value_counts().ge(4))]         \
      .groupby('Group').apply(lambda x: x['SomeQuantity'].sort_values().iloc[3])

 Group
3    22
4    18
dtype: int64

From these, we compare the fourth element .iloc[3] as it is 0-based indexed and return all such favourable matches.

这篇关于根据行数切片 pandas 数据帧的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆