Python pandas - how to group close elements


Problem description

I have a DataFrame in which I need to group elements that are no more than 1 apart. For example, if this is my df:

     group_number  val
0              1    5
1              1    8
2              1   12
3              1   13
4              1   22
5              1   26
6              1   31
7              2    7
8              2   16
9              2   17
10             2   19
11             2   29
12             2   33
13             2   62

So I need to group by both group_number and val, where values of val belong to the same group if they differ by 1 or less.

In this example, rows 2 and 3 would be grouped together, and so would rows 8 and 9.

I tried using diff or related functions, but I didn't figure it out.

Any help will be appreciated!

Solution

Using diff is the right approach - just combine it with gt and cumsum and you have your groups.

The idea is to take a cumulative sum over the differences that exceed your threshold. A difference larger than the threshold becomes True, while a difference equal to or below it becomes False. The first row of each group has no predecessor, so its difference is NaN, which also compares as False. Cumulatively summing the boolean values increments the running total at every True and leaves it unchanged at every False, so a row whose gap to the previous value is within the threshold gets the same group number as that previous row.
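To make the three steps concrete, here is a minimal sketch that applies them one at a time to a single group. The sample values are the group 1 values from the question; the column names `diff`, `new_group` and `group_id` are only illustrative.

```python
import pandas as pd

# Sample values: the "val" column of group 1 from the question.
val = pd.Series([5, 8, 12, 13, 22, 26, 31])
max_distance = 1

steps = pd.DataFrame({"val": val})
# Gap to the previous value; NaN for the first row.
steps["diff"] = val.diff()
# True where the gap exceeds the threshold; NaN compares as False.
steps["new_group"] = steps["diff"].gt(max_distance)
# Running count of True values: rows with the same count belong together.
steps["group_id"] = steps["new_group"].cumsum()

print(steps)
```

Only the row with val 13 has a gap within the threshold, so it is the only row that inherits the group id of its predecessor.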

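# Largest gap between neighboring values that still counts as "close"
max_distance = 1

# Sort by val, take the gap to the previous value within each group_number,
# flag gaps above the threshold, and keep a running count of the flags.
# The assignment realigns the result with df on the index.
df["group_diff"] = df.sort_values("val")\
                     .groupby("group_number")["val"]\
                     .diff()\
                     .gt(max_distance)\
                     .cumsum()

print(df)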

    group_number    val group_diff
0   1               5   0
1   1               8   1
2   1               12  2
3   1               13  2
4   1               22  5
5   1               26  6
6   1               31  8
7   2               7   0
8   2               16  3
9   2               17  3
10  2               19  4
11  2               29  7
12  2               33  9
13  2               62  10

You can now use groupby on group_number and group_diff and see the resulting groups with the following:

grouped = df.groupby(["group_number", "group_diff"])
print(grouped.groups)

{(1, 0): Int64Index([0], dtype='int64'),
 (1, 1): Int64Index([1], dtype='int64'),
 (1, 2): Int64Index([2, 3], dtype='int64'),
 (1, 5): Int64Index([4], dtype='int64'),
 (1, 6): Int64Index([5], dtype='int64'),
 (1, 8): Int64Index([6], dtype='int64'),
 (2, 0): Int64Index([7], dtype='int64'),
 (2, 3): Int64Index([8, 9], dtype='int64'),
 (2, 4): Int64Index([10], dtype='int64'),
 (2, 7): Int64Index([11], dtype='int64'),
 (2, 9): Int64Index([12], dtype='int64'),
 (2, 10): Int64Index([13], dtype='int64')}
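If the values in each group are more useful than the row labels, one option is to aggregate `val` into lists. The sketch below rebuilds the example frame from the question so that it runs on its own; the name `members` is only illustrative.

```python
import pandas as pd

# Example data from the question.
df = pd.DataFrame({
    "group_number": [1] * 7 + [2] * 7,
    "val": [5, 8, 12, 13, 22, 26, 31, 7, 16, 17, 19, 29, 33, 62],
})
max_distance = 1

# Same chain as above, written with parentheses instead of backslashes.
df["group_diff"] = (
    df.sort_values("val")
      .groupby("group_number")["val"]
      .diff()
      .gt(max_distance)
      .cumsum()
)

# Collect the values of each (group_number, group_diff) pair into a list.
members = df.groupby(["group_number", "group_diff"])["val"].agg(list)
print(members)
```

The entries for (1, 2) and (2, 3) hold two values each, which are the pairs of close rows from the question.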

Thanks to @jezrael for the hint that you can skip the new column and pass the Series directly to groupby, which can improve performance:

group_diff = df.sort_values("val")\
               .groupby("group_number")["val"]\
               .diff()\
               .gt(max_distance)\
               .cumsum()

grouped = df.groupby(["group_number", group_diff])
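One thing worth checking on your own data: the cumulative sum runs over the whole val-sorted Series, not separately per group_number. If a value from another group falls between two close values and itself starts a new run, it bumps the counter and the two close values end up with different ids. The sketch below shows this on hypothetical data and one possible way around it, sorting by both columns so that each group_number stays contiguous. The helper `close_ids` and the sample values are assumptions for illustration, not part of the original answer.

```python
import pandas as pd

# Hypothetical data: group 2's value 12.5 falls between
# group 1's close values 12 and 13.
df = pd.DataFrame({
    "group_number": [1, 1, 2, 2],
    "val": [12.0, 13.0, 5.0, 12.5],
})
max_distance = 1


def close_ids(frame, sort_keys):
    """Label runs of close values; sort_keys sets the order cumsum runs in."""
    return (
        frame.sort_values(sort_keys)
             .groupby("group_number")["val"]
             .diff()
             .gt(max_distance)
             .cumsum()
    )


# Ordering used in the answer: rows 0 and 1 are split apart.
by_val = close_ids(df, "val")
# Sorting by group first keeps each group contiguous: rows 0 and 1 share an id.
by_group_then_val = close_ids(df, ["group_number", "val"])

print(by_val.sort_index().tolist())
print(by_group_then_val.sort_index().tolist())

grouped = df.groupby(["group_number", by_group_then_val])
```

With the second ordering, the first row of a group can share an id with the last row of the previous group, which is harmless as long as group_number is also part of the groupby key.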
