Group nearby column values in Python Dataframe

Question

I have a DataFrame with n columns and m rows. I want to group its rows by the values in one column ('x'), not by exact match but by proximity: rows whose 'x' values are close to each other should fall into the same group. For example, my DataFrame looks like this:

      y    yh     x    xw       w   Nxt
0   2987  3129   347  2092  1735.0   501
1   2715  2847   501  1725  1224.0   492
2   2419  2716   490  2196  1704.0   492
3   2310  2373   492   794   302.0   886
4   2309  2370   886  1012   126.0   492
5   2198  2261   497   791   299.0   886
6   2197  2258   886  1010   124.0   492
7   1663  2180   375  1092   600.0  1323

In the DataFrame above, rows whose 'x' values differ by no more than 20 should be grouped into a new DataFrame; the remaining rows can be set aside. Here the rows at index 1, 2, 3, 5 form one group and the rows at index 4, 6 form another, because their 'x' values are within 20 of each other. My expected output is three DataFrames: df1 holds the first group of rows, df2 holds the second group, and df3 holds the rest of the rows, as follows:

df1:

      y    yh     x    xw       w   Nxt
1   2715  2847   501  1725  1224.0   492
2   2419  2716   490  2196  1704.0   492
3   2310  2373   492   794   302.0   886
5   2198  2261   497   791   299.0   886

df2:

      y    yh     x    xw       w   Nxt
4   2309  2370   886  1012   126.0   492
6   2197  2258   886  1010   124.0   492

df3:

      y    yh     x    xw       w   Nxt
0   2987  3129   347  2092  1735.0   501
7   1663  2180   375  1092   600.0  1323

I tried groupby-apply and groupby-transform but couldn't get it to work. Any help getting this expected output would be much appreciated. Thanks in advance.

Recommended answer

To group the values in column 'x' that lie within 20 of each other, you can sort by 'x', then use shift to mark every row where the gap to the previous row is more than 20, and store the result in a new column named 'group'.

df = df.sort_values('x')
# flag every row whose 'x' is more than 20 above the previous row's 'x'
df.loc[(df.x.shift() < df.x - 20), 'group'] = 1
# cumsum, ffill and fillna complete the 'group' column so each group gets its own number
df['group'] = df['group'].cumsum().ffill().fillna(0)
# if the original index order matters, add df = df.sort_index() here; the code after is the same

With your input, you get:

      y    yh    x    xw       w   Nxt  group
0  2987  3129  347  2092  1735.0   501    0.0
7  1663  2180  375  1092   600.0  1323    1.0
2  2419  2716  490  2196  1704.0   492    2.0
3  2310  2373  492   794   302.0   886    2.0
5  2198  2261  497   791   299.0   886    2.0
1  2715  2847  501  1725  1224.0   492    2.0
4  2309  2370  886  1012   126.0   492    3.0
6  2197  2258  886  1010   124.0   492    3.0
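
For readers who want to reproduce this table, here is a minimal self-contained sketch. It rebuilds the DataFrame from the question and uses diff() in place of the shift comparison above; the two mark the same rows, but this variant produces integer labels instead of floats. The constant name THRESHOLD is an assumption introduced for illustration.

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame(
    {
        "y": [2987, 2715, 2419, 2310, 2309, 2198, 2197, 1663],
        "yh": [3129, 2847, 2716, 2373, 2370, 2261, 2258, 2180],
        "x": [347, 501, 490, 492, 886, 497, 886, 375],
        "xw": [2092, 1725, 2196, 794, 1012, 791, 1010, 1092],
        "w": [1735.0, 1224.0, 1704.0, 302.0, 126.0, 299.0, 124.0, 600.0],
        "Nxt": [501, 492, 492, 886, 492, 886, 492, 1323],
    }
)

THRESHOLD = 20  # hypothetical name for the maximum gap inside one group

df = df.sort_values("x")
# diff() is NaN on the first row, and NaN > THRESHOLD is False,
# so the first row starts group 0; every larger gap starts a new group.
df["group"] = df["x"].diff().gt(THRESHOLD).cumsum()

print(df)
```

Note that this compares each row only with its neighbour in sorted order, so a chain of values such as 0, 15, 30 ends up in a single group even though the first and last differ by more than 20.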

Now you can build a list with one DataFrame per group that has more than one row. Use groupby on 'group' and filter to keep the groups whose length is greater than 1. Finally, append all the groups of length 1 as a single DataFrame:

list_df = [df_g for name_g, df_g in df.groupby('group').filter(lambda x: len(x)>1).groupby('group')] +\
            [df.groupby('group').filter(lambda x: len(x)==1)]

Each element of the list is then one of the DataFrames you want, for example:

print(list_df[0])
      y    yh    x    xw       w  Nxt  group
2  2419  2716  490  2196  1704.0  492    2.0
3  2310  2373  492   794   302.0  886    2.0
5  2198  2261  497   791   299.0  886    2.0
1  2715  2847  501  1725  1224.0  492    2.0

print(list_df[-1])
      y    yh    x    xw       w   Nxt  group
0  2987  3129  347  2092  1735.0   501    0.0
7  1663  2180  375  1092   600.0  1323    1.0

I see you want a name for each one, but I think they will be easier to access if they are in a list.
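
If named DataFrames are still preferred, one possible variation is to collect the multi-row groups in a dict keyed by name. The sketch below is an assumption-laden alternative, not the answer's own code: the keys "df1", "df2" and the variable names sizes, grouped and rest are made up for illustration, and only the 'x' column from the question is used to keep it short.

```python
import pandas as pd

# Only the 'x' column from the question, to keep the sketch short
df = pd.DataFrame({"x": [347, 501, 490, 492, 886, 497, 886, 375]})
df = df.sort_values("x")
df["group"] = df["x"].diff().gt(20).cumsum()

# Size of the group each row belongs to, aligned with df's index
sizes = df.groupby("group")["x"].transform("size")

# One named DataFrame per group with more than one row (names are hypothetical)
grouped = {
    f"df{i}": g
    for i, (_, g) in enumerate(df[sizes > 1].groupby("group"), start=1)
}
# All single-row groups together as the leftover DataFrame
rest = df[sizes == 1]

print(grouped["df1"])
print(rest)
```

Using transform("size") computes the group sizes once, whereas the list version above runs groupby(...).filter(...) twice.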
