Pandas DataFrame groupby overlapping intervals of variable length
Question
I am trying to group a DataFrame by 2 columns (see the example below). For the first column, I want each distinct value to form its own group. For the second column, I want to group by overlapping intervals of unequal size.
My understanding is that pd.cut() only lets me group by non-overlapping intervals.
Here is an example:
    0  1     2
0   0  4  1721
1   0  5  2353
2   0  6    58
3   0  7   524
4   1  1  1934
5   1  2  1318
6   1  2  1307
7   1  2   301
8   1  2   502
9   1  3   996
10  1  3    32
Grouping by columns 0 and 1, I want:
0  1      2
0  [4,5]  [1721,2353]
   [5,6]  [2353,58]
   [6,7]  [58,524]
1  [1,2]  [1934,1318,1307,301,502]
   [2,3]  [1318,1307,301,502,996,32]
I would then take the mean or standard deviation of column 2. Any suggestions? Thanks!
Recommended answer
Starting with:
    gr1  gr2   val
0     0    4  1721
1     0    5  2353
2     0    6    58
3     0    7   524
4     1    1  1934
5     1    2  1318
6     1    2  1307
7     1    2   301
8     1    2   502
9     1    3   996
10    1    3    32
First, create bins from the values in gr2:
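import pandas as pd
# Consecutive pairs of the sorted unique gr2 values form the overlapping bins.
bounds = df.gr2.sort_values().unique()
bins = list(zip(bounds[:-1], bounds[1:]))
def overlapping_bins(x):
    # Return every bin whose closed interval [low, high] contains x.
    return pd.Series([l for l in bins if l[0] <= x <= l[1]])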
Then assign the val values to the bins:
df = pd.concat([df, df.gr2.apply(overlapping_bins).stack().reset_index(1, drop=True)], axis=1).rename(columns={0: 'bins'}).drop('gr2', axis=1)
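On recent pandas releases the pd.concat call above may raise an error, because the stacked Series repeats index labels. Below is a minimal sketch of the same assignment step that avoids the repeated index, assuming pandas 1.2 or later for merge(how="cross"). The names bin_table, pairs, binned, lo and hi are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gr1": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
    "gr2": [4, 5, 6, 7, 1, 2, 2, 2, 2, 3, 3],
    "val": [1721, 2353, 58, 524, 1934, 1318, 1307, 301, 502, 996, 32],
})

# Consecutive pairs of sorted unique gr2 values, stored as two plain columns
# (hypothetical names lo/hi) instead of one tuple-valued column.
bounds = np.sort(df["gr2"].unique())
bin_table = pd.DataFrame({"lo": bounds[:-1], "hi": bounds[1:]})

# Pair every row with every bin, then keep the pairs where lo <= gr2 <= hi.
pairs = df.merge(bin_table, how="cross")
pairs = pairs[(pairs["lo"] <= pairs["gr2"]) & (pairs["gr2"] <= pairs["hi"])]

# A row whose gr2 sits on a shared boundary now appears once per bin.
binned = pairs.groupby(["gr1", "lo", "hi"])["val"].apply(list)
print(binned)
```

Grouping by ["gr1", "lo", "hi"] plays the same role as grouping by ['gr1', 'bins'] in the answer. Swapping apply(list) for agg(["mean", "std"]) should give the statistics the question asks for.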
And then .groupby() the resulting bins:
df.groupby(['gr1', 'bins']).val.apply(lambda x: x.tolist())
gr1  bins
0    (3, 4)    [1721]
     (4, 5)    [1721, 2353]
     (5, 6)    [2353, 58]
     (6, 7)    [58, 524]
1    (1, 2)    [1934, 1318, 1307, 301, 502]
     (2, 3)    [1318, 1307, 301, 502, 996, 32]
     (3, 4)    [996, 32]
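The bins above are built from all gr2 values at once, so both gr1 groups also pick up a (3, 4) bin that the question's desired output does not include. If the bins should instead be built separately inside each gr1 group, one possible sketch follows. The explicit loop and the names rows, per_group, lo and hi are assumptions for illustration, not the answer's method.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gr1": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
    "gr2": [4, 5, 6, 7, 1, 2, 2, 2, 2, 3, 3],
    "val": [1721, 2353, 58, 524, 1934, 1318, 1307, 301, 502, 996, 32],
})

rows = []
for key, group in df.groupby("gr1"):
    # Bin edges come only from the gr2 values present in this gr1 group.
    bounds = np.sort(group["gr2"].unique())
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        # between() is inclusive on both ends, so adjacent bins overlap
        # on their shared boundary value.
        vals = group.loc[group["gr2"].between(lo, hi), "val"]
        rows.append({"gr1": key, "lo": lo, "hi": hi,
                     "mean": vals.mean(), "std": vals.std()})

per_group = pd.DataFrame(rows).set_index(["gr1", "lo", "hi"])
print(per_group)
```

This yields one row per (gr1, interval) pair, with the five intervals listed in the question, and reports the mean and sample standard deviation of val for each.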