在 pandas 数据框中查找连续的片段 [英] Finding consecutive segments in a pandas data frame
问题描述
我有一个pandas.DataFrame,它在连续的时间点进行了测量.与每次测量一起,被观察的系统在每个时间点都有不同的状态.因此,DataFrame还包含一列,其中包含每次测量时系统的状态.状态更改比测量间隔慢得多.结果,指示状态的列可能看起来像这样(索引:状态):
I have a pandas.DataFrame with measurements taken at consecutive points in time. Along with each measurement the system under observation had a distinct state at each point in time. Hence, the DataFrame also contains a column with the state of the system at each measurement. State changes are much slower than the measurement interval. As a result, the column indicating the states might look like this (index: state):
1: 3
2: 3
3: 3
4: 3
5: 4
6: 4
7: 4
8: 4
9: 1
10: 1
11: 1
12: 1
13: 1
是否有一种简单的方法来检索连续相等状态的每个段的索引.那意味着我想得到这样的东西:
Is there an easy way to retrieve the indices of each segment of consecutively equal states. That means I would like to get something like this:
[[1,2,3,4], [5,6,7,8], [9,10,11,12,13]]
结果也可能与普通列表有所不同.
The result might also be in something different than plain lists.
到目前为止,我唯一想到的解决方案是手动遍历行,查找段更改点并从这些更改点重建索引,但是我希望有一个更简单的解决方案.
The only solution I could think of so far is manually iterating over the rows, finding segment change points and reconstructing the indices from these change points, but I have the hope that there is an easier solution.
推荐答案
单线:
df.reset_index().groupby('A')['index'].apply(np.array)
例如代码:
In [1]: import numpy as np
In [2]: from pandas import *
In [3]: df = DataFrame([3]*4+[4]*4+[1]*4, columns=['A'])
In [4]: df
Out[4]:
A
0 3
1 3
2 3
3 3
4 4
5 4
6 4
7 4
8 1
9 1
10 1
11 1
In [5]: df.reset_index().groupby('A')['index'].apply(np.array)
Out[5]:
A
1 [8, 9, 10, 11]
3 [0, 1, 2, 3]
4 [4, 5, 6, 7]
您还可以直接从groupby对象访问信息:
You can also directly access the information from the groupby object:
In [1]: grp = df.groupby('A')
In [2]: grp.indices
Out[2]:
{1L: array([ 8, 9, 10, 11], dtype=int64),
3L: array([0, 1, 2, 3], dtype=int64),
4L: array([4, 5, 6, 7], dtype=int64)}
In [3]: grp.indices[3]
Out[3]: array([0, 1, 2, 3], dtype=int64)
要解决DSM提到的情况,您可以执行以下操作:
To address the situation that DSM mentioned you could do something like:
In [1]: df['block'] = (df.A.shift(1) != df.A).astype(int).cumsum()
In [2]: df
Out[2]:
A block
0 3 1
1 3 1
2 3 1
3 3 1
4 4 2
5 4 2
6 4 2
7 4 2
8 1 3
9 1 3
10 1 3
11 1 3
12 3 4
13 3 4
14 3 4
15 3 4
现在对这两列进行分组,并应用lambda函数:
Now groupby both columns and apply the lambda function:
In [77]: df.reset_index().groupby(['A','block'])['index'].apply(np.array)
Out[77]:
A block
1 3 [8, 9, 10, 11]
3 1 [0, 1, 2, 3]
4 [12, 13, 14, 15]
4 2 [4, 5, 6, 7]
这篇关于在 pandas 数据框中查找连续的片段的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!