Pandas - Find and index rows that match row sequence pattern
I would like to find a pattern in a categorical column of a DataFrame, reading down the rows. I can see how to use Series.shift() to look up or down and combine the results with boolean logic to find the pattern. However, I want to do this within a grouping variable and also label all rows that are part of the pattern, not just the starting row.
Code:
import pandas as pd
from numpy.random import choice, randn
import string
# df constructor
n_rows = 1000
df = pd.DataFrame({'date_time': pd.date_range('2/9/2018', periods=n_rows, freq='H'),
                   'group_var': choice(list(string.ascii_uppercase), n_rows),
                   'row_pat': choice([0, 1, 2, 3], n_rows),
                   'values': randn(n_rows)})
# sorting
df.sort_values(by=['group_var', 'date_time'], inplace=True)
df.head(10)
Which returns this:
I can find the start of the pattern (though without any grouping) like this:
# the row ordinal pattern to detect
p0, p1, p2, p3 = 1, 2, 2, 0
# flag the row at the start of the pattern
df['pat_flag'] = (
    df['row_pat'].eq(p0) &
    df['row_pat'].shift(-1).eq(p1) &
    df['row_pat'].shift(-2).eq(p2) &
    df['row_pat'].shift(-3).eq(p3)
)
df.head(10)
What I can't figure out is how to do this only within each "group_var" group and, instead of returning True only for the start of the pattern, return True for all rows that are part of the pattern.
Appreciate any tips on how to solve this!
Thanks...
I think you have two options: a simpler but slower solution, or a faster but more complicated one.
- use Rolling.apply and test the pattern
- replace 0s with NaNs using mask
- use bfill with limit (same as fillna with method='bfill') to repeat the 1s
- then use fillna to replace the remaining NaNs with 0
- finally cast to bool with astype
import numpy as np

pat = np.asarray([1, 2, 2, 0])
N = len(pat)
df['rm0'] = (df['row_pat'].rolling(window=N, min_periods=N)
                          .apply(lambda x: (x==pat).all())
                          .mask(lambda x: x == 0)
                          .bfill(limit=N-1)
                          .fillna(0)
                          .astype(bool)
            )
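To make the steps above concrete, here is a minimal sketch on a ten-row toy frame. The names toy, flag_pattern, flag_all and flag_grp are made up for illustration and are not part of the original answer. Passing raw=True to Rolling.apply and using groupby(...).transform to keep matches inside each group_var are my assumptions about how the chain could be applied per group. It also assumes the rows are already sorted within each group.

```python
import numpy as np
import pandas as pd

pat = np.asarray([1, 2, 2, 0])
N = len(pat)

def flag_pattern(s):
    """Hypothetical helper: flag every row of `s` that lies inside a full match of `pat`."""
    return (s.rolling(window=N, min_periods=N)
             .apply(lambda x: (x == pat).all(), raw=True)  # 1.0 on the LAST row of a match
             .mask(lambda x: x == 0)   # non-matching windows become NaN
             .bfill(limit=N - 1)       # copy the 1.0 back over the N-1 preceding rows
             .fillna(0)
             .astype(bool))

# Toy data: the pattern occurs at rows 1-4 (inside A) and at rows 5-8 (straddling A and B).
toy = pd.DataFrame({
    'group_var': list('AAAAAABBBB'),
    'row_pat':   [3, 1, 2, 2, 0, 1, 2, 2, 0, 1],
})

# Whole column: both occurrences are flagged.
toy['flag_all'] = flag_pattern(toy['row_pat'])

# Per group: the occurrence that crosses the A/B boundary is no longer flagged.
toy['flag_grp'] = (toy.groupby('group_var')['row_pat']
                      .transform(flag_pattern)
                      .astype(bool))
print(toy)
```

In this sketch flag_all is True for rows 1 through 8, while flag_grp is True only for rows 1 through 4, because the rolling window restarts in every group.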
If performance is important, use strides; the solution from the linked answer was modified as follows:
- use the rolling window approach
- compare with the pattern and return Trues for full matches using all
- get the indices of the first occurrences with np.mgrid and boolean indexing
- create all indices with a list comprehension
- compare with numpy.in1d and create the new column
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return c
arr = df['row_pat'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)
c = np.mgrid[0:len(b)][b]
d = [i for x in c for i in range(x, x+N)]
df['rm2'] = np.in1d(np.arange(len(arr)), d)
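The snippet above scans the whole column, so a match can span two groups. Below is a hedged sketch of the same window-comparison idea applied per group. It swaps the hand-written as_strided helper for numpy.lib.stride_tricks.sliding_window_view (available in NumPy 1.20 and later) and uses fancy indexing instead of the list comprehension. The names flag_pattern_np and toy are hypothetical, and the approach assumes rows are already sorted within each group.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

pat = np.asarray([1, 2, 2, 0])

def flag_pattern_np(arr, pat):
    """Hypothetical helper: boolean mask of all rows covered by a full match of `pat`."""
    arr = np.asarray(arr)
    n = len(pat)
    out = np.zeros(len(arr), dtype=bool)
    if len(arr) < n:          # a group shorter than the pattern cannot match
        return out
    # One row per window; a window matches when all n positions equal the pattern.
    starts = np.flatnonzero((sliding_window_view(arr, n) == pat).all(axis=1))
    # Each start index s covers rows s, s+1, ..., s+n-1.
    out[(starts[:, None] + np.arange(n)).ravel()] = True
    return out

toy = pd.DataFrame({
    'group_var': list('AAAAAABBBB'),
    'row_pat':   [3, 1, 2, 2, 0, 1, 2, 2, 0, 1],
})

# Whole column versus per group (transform hands each group to the helper separately).
toy['flag_all'] = flag_pattern_np(toy['row_pat'].to_numpy(), pat)
toy['flag_grp'] = (toy.groupby('group_var')['row_pat']
                      .transform(lambda s: flag_pattern_np(s.to_numpy(), pat))
                      .astype(bool))
print(toy)
```

The length guard matters for small groups: with fewer rows than the pattern there are no windows to build, so the helper returns an all-False mask instead of failing.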
Another solution, thanks @divakar:
from scipy.ndimage import binary_dilation

arr = df['row_pat'].values
m = (rolling_window(arr, len(pat)) == pat).all(1)
m_ext = np.r_[m, np.zeros(len(arr) - len(m), dtype=bool)]
df['rm1'] = binary_dilation(m_ext, structure=[1]*N, origin=-(N//2))
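The shift() idea from the question can also be made group-aware. The sketch below is not from the original answer: it uses groupby(...).shift so that look-ahead never crosses into the next group, then extends each start flag forward over the following N-1 rows of the same group. The names toy, start, covered and pat_rows are illustrative only, and rows are assumed to be sorted within each group.

```python
import pandas as pd

pat = [1, 2, 2, 0]
N = len(pat)

toy = pd.DataFrame({
    'group_var': list('AAAAAABBBB'),
    'row_pat':   [3, 1, 2, 2, 0, 1, 2, 2, 0, 1],
})

g = toy.groupby('group_var')['row_pat']

# Start flag: this row equals pat[0] and the next N-1 rows of the SAME group match the rest.
# A grouped shift yields NaN past the end of a group, and NaN.eq(...) is False.
start = toy['row_pat'].eq(pat[0])
for k in range(1, N):
    start = start & g.shift(-k).eq(pat[k])

# Extend every start flag forward by 1..N-1 rows within its group.
# Integer flags are used so that fill_value=0 keeps a plain numeric dtype.
flags = start.astype(int)
by_group = flags.groupby(toy['group_var'])
covered = flags.copy()
for k in range(1, N):
    covered = covered + by_group.shift(k, fill_value=0)

toy['pat_rows'] = covered.gt(0)
print(toy)
```

On this toy frame only rows 1 through 4 end up flagged: the second 1, 2, 2, 0 run begins on the last row of group A, so its look-ahead values are NaN and it never counts as a start.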
Timings:
import numpy as np
import pandas as pd
from numpy.random import choice, randn
from scipy.ndimage import binary_dilation
import string

np.random.seed(456)

# df constructor
n_rows = 100000
df = pd.DataFrame({'date_time': pd.date_range('2/9/2018', periods=n_rows, freq='H'),
                   'group_var': choice(list(string.ascii_uppercase), n_rows),
                   'row_pat': choice([0, 1, 2, 3], n_rows),
                   'values': randn(n_rows)})
# sorting
df.sort_values(by=['group_var', 'date_time'], inplace=True)

# pattern to detect (same as above)
pat = np.asarray([1, 2, 2, 0])
N = len(pat)

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return c

# binary_dilation solution
arr = df['row_pat'].values
m = (rolling_window(arr, len(pat)) == pat).all(1)
m_ext = np.r_[m, np.zeros(len(arr) - len(m), dtype=bool)]
df['rm1'] = binary_dilation(m_ext, structure=[1]*N, origin=-(N//2))

# np.in1d solution
arr = df['row_pat'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)
c = np.mgrid[0:len(b)][b]
d = [i for x in c for i in range(x, x+N)]
df['rm2'] = np.in1d(np.arange(len(arr)), d)
print (df.iloc[460:480])
                date_time  group_var  row_pat     values    rm0    rm1    rm2
12045 2019-06-25 21:00:00          A        3  -0.081152  False  False  False
12094 2019-06-27 22:00:00          A        1  -0.818167  False  False  False
12125 2019-06-29 05:00:00          A        0  -0.051088  False  False  False
12143 2019-06-29 23:00:00          A        0  -0.937589  False  False  False
12145 2019-06-30 01:00:00          A        3   0.298460  False  False  False
12158 2019-06-30 14:00:00          A        1   0.647161  False  False  False
12164 2019-06-30 20:00:00          A        3  -0.735538  False  False  False
12210 2019-07-02 18:00:00          A        1  -0.881740  False  False  False
12341 2019-07-08 05:00:00          A        3   0.525652  False  False  False
12343 2019-07-08 07:00:00          A        1   0.311598  False  False  False
12358 2019-07-08 22:00:00          A        1  -0.710150   True   True   True
12360 2019-07-09 00:00:00          A        2  -0.752216   True   True   True
12400 2019-07-10 16:00:00          A        2  -0.205122   True   True   True
12404 2019-07-10 20:00:00          A        0   1.342591   True   True   True
12413 2019-07-11 05:00:00          A        1   1.707748  False  False  False
12506 2019-07-15 02:00:00          A        2   0.319227  False  False  False
12527 2019-07-15 23:00:00          A        3   2.130917  False  False  False
12600 2019-07-19 00:00:00          A        1  -1.314070  False  False  False
12604 2019-07-19 04:00:00          A        0   0.869059  False  False  False
12613 2019-07-19 13:00:00          A        2   1.342101  False  False  False
In [225]: %%timeit
     ...: df['rm0'] = (df['row_pat'].rolling(window=N , min_periods=N)
     ...:                           .apply(lambda x: (x==pat).all())
     ...:                           .mask(lambda x: x == 0)
     ...:                           .bfill(limit=N-1)
     ...:                           .fillna(0)
     ...:                           .astype(bool)
     ...:              )
     ...:
1 loop, best of 3: 356 ms per loop

In [226]: %%timeit
     ...: arr = df['row_pat'].values
     ...: b = np.all(rolling_window(arr, N) == pat, axis=1)
     ...: c = np.mgrid[0:len(b)][b]
     ...: d = [i for x in c for i in range(x, x+N)]
     ...: df['rm2'] = np.in1d(np.arange(len(arr)), d)
     ...:
100 loops, best of 3: 7.63 ms per loop

In [227]: %%timeit
     ...: arr = df['row_pat'].values
     ...: b = np.all(rolling_window(arr, N) == pat, axis=1)
     ...:
     ...: m = (rolling_window(arr, len(pat)) == pat).all(1)
     ...: m_ext = np.r_[m,np.zeros(len(arr) - len(m), dtype=bool)]
     ...: df['rm1'] = binary_dilation(m_ext, structure=[1]*N, origin=-(N//2))
     ...:
100 loops, best of 3: 7.25 ms per loop