Pandas - Find and index rows that match row sequence pattern


Problem description

I would like to find a pattern in a categorical variable going down the rows of a dataframe. I can see how to use Series.shift() to look up or down and use boolean logic to find the pattern. However, I want to do this within a grouping variable, and to label all rows that are part of the pattern, not just the starting row.

Code:

import pandas as pd
from numpy.random import choice, randn
import string

# df constructor
n_rows = 1000
df = pd.DataFrame({'date_time': pd.date_range('2/9/2018', periods=n_rows, freq='H'),
                   'group_var': choice(list(string.ascii_uppercase), n_rows),
                   'row_pat': choice([0, 1, 2, 3], n_rows),
                   'values': randn(n_rows)})

# sorting 
df.sort_values(by=['group_var', 'date_time'], inplace=True)
df.head(10)

This returns the first ten rows of the sorted frame.

I can find the start of the pattern (though without grouping) like this:

# the row ordinal pattern to detect
p0, p1, p2, p3 = 1, 2, 2, 0

# flag the row at the start of the pattern
# (parentheses let the expression continue across lines)
df['pat_flag'] = (
    df['row_pat'].eq(p0) &
    df['row_pat'].shift(-1).eq(p1) &
    df['row_pat'].shift(-2).eq(p2) &
    df['row_pat'].shift(-3).eq(p3)
)

df.head(10)

What I can't figure out is how to do this only within each "group_var" and, instead of returning True only for the start of the pattern, return True for all rows that are part of the pattern.

Appreciate any tips on how to solve this!

Thanks...

Solution

I think you have two options: a simpler but slower solution, or a faster but more complicated one. The simpler one:

  • use Rolling.apply to test each window against the pattern
  • replace 0s with NaNs using mask
  • use bfill with limit (same as fillna with method='bfill') to repeat each 1 over the preceding N-1 rows
  • then fill the remaining NaNs with 0 using fillna
  • finally cast to bool with astype

import numpy as np

pat = np.asarray([1, 2, 2, 0])
N = len(pat)

# flag every row that belongs to a full match of pat
df['rm0'] = (df['row_pat'].rolling(window=N, min_periods=N)
                          .apply(lambda x: (x==pat).all())
                          .mask(lambda x: x == 0)
                          .bfill(limit=N-1)
                          .fillna(0)
                          .astype(bool)
             )
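To see what each step in the chain contributes, here is a minimal sketch on a small made-up Series (the toy data and the intermediate names `hit_end`, `only_hits` and `spread` are assumptions for illustration, not part of the original answer). It passes `raw=True` so the lambda receives a plain NumPy array.

```python
import numpy as np
import pandas as pd

pat = np.asarray([1, 2, 2, 0])
N = len(pat)

# Toy data (made up for illustration): the pattern occupies positions 2..5.
s = pd.Series([3, 0, 1, 2, 2, 0, 1, 3])

# Step 1: 1.0 where the window ENDING at this row equals the pattern, else 0.0.
# The first N-1 rows have no full window and come out as NaN.
hit_end = s.rolling(window=N, min_periods=N).apply(
    lambda x: float((x == pat).all()), raw=True
)

# Step 2: turn the 0.0 entries into NaN so that only the hits remain.
only_hits = hit_end.mask(hit_end == 0)

# Step 3: copy each hit backwards over the N-1 rows before it.
spread = only_hits.bfill(limit=N - 1)

# Step 4: whatever is still NaN is not part of a match.
flag = spread.fillna(0).astype(bool)

print(flag.tolist())
# [False, False, True, True, True, True, False, False]
```

The rolling window marks the last row of a match, which is why the fill has to run backwards with `limit=N-1`.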

If performance is important, use strides; this is a modified version of the linked solution:

  • use the rolling window approach
  • compare with the pattern and return True for each full match using all
  • get the indices of the first occurrences with np.mgrid and boolean indexing
  • create all the indices with a list comprehension
  • compare with numpy.in1d and create the new column

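import numpy as np

def rolling_window(a, window):
    # view of a in which row i is a[i:i+window] (no data is copied)
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return c

arr = df['row_pat'].values
# True where a full match of pat starts
b = np.all(rolling_window(arr, N) == pat, axis=1)
# positions of those starts
c = np.mgrid[0:len(b)][b]

# every position covered by a match
d = [i  for x in c for i in range(x, x+N)]
# np.in1d is deprecated in NumPy 2.0; np.isin is the preferred spelling there
df['rm2'] = np.in1d(np.arange(len(arr)), d)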
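As an alternative sketch of the same idea, NumPy 1.20 and later ship `numpy.lib.stride_tricks.sliding_window_view`, which builds the windowed view without hand-written strides. The toy array and the shifted-OR loop below are assumptions for illustration. The loop replaces the list comprehension and `np.in1d` step from the answer and is not taken from it.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

pat = np.asarray([1, 2, 2, 0])
N = len(pat)
arr = np.array([3, 0, 1, 2, 2, 0, 1, 3])  # toy data, made up for illustration

# Read-only view of shape (len(arr) - N + 1, N); row i is arr[i:i+N].
windows = sliding_window_view(arr, N)

# starts[i] is True where a full match begins at position i.
starts = (windows == pat).all(axis=1)

# Mark every row covered by a match: a start at i covers i .. i+N-1,
# so OR the start mask into the result at offsets 0 .. N-1.
covered = np.zeros(len(arr), dtype=bool)
for k in range(N):
    covered[k:k + len(starts)] |= starts

print(covered.tolist())
# [False, False, True, True, True, True, False, False]
```

Note that `sliding_window_view` raises a `ValueError` when the array is shorter than the window, so very short inputs need a guard.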

Another solution, thanks @divakar:

from scipy.ndimage import binary_dilation

arr = df['row_pat'].values

# True where a full match of pat starts
m = (rolling_window(arr, len(pat)) == pat).all(1)
# pad to the length of arr, then spread each start over the N rows it covers
m_ext = np.r_[m,np.zeros(len(arr) - len(m), dtype=bool)]
df['rm1'] = binary_dilation(m_ext, structure=[1]*N, origin=-(N//2))
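The snippets above scan row_pat as one long array, so a match can straddle the boundary between two groups. To keep matches inside each group_var, one possible sketch is to run the window test per group with `groupby(...).transform`. The helper name `flag_pattern`, the column name `in_pat` and the toy frame are assumptions for illustration. It needs NumPy 1.20 or later for `sliding_window_view`.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

pat = np.asarray([1, 2, 2, 0])

def flag_pattern(values, pat):
    """Hypothetical helper: True for every element inside a full match of pat."""
    values = np.asarray(values)
    n = len(pat)
    covered = np.zeros(len(values), dtype=bool)
    if len(values) < n:  # too short to contain the pattern at all
        return covered
    starts = (sliding_window_view(values, n) == pat).all(axis=1)
    for k in range(n):  # a start at i covers i .. i+n-1
        covered[k:k + len(starts)] |= starts
    return covered

# Toy frame (made up): 1, 2, 2, 0 straddles the A/B boundary at rows 1..4
# and also occurs wholly inside group B at rows 5..8.
df = pd.DataFrame({
    'group_var': list('AAA') + list('BBBBBB'),
    'row_pat':   [3, 1, 2,      2, 0, 1, 2, 2, 0],
})

# Apply the helper to each group separately; astype(bool) guards against
# pandas versions that cast the transform result back to the input dtype.
df['in_pat'] = (df.groupby('group_var')['row_pat']
                  .transform(lambda s: flag_pattern(s.to_numpy(), pat))
                  .astype(bool))

print(df['in_pat'].tolist())
# [False, False, False, False, False, True, True, True, True]
```

Only the match inside group B is flagged; the one spanning A and B is ignored. Calling a Python function once per group is slower than a single pass over the whole array, so this trades some speed for correctness at group boundaries.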

Timings:

import string

import numpy as np
import pandas as pd
from numpy.random import choice, randn
from scipy.ndimage import binary_dilation

np.random.seed(456)

# df constructor
n_rows = 100000
df = pd.DataFrame({'date_time': pd.date_range('2/9/2018', periods=n_rows, freq='H'),
                   'group_var': choice(list(string.ascii_uppercase), n_rows),
                   'row_pat': choice([0, 1, 2, 3], n_rows),
                   'values': randn(n_rows)})

# sorting 
df.sort_values(by=['group_var', 'date_time'], inplace=True)


def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return c


arr = df['row_pat'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)

m = (rolling_window(arr, len(pat)) == pat).all(1)
m_ext = np.r_[m,np.zeros(len(arr) - len(m), dtype=bool)]
df['rm1'] = binary_dilation(m_ext, structure=[1]*N, origin=-(N//2))

arr = df['row_pat'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)
c = np.mgrid[0:len(b)][b]

d = [i  for x in c for i in range(x, x+N)]
df['rm2'] = np.in1d(np.arange(len(arr)), d)


print (df.iloc[460:480])

                date_time group_var  row_pat    values    rm0    rm1    rm2
12045 2019-06-25 21:00:00         A        3 -0.081152  False  False  False
12094 2019-06-27 22:00:00         A        1 -0.818167  False  False  False
12125 2019-06-29 05:00:00         A        0 -0.051088  False  False  False
12143 2019-06-29 23:00:00         A        0 -0.937589  False  False  False
12145 2019-06-30 01:00:00         A        3  0.298460  False  False  False
12158 2019-06-30 14:00:00         A        1  0.647161  False  False  False
12164 2019-06-30 20:00:00         A        3 -0.735538  False  False  False
12210 2019-07-02 18:00:00         A        1 -0.881740  False  False  False
12341 2019-07-08 05:00:00         A        3  0.525652  False  False  False
12343 2019-07-08 07:00:00         A        1  0.311598  False  False  False
12358 2019-07-08 22:00:00         A        1 -0.710150   True   True   True
12360 2019-07-09 00:00:00         A        2 -0.752216   True   True   True
12400 2019-07-10 16:00:00         A        2 -0.205122   True   True   True
12404 2019-07-10 20:00:00         A        0  1.342591   True   True   True
12413 2019-07-11 05:00:00         A        1  1.707748  False  False  False
12506 2019-07-15 02:00:00         A        2  0.319227  False  False  False
12527 2019-07-15 23:00:00         A        3  2.130917  False  False  False
12600 2019-07-19 00:00:00         A        1 -1.314070  False  False  False
12604 2019-07-19 04:00:00         A        0  0.869059  False  False  False
12613 2019-07-19 13:00:00         A        2  1.342101  False  False  False


In [225]: %%timeit
     ...: df['rm0'] = (df['row_pat'].rolling(window=N , min_periods=N)
     ...:                           .apply(lambda x: (x==pat).all())
     ...:                           .mask(lambda x: x == 0) 
     ...:                           .bfill(limit=N-1)
     ...:                           .fillna(0)
     ...:                           .astype(bool)
     ...:              )
     ...: 
1 loop, best of 3: 356 ms per loop

In [226]: %%timeit
     ...: arr = df['row_pat'].values
     ...: b = np.all(rolling_window(arr, N) == pat, axis=1)
     ...: c = np.mgrid[0:len(b)][b]
     ...: d = [i  for x in c for i in range(x, x+N)]
     ...: df['rm2'] = np.in1d(np.arange(len(arr)), d)
     ...: 
100 loops, best of 3: 7.63 ms per loop

In [227]: %%timeit
     ...: arr = df['row_pat'].values
     ...: b = np.all(rolling_window(arr, N) == pat, axis=1)
     ...: 
     ...: m = (rolling_window(arr, len(pat)) == pat).all(1)
     ...: m_ext = np.r_[m,np.zeros(len(arr) - len(m), dtype=bool)]
     ...: df['rm1'] = binary_dilation(m_ext, structure=[1]*N, origin=-(N//2))
     ...: 
100 loops, best of 3: 7.25 ms per loop
