Pandas - Find and index rows that match row sequence pattern


Problem description

I would like to find a pattern in a categorical variable going down the rows of a DataFrame. I can see how to use Series.shift() to look up or down and use boolean logic to find the pattern. However, I want to do this within a grouping variable, and I also want to label all rows that are part of the pattern, not just the starting row.

Code:

import pandas as pd
from numpy.random import choice, randn
import string

# df constructor
n_rows = 1000
df = pd.DataFrame({'date_time': pd.date_range('2/9/2018', periods=n_rows, freq='H'),
                   'group_var': choice(list(string.ascii_uppercase), n_rows),
                   'row_pat': choice([0, 1, 2, 3], n_rows),
                   'values': randn(n_rows)})

# sorting 
df.sort_values(by=['group_var', 'date_time'], inplace=True)
df.head(10)

I can find the start of the pattern (though without grouping) like this:

# the row ordinal pattern to detect
p0, p1, p2, p3 = 1, 2, 2, 0 

# flag the row at the start of the pattern
df['pat_flag'] = \
df['row_pat'].eq(p0) & \
df['row_pat'].shift(-1).eq(p1) & \
df['row_pat'].shift(-2).eq(p2) & \
df['row_pat'].shift(-3).eq(p3)

df.head(10)

What I can't figure out is how to do this only within each "group_var", and how to return True for all rows that are part of the pattern rather than only for its first row.

I'd appreciate any tips on how to solve this!

Thanks...

Recommended answer

I think you have two options: a simpler but slower solution, or a faster but more complicated one.

  • use Rolling.apply to test each window against the pattern
  • replace 0s with NaNs using mask
  • use bfill with limit (same as fillna with method='bfill') to repeat each 1 over the preceding N-1 rows
  • then fill the remaining NaNs with 0 using fillna
  • finally cast to bool with astype
import numpy as np

pat = np.asarray([1, 2, 2, 0])
N = len(pat)


df['rm0'] = (df['row_pat'].rolling(window=N , min_periods=N)
                          .apply(lambda x: (x==pat).all())
                          .mask(lambda x: x == 0) 
                          .bfill(limit=N-1)
                          .fillna(0)
                          .astype(bool)
             )
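None of the snippets here restrict matching to a single group_var, so a window can straddle two groups. The following is a minimal sketch of one way to respect the grouping, assuming the frame is already sorted in the order in which the pattern should be read. It swaps the rolling approach for groupby plus shift. The helper name flag_pattern_rows and the small demo frame are illustrative and not part of the original answer.

```python
import pandas as pd


def flag_pattern_rows(df, group_col, value_col, pattern):
    """Mark every row that belongs to a full match of `pattern` inside one group.

    Hypothetical helper; assumes `df` is already sorted in the order in which
    the pattern should be read.
    """
    n = len(pattern)
    grouped = df.groupby(group_col)[value_col]

    # starts[i] is True when row i and the next n-1 rows of the same group
    # equal the pattern. Shifting past the end of a group yields NaN,
    # which never compares equal.
    starts = df[value_col].eq(pattern[0])
    for k in range(1, n):
        starts &= grouped.shift(-k).eq(pattern[k])

    # Spread each start flag forward over the n-1 following rows of its group.
    starts_by_group = starts.astype("int8").groupby(df[group_col])
    member = starts.copy()
    for k in range(1, n):
        member |= starts_by_group.shift(k).eq(1)
    return member


# Rows 4-7 read 1, 2, 2, 0 but cross the A/B boundary, so they must not match.
demo = pd.DataFrame({
    "group_var": list("AAAAAABBBBBB"),
    "row_pat": [1, 2, 2, 0, 1, 2, 2, 0, 1, 2, 2, 0],
})
demo["pat_flag"] = flag_pattern_rows(demo, "group_var", "row_pat", [1, 2, 2, 0])
print(demo)
```

In the demo frame only the first four rows of group A and the last four rows of group B are flagged. An ungrouped shift would also flag the run that starts in A and ends in B.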

If performance matters, use strides. The solution from the linked answer was modified:


  • use a rolling window approach
  • compare each window with the pattern and return True where all elements match
  • get the indices of the first occurrences with np.mgrid and boolean indexing
  • create all indices with a list comprehension
  • compare with numpy.in1d and create a new column
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return c

arr = df['row_pat'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)
c = np.mgrid[0:len(b)][b]

d = [i  for x in c for i in range(x, x+N)]
df['rm2'] = np.in1d(np.arange(len(arr)), d)
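as_strided performs no bounds checking, so a wrong shape or stride reads arbitrary memory. As a hedged alternative, the sketch below follows the same steps with numpy.lib.stride_tricks.sliding_window_view (available in NumPy 1.20 and later) and builds the member indices by broadcasting instead of a list comprehension. The helper name pattern_member_mask is an assumption made for illustration. Like the code above, it ignores group_var.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def pattern_member_mask(arr, pat):
    """Return a boolean mask that is True on every element inside a match of `pat`."""
    arr = np.asarray(arr)
    pat = np.asarray(pat)
    n = len(pat)
    mask = np.zeros(len(arr), dtype=bool)
    if len(arr) < n:
        return mask  # no complete window fits

    # One row per window of length n; a window matches when all n elements agree.
    windows = sliding_window_view(arr, n)
    starts = np.flatnonzero((windows == pat).all(axis=1))

    # Each start s covers positions s, s+1, ..., s+n-1.
    idx = (starts[:, None] + np.arange(n)).ravel()
    mask[idx] = True
    return mask


values = np.array([0, 1, 2, 2, 0, 3, 1, 2, 0])
mask = pattern_member_mask(values, [1, 2, 2, 0])
print(mask.astype(int))
```

With the sample array the only match starts at position 1, so positions 1 to 4 are flagged.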

Another solution, thanks to @divakar:

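from scipy.ndimage import binary_dilation

arr = df['row_pat'].values

# True at each position where a full match of the pattern starts
m = (rolling_window(arr, len(pat)) == pat).all(1)
# pad with False so the mask has one entry per row
m_ext = np.r_[m, np.zeros(len(arr) - len(m), dtype=bool)]
# spread each start flag over the N rows of the match
df['rm1'] = binary_dilation(m_ext, structure=[1]*N, origin=-(N//2))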
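The dilation step only spreads each start flag over the N rows of its match. A NumPy-only sketch of that spreading, using np.convolve in place of scipy's binary_dilation (a swapped-in technique, with the hypothetical helper name spread_starts), may make the intent easier to see:

```python
import numpy as np


def spread_starts(starts, n):
    """Return a mask that is True on each start position and the n-1 after it."""
    starts = np.asarray(starts, dtype=bool)
    # Full convolution with n ones counts the starts in positions i-n+1..i;
    # truncating to len(starts) keeps one entry per original row.
    counts = np.convolve(starts.astype(int), np.ones(n, dtype=int))[:len(starts)]
    return counts > 0


starts = np.array([False, True, False, False, False, False, False])
member = spread_starts(starts, 4)
print(member.astype(int))
```

Here the single start at position 1 is extended to positions 1 to 4.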

Timings

import string

import numpy as np
import pandas as pd
from numpy.random import choice, randn
from scipy.ndimage import binary_dilation

np.random.seed(456)

# df constructor
n_rows = 100000
df = pd.DataFrame({'date_time': pd.date_range('2/9/2018', periods=n_rows, freq='H'),
                   'group_var': choice(list(string.ascii_uppercase), n_rows),
                   'row_pat': choice([0, 1, 2, 3], n_rows),
                   'values': randn(n_rows)})

# sorting 
df.sort_values(by=['group_var', 'date_time'], inplace=True)

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return c


arr = df['row_pat'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)

m = (rolling_window(arr, len(pat)) == pat).all(1)
m_ext = np.r_[m,np.zeros(len(arr) - len(m), dtype=bool)]
df['rm1'] = binary_dilation(m_ext, structure=[1]*N, origin=-(N//2))

arr = df['row_pat'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)
c = np.mgrid[0:len(b)][b]

d = [i  for x in c for i in range(x, x+N)]
df['rm2'] = np.in1d(np.arange(len(arr)), d)

print (df.iloc[460:480])

                date_time group_var  row_pat    values    rm0    rm1    rm2
12045 2019-06-25 21:00:00         A        3 -0.081152  False  False  False
12094 2019-06-27 22:00:00         A        1 -0.818167  False  False  False
12125 2019-06-29 05:00:00         A        0 -0.051088  False  False  False
12143 2019-06-29 23:00:00         A        0 -0.937589  False  False  False
12145 2019-06-30 01:00:00         A        3  0.298460  False  False  False
12158 2019-06-30 14:00:00         A        1  0.647161  False  False  False
12164 2019-06-30 20:00:00         A        3 -0.735538  False  False  False
12210 2019-07-02 18:00:00         A        1 -0.881740  False  False  False
12341 2019-07-08 05:00:00         A        3  0.525652  False  False  False
12343 2019-07-08 07:00:00         A        1  0.311598  False  False  False
12358 2019-07-08 22:00:00         A        1 -0.710150   True   True   True
12360 2019-07-09 00:00:00         A        2 -0.752216   True   True   True
12400 2019-07-10 16:00:00         A        2 -0.205122   True   True   True
12404 2019-07-10 20:00:00         A        0  1.342591   True   True   True
12413 2019-07-11 05:00:00         A        1  1.707748  False  False  False
12506 2019-07-15 02:00:00         A        2  0.319227  False  False  False
12527 2019-07-15 23:00:00         A        3  2.130917  False  False  False
12600 2019-07-19 00:00:00         A        1 -1.314070  False  False  False
12604 2019-07-19 04:00:00         A        0  0.869059  False  False  False
12613 2019-07-19 13:00:00         A        2  1.342101  False  False  False

In [225]: %%timeit
     ...: df['rm0'] = (df['row_pat'].rolling(window=N , min_periods=N)
     ...:                           .apply(lambda x: (x==pat).all())
     ...:                           .mask(lambda x: x == 0) 
     ...:                           .bfill(limit=N-1)
     ...:                           .fillna(0)
     ...:                           .astype(bool)
     ...:              )
     ...: 
1 loop, best of 3: 356 ms per loop

In [226]: %%timeit
     ...: arr = df['row_pat'].values
     ...: b = np.all(rolling_window(arr, N) == pat, axis=1)
     ...: c = np.mgrid[0:len(b)][b]
     ...: d = [i  for x in c for i in range(x, x+N)]
     ...: df['rm2'] = np.in1d(np.arange(len(arr)), d)
     ...: 
100 loops, best of 3: 7.63 ms per loop

In [227]: %%timeit
     ...: arr = df['row_pat'].values
     ...: b = np.all(rolling_window(arr, N) == pat, axis=1)
     ...: 
     ...: m = (rolling_window(arr, len(pat)) == pat).all(1)
     ...: m_ext = np.r_[m,np.zeros(len(arr) - len(m), dtype=bool)]
     ...: df['rm1'] = binary_dilation(m_ext, structure=[1]*N, origin=-(N//2))
     ...: 
100 loops, best of 3: 7.25 ms per loop
