在R中的矩阵中查找模式 [英] Finding pattern in a matrix in R

查看:81
本文介绍了在R中的矩阵中查找模式的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

例如,我有一个8 x n的矩阵

set.seed(12345)
m <- matrix(sample(1:50, 800, replace=T), ncol=8)
head(m)

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]   37   15   30    3    4   11   35   31
[2,]   44   31   45   30   24   39    1   18
[3,]   39   49    7   36   14   43   26   24
[4,]   45   31   26   33   12   47   37   15
[5,]   23   27   34   29   30   34   17    4
[6,]    9   46   39   34    8   43   42   37

我想在矩阵中找到特定的模式,例如,我想知道在哪里可以找到37,然后在下一行中分别输入10和29,在其后一行中输入42

例如,发生在上述矩阵的第57:59行

m[57:59,]
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]  *37   35    1   30   47    9   12   39
[2,]    5   22  *10  *29   13    5   17   36
[3,]   22   43    6    2   27   35  *42   50

一种解决方案(可能效率不高)是使所有包含37个行的行都带有

sapply(1:nrow(m), function(x){37 %in% m[x,]})

然后使用一些循环来测试其他条件.

我如何编写一个有效的函数来做到这一点,可以将其推广到任何用户给定的模式(不一定要遍历3行,可能有空洞",每行中值的数目可变等)? /p>

回答各种评论

  • 我需要找到精确的模式
  • 同一行中的顺序无关紧要(如果它使事情变得更容易,则可以在每一行中对值进行排序)
  • 线必须相邻.
  • 我想获取所有返回的模式的(开始)位置(即,如果该模式在矩阵中多次出现,我希望有多个返回值).
  • 用户将通过GUI输入模式,但我尚未决定如何进行.例如,要搜索上述模式,他可能会写类似

37;10,29;42

其中;代表新行,而,分隔同一行中的值. 同样,我们可能会寻找

50,51;;75;80,81

n行中的50和51,n + 2行中的75,n + 3行中的80和81

解决方案

以下是广义函数:

PatternMatcher <- function(data, pattern, idx = NULL) {
  p <- unlist(pattern[1])
  if(is.null(idx)){
    p <- unlist(pattern[length(pattern)])
    PatternMatcher(data, rev(pattern)[-1], 
                   idx = Filter(function(n) all(p %in% intersect(data[n, ], p)),
                                1:nrow(data)))
  } else if(length(pattern) > 1) {
    PatternMatcher(data, pattern[-1], 
                   idx = Filter(function(n) all(p %in% intersect(data[n, ], p)), 
                                idx - 1))
  } else
    Filter(function(n) all(p %in% intersect(data[n, ], p)), idx - 1)
}

这是一个递归函数,它将在每次迭代中减少pattern,并且仅检查在上次迭代中标识的行之后的行.列表结构允许以一种方便的方式传递模式:

PatternMatcher(m, list(37, list(10, 29), 42))
# [1] 57
PatternMatcher(m, list(list(45, 24, 1), 7, list(45, 31), 4))
# [1] 2
PatternMatcher(m, list(1,3))
# [1] 47 48 93

上面函数的想法似乎很好:检查向量pattern[[1]]的所有行并获取索引r1,然后检查r1+1的行pattern[[2]]并获取r2等.但是遍历所有行时,第一步确实要花费很多时间.当然,例如m <- matrix(sample(1:10, 800, replace=T), ncol=8),即当索引r1r2的变化不大时... ...因此这是另一种方法,此处PatternMatcher看起来非常相似,但是还有另一个函数matchRow用于查找具有vector所有元素的行.

matchRow <- function(data, vector, idx = NULL){
  if(is.null(idx)){
    matchRow(data, vector[-1], 
             as.numeric(unique(rownames(which(data == vector[1], arr.ind = TRUE)))))
  } else if(length(vector) > 0) {
    matchRow(data, vector[-1], 
             as.numeric(unique(rownames(which(data[idx, , drop = FALSE] == vector[1], arr.ind = TRUE)))))
  } else idx
}
PatternMatcher <- function(data, pattern, idx = NULL) {
  p <- pattern[[1]]
  if(is.null(idx)){
    rownames(data) <- 1:nrow(data)
    p <- pattern[[length(pattern)]]
    PatternMatcher(data, rev(pattern)[-1], idx = matchRow(data, p))
  } else if(length(pattern) > 1) {
    PatternMatcher(data, pattern[-1], idx = matchRow(data, p, idx - 1))
  } else
    matchRow(data, p, idx - 1)
}

与上一个功能的比较:

library(rbenchmark)
bigM <- matrix(sample(1:50, 800000, replace=T), ncol=8)
benchmark(PatternMatcher(bigM, list(37, c(10, 29), 42)), 
          PatternMatcher(bigM, list(1, 3)), 
          OldPatternMatcher(bigM, list(37, list(10, 29), 42)), 
          OldPatternMatcher(bigM, list(1, 3)), 
          replications = 10,
          columns = c("test", "elapsed"))
#                                                  test elapsed
# 4                 OldPatternMatcher(bigM, list(1, 3))   61.14
# 3 OldPatternMatcher(bigM, list(37, list(10, 29), 42))   63.28
# 2                    PatternMatcher(bigM, list(1, 3))    1.58
# 1       PatternMatcher(bigM, list(37, c(10, 29), 42))    2.02

verybigM1 <- matrix(sample(1:40, 8000000, replace=T), ncol=20)
verybigM2 <- matrix(sample(1:140, 8000000, replace=T), ncol=20)
benchmark(PatternMatcher(verybigM1, list(37, c(10, 29), 42)), 
          PatternMatcher(verybigM2, list(37, c(10, 29), 42)), 
          find.combo(verybigM1, convert.gui.input("37;10,29;42")),
          find.combo(verybigM2, convert.gui.input("37;10,29;42")),          
          replications = 20,
          columns = c("test", "elapsed"))
#                                                      test elapsed
# 3 find.combo(verybigM1, convert.gui.input("37;10,29;42"))   17.55
# 4 find.combo(verybigM2, convert.gui.input("37;10,29;42"))   18.72
# 1      PatternMatcher(verybigM1, list(37, c(10, 29), 42))   15.84
# 2      PatternMatcher(verybigM2, list(37, c(10, 29), 42))   19.62

现在,pattern参数应该类似于list(37, c(10, 29), 42)而不是list(37, list(10, 29), 42).最后:

fastPattern <- function(data, pattern)
  PatternMatcher(data, lapply(strsplit(pattern, ";")[[1]], 
                    function(i) as.numeric(unlist(strsplit(i, split = ",")))))
fastPattern(m, "37;10,29;42")
# [1] 57
fastPattern(m, "37;;42")
# [1] 57  4
fastPattern(m, "37;;;42")
# [1] 33 56 77

I have a 8 x n matrix, for instance

set.seed(12345)
m <- matrix(sample(1:50, 800, replace=T), ncol=8)
head(m)

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]   37   15   30    3    4   11   35   31
[2,]   44   31   45   30   24   39    1   18
[3,]   39   49    7   36   14   43   26   24
[4,]   45   31   26   33   12   47   37   15
[5,]   23   27   34   29   30   34   17    4
[6,]    9   46   39   34    8   43   42   37

I would like to find a certain pattern in the matrix, for instance I would like to know where I can find a 37, followed in the next line by a 10 and a 29 and the line after by a 42

This happens, for instance, in lines 57:59 of the above matrix

m[57:59,]
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]  *37   35    1   30   47    9   12   39
[2,]    5   22  *10  *29   13    5   17   36
[3,]   22   43    6    2   27   35  *42   50

A (probably inefficient) solution is to get all the lines containing 37 with

sapply(1:nrow(m), function(x){37 %in% m[x,]})

And then use a few loops to test the other conditions.

How could I write an efficient function to do this, that can be generalized to any user-given pattern (not necessarily over 3 lines, with possible "holes", with variable number of values in each line etc).?

EDIT: to answer various comments

  • I need to find the EXACT pattern
  • The order in the same row does not matter (if it makes things easier values can be ordered in each row)
  • The lines have to be adjacent.
  • I want to get the (starting) position of all the pattern returned (i.e., if the pattern is present multiple times in the matrix I want multiple return values).
  • The user would enter the pattern via a GUI, I have yet to decide how. For instance, to search for the above pattern he may write something like

37;10,29;42

Where ; represents a new line and , separates values on the same line. Similarly we may look for

50,51;;75;80,81

Meaning 50 and 51 in line n, 75 in line n+2, and 80 and 81 in line n+3

解决方案

Here is a generalized function:

PatternMatcher <- function(data, pattern, idx = NULL) {
  p <- unlist(pattern[1])
  if(is.null(idx)){
    p <- unlist(pattern[length(pattern)])
    PatternMatcher(data, rev(pattern)[-1], 
                   idx = Filter(function(n) all(p %in% intersect(data[n, ], p)),
                                1:nrow(data)))
  } else if(length(pattern) > 1) {
    PatternMatcher(data, pattern[-1], 
                   idx = Filter(function(n) all(p %in% intersect(data[n, ], p)), 
                                idx - 1))
  } else
    Filter(function(n) all(p %in% intersect(data[n, ], p)), idx - 1)
}

This is a recursive function which is reducing pattern in every iteration and checks only rows that go right after ones identified in the previous iteration. List structure allows passing the pattern in a convenient way:

PatternMatcher(m, list(37, list(10, 29), 42))
# [1] 57
PatternMatcher(m, list(list(45, 24, 1), 7, list(45, 31), 4))
# [1] 2
PatternMatcher(m, list(1,3))
# [1] 47 48 93

Edit: The idea of the function above seems fine: check all rows for the vector pattern[[1]] and get indices r1, then check rows r1+1 for pattern[[2]] and get r2, etc. But it takes really much time at the first step when going through all rows. Of course, every step would take much time with e.g. m <- matrix(sample(1:10, 800, replace=T), ncol=8), i.e. when there is not much of a change in indices r1, r2, ... So here is another approach, here PatternMatcher looks very similar, but there is another function matchRow for finding rows that have all elements of vector.

matchRow <- function(data, vector, idx = NULL){
  if(is.null(idx)){
    matchRow(data, vector[-1], 
             as.numeric(unique(rownames(which(data == vector[1], arr.ind = TRUE)))))
  } else if(length(vector) > 0) {
    matchRow(data, vector[-1], 
             as.numeric(unique(rownames(which(data[idx, , drop = FALSE] == vector[1], arr.ind = TRUE)))))
  } else idx
}
PatternMatcher <- function(data, pattern, idx = NULL) {
  p <- pattern[[1]]
  if(is.null(idx)){
    rownames(data) <- 1:nrow(data)
    p <- pattern[[length(pattern)]]
    PatternMatcher(data, rev(pattern)[-1], idx = matchRow(data, p))
  } else if(length(pattern) > 1) {
    PatternMatcher(data, pattern[-1], idx = matchRow(data, p, idx - 1))
  } else
    matchRow(data, p, idx - 1)
}

Comparison with the previous function:

library(rbenchmark)
bigM <- matrix(sample(1:50, 800000, replace=T), ncol=8)
benchmark(PatternMatcher(bigM, list(37, c(10, 29), 42)), 
          PatternMatcher(bigM, list(1, 3)), 
          OldPatternMatcher(bigM, list(37, list(10, 29), 42)), 
          OldPatternMatcher(bigM, list(1, 3)), 
          replications = 10,
          columns = c("test", "elapsed"))
#                                                  test elapsed
# 4                 OldPatternMatcher(bigM, list(1, 3))   61.14
# 3 OldPatternMatcher(bigM, list(37, list(10, 29), 42))   63.28
# 2                    PatternMatcher(bigM, list(1, 3))    1.58
# 1       PatternMatcher(bigM, list(37, c(10, 29), 42))    2.02

verybigM1 <- matrix(sample(1:40, 8000000, replace=T), ncol=20)
verybigM2 <- matrix(sample(1:140, 8000000, replace=T), ncol=20)
benchmark(PatternMatcher(verybigM1, list(37, c(10, 29), 42)), 
          PatternMatcher(verybigM2, list(37, c(10, 29), 42)), 
          find.combo(verybigM1, convert.gui.input("37;10,29;42")),
          find.combo(verybigM2, convert.gui.input("37;10,29;42")),          
          replications = 20,
          columns = c("test", "elapsed"))
#                                                      test elapsed
# 3 find.combo(verybigM1, convert.gui.input("37;10,29;42"))   17.55
# 4 find.combo(verybigM2, convert.gui.input("37;10,29;42"))   18.72
# 1      PatternMatcher(verybigM1, list(37, c(10, 29), 42))   15.84
# 2      PatternMatcher(verybigM2, list(37, c(10, 29), 42))   19.62

Also now the pattern argument should be like list(37, c(10, 29), 42) instead of list(37, list(10, 29), 42). And finally:

fastPattern <- function(data, pattern)
  PatternMatcher(data, lapply(strsplit(pattern, ";")[[1]], 
                    function(i) as.numeric(unlist(strsplit(i, split = ",")))))
fastPattern(m, "37;10,29;42")
# [1] 57
fastPattern(m, "37;;42")
# [1] 57  4
fastPattern(m, "37;;;42")
# [1] 33 56 77

这篇关于在R中的矩阵中查找模式的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆