在R中的矩阵中查找模式 [英] Finding pattern in a matrix in R
问题描述
例如,我有一个8 x n的矩阵
set.seed(12345)
m <- matrix(sample(1:50, 800, replace=T), ncol=8)
head(m)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 37 15 30 3 4 11 35 31
[2,] 44 31 45 30 24 39 1 18
[3,] 39 49 7 36 14 43 26 24
[4,] 45 31 26 33 12 47 37 15
[5,] 23 27 34 29 30 34 17 4
[6,] 9 46 39 34 8 43 42 37
我想在矩阵中找到特定的模式,例如,我想知道在哪里可以找到37,然后在下一行中分别输入10和29,在其后一行中输入42
例如,发生在上述矩阵的第57:59行
m[57:59,]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] *37 35 1 30 47 9 12 39
[2,] 5 22 *10 *29 13 5 17 36
[3,] 22 43 6 2 27 35 *42 50
一种解决方案(可能效率不高)是使所有包含37个行的行都带有
sapply(1:nrow(m), function(x){37 %in% m[x,]})
然后使用一些循环来测试其他条件.
我如何编写一个有效的函数来做到这一点,可以将其推广到任何用户给定的模式(不一定要遍历3行,可能有空洞",每行中值的数目可变等)? /p>
回答各种评论
- 我需要找到精确的模式
- 同一行中的顺序无关紧要(如果它使事情变得更容易,则可以在每一行中对值进行排序)
- 线必须相邻.
- 我想获取所有返回的模式的(开始)位置(即,如果该模式在矩阵中多次出现,我希望有多个返回值).
- 用户将通过GUI输入模式,但我尚未决定如何进行.例如,要搜索上述模式,他可能会写类似
37;10,29;42
其中;
代表新行,而,
分隔同一行中的值.
同样,我们可能会寻找
50,51;;75;80,81
n行中的50和51,n + 2行中的75,n + 3行中的80和81
以下是广义函数:
PatternMatcher <- function(data, pattern, idx = NULL) {
p <- unlist(pattern[1])
if(is.null(idx)){
p <- unlist(pattern[length(pattern)])
PatternMatcher(data, rev(pattern)[-1],
idx = Filter(function(n) all(p %in% intersect(data[n, ], p)),
1:nrow(data)))
} else if(length(pattern) > 1) {
PatternMatcher(data, pattern[-1],
idx = Filter(function(n) all(p %in% intersect(data[n, ], p)),
idx - 1))
} else
Filter(function(n) all(p %in% intersect(data[n, ], p)), idx - 1)
}
这是一个递归函数,它将在每次迭代中减少pattern
,并且仅检查在上次迭代中标识的行之后的行.列表结构允许以一种方便的方式传递模式:
PatternMatcher(m, list(37, list(10, 29), 42))
# [1] 57
PatternMatcher(m, list(list(45, 24, 1), 7, list(45, 31), 4))
# [1] 2
PatternMatcher(m, list(1,3))
# [1] 47 48 93
上面函数的想法似乎很好:检查向量pattern[[1]]
的所有行并获取索引r1
,然后检查r1+1
的行pattern[[2]]
并获取r2
等.但是遍历所有行时,第一步确实要花费很多时间.当然,例如m <- matrix(sample(1:10, 800, replace=T), ncol=8)
,即当索引r1
,r2
的变化不大时... ...因此这是另一种方法,此处PatternMatcher
看起来非常相似,但是还有另一个函数matchRow
用于查找具有vector
所有元素的行.
matchRow <- function(data, vector, idx = NULL){
if(is.null(idx)){
matchRow(data, vector[-1],
as.numeric(unique(rownames(which(data == vector[1], arr.ind = TRUE)))))
} else if(length(vector) > 0) {
matchRow(data, vector[-1],
as.numeric(unique(rownames(which(data[idx, , drop = FALSE] == vector[1], arr.ind = TRUE)))))
} else idx
}
PatternMatcher <- function(data, pattern, idx = NULL) {
p <- pattern[[1]]
if(is.null(idx)){
rownames(data) <- 1:nrow(data)
p <- pattern[[length(pattern)]]
PatternMatcher(data, rev(pattern)[-1], idx = matchRow(data, p))
} else if(length(pattern) > 1) {
PatternMatcher(data, pattern[-1], idx = matchRow(data, p, idx - 1))
} else
matchRow(data, p, idx - 1)
}
与上一个功能的比较:
library(rbenchmark)
bigM <- matrix(sample(1:50, 800000, replace=T), ncol=8)
benchmark(PatternMatcher(bigM, list(37, c(10, 29), 42)),
PatternMatcher(bigM, list(1, 3)),
OldPatternMatcher(bigM, list(37, list(10, 29), 42)),
OldPatternMatcher(bigM, list(1, 3)),
replications = 10,
columns = c("test", "elapsed"))
# test elapsed
# 4 OldPatternMatcher(bigM, list(1, 3)) 61.14
# 3 OldPatternMatcher(bigM, list(37, list(10, 29), 42)) 63.28
# 2 PatternMatcher(bigM, list(1, 3)) 1.58
# 1 PatternMatcher(bigM, list(37, c(10, 29), 42)) 2.02
verybigM1 <- matrix(sample(1:40, 8000000, replace=T), ncol=20)
verybigM2 <- matrix(sample(1:140, 8000000, replace=T), ncol=20)
benchmark(PatternMatcher(verybigM1, list(37, c(10, 29), 42)),
PatternMatcher(verybigM2, list(37, c(10, 29), 42)),
find.combo(verybigM1, convert.gui.input("37;10,29;42")),
find.combo(verybigM2, convert.gui.input("37;10,29;42")),
replications = 20,
columns = c("test", "elapsed"))
# test elapsed
# 3 find.combo(verybigM1, convert.gui.input("37;10,29;42")) 17.55
# 4 find.combo(verybigM2, convert.gui.input("37;10,29;42")) 18.72
# 1 PatternMatcher(verybigM1, list(37, c(10, 29), 42)) 15.84
# 2 PatternMatcher(verybigM2, list(37, c(10, 29), 42)) 19.62
现在,pattern
参数应该类似于list(37, c(10, 29), 42)
而不是list(37, list(10, 29), 42)
.最后:
fastPattern <- function(data, pattern)
PatternMatcher(data, lapply(strsplit(pattern, ";")[[1]],
function(i) as.numeric(unlist(strsplit(i, split = ",")))))
fastPattern(m, "37;10,29;42")
# [1] 57
fastPattern(m, "37;;42")
# [1] 57 4
fastPattern(m, "37;;;42")
# [1] 33 56 77
I have a 8 x n matrix, for instance
set.seed(12345)
m <- matrix(sample(1:50, 800, replace=T), ncol=8)
head(m)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 37 15 30 3 4 11 35 31
[2,] 44 31 45 30 24 39 1 18
[3,] 39 49 7 36 14 43 26 24
[4,] 45 31 26 33 12 47 37 15
[5,] 23 27 34 29 30 34 17 4
[6,] 9 46 39 34 8 43 42 37
I would like to find a certain pattern in the matrix, for instance I would like to know where I can find a 37, followed in the next line by a 10 and a 29 and the line after by a 42
This happens, for instance, in lines 57:59 of the above matrix
m[57:59,]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] *37 35 1 30 47 9 12 39
[2,] 5 22 *10 *29 13 5 17 36
[3,] 22 43 6 2 27 35 *42 50
A (probably inefficient) solution is to get all the lines containing 37 with
sapply(1:nrow(m), function(x){37 %in% m[x,]})
And then use a few loops to test the other conditions.
How could I write an efficient function to do this, that can be generalized to any user-given pattern (not necessarily over 3 lines, with possible "holes", with variable number of values in each line etc).?
EDIT: to answer various comments
- I need to find the EXACT pattern
- The order in the same row does not matter (if it makes things easier values can be ordered in each row)
- The lines have to be adjacent.
- I want to get the (starting) position of all the pattern returned (i.e., if the pattern is present multiple times in the matrix I want multiple return values).
- The user would enter the pattern via a GUI, I have yet to decide how. For instance, to search for the above pattern he may write something like
37;10,29;42
Where ;
represents a new line and ,
separates values on the same line.
Similarly we may look for
50,51;;75;80,81
Meaning 50 and 51 in line n, 75 in line n+2, and 80 and 81 in line n+3
Here is a generalized function:
PatternMatcher <- function(data, pattern, idx = NULL) {
p <- unlist(pattern[1])
if(is.null(idx)){
p <- unlist(pattern[length(pattern)])
PatternMatcher(data, rev(pattern)[-1],
idx = Filter(function(n) all(p %in% intersect(data[n, ], p)),
1:nrow(data)))
} else if(length(pattern) > 1) {
PatternMatcher(data, pattern[-1],
idx = Filter(function(n) all(p %in% intersect(data[n, ], p)),
idx - 1))
} else
Filter(function(n) all(p %in% intersect(data[n, ], p)), idx - 1)
}
This is a recursive function which is reducing pattern
in every iteration and checks only rows that go right after ones identified in the previous iteration. List structure allows passing the pattern in a convenient way:
PatternMatcher(m, list(37, list(10, 29), 42))
# [1] 57
PatternMatcher(m, list(list(45, 24, 1), 7, list(45, 31), 4))
# [1] 2
PatternMatcher(m, list(1,3))
# [1] 47 48 93
Edit: The idea of the function above seems fine: check all rows for the vector pattern[[1]]
and get indices r1
, then check rows r1+1
for pattern[[2]]
and get r2
, etc. But it takes really much time at the first step when going through all rows. Of course, every step would take much time with e.g. m <- matrix(sample(1:10, 800, replace=T), ncol=8)
, i.e. when there is not much of a change in indices r1
, r2
, ... So here is another approach, here PatternMatcher
looks very similar, but there is another function matchRow
for finding rows that have all elements of vector
.
matchRow <- function(data, vector, idx = NULL){
if(is.null(idx)){
matchRow(data, vector[-1],
as.numeric(unique(rownames(which(data == vector[1], arr.ind = TRUE)))))
} else if(length(vector) > 0) {
matchRow(data, vector[-1],
as.numeric(unique(rownames(which(data[idx, , drop = FALSE] == vector[1], arr.ind = TRUE)))))
} else idx
}
PatternMatcher <- function(data, pattern, idx = NULL) {
p <- pattern[[1]]
if(is.null(idx)){
rownames(data) <- 1:nrow(data)
p <- pattern[[length(pattern)]]
PatternMatcher(data, rev(pattern)[-1], idx = matchRow(data, p))
} else if(length(pattern) > 1) {
PatternMatcher(data, pattern[-1], idx = matchRow(data, p, idx - 1))
} else
matchRow(data, p, idx - 1)
}
Comparison with the previous function:
library(rbenchmark)
bigM <- matrix(sample(1:50, 800000, replace=T), ncol=8)
benchmark(PatternMatcher(bigM, list(37, c(10, 29), 42)),
PatternMatcher(bigM, list(1, 3)),
OldPatternMatcher(bigM, list(37, list(10, 29), 42)),
OldPatternMatcher(bigM, list(1, 3)),
replications = 10,
columns = c("test", "elapsed"))
# test elapsed
# 4 OldPatternMatcher(bigM, list(1, 3)) 61.14
# 3 OldPatternMatcher(bigM, list(37, list(10, 29), 42)) 63.28
# 2 PatternMatcher(bigM, list(1, 3)) 1.58
# 1 PatternMatcher(bigM, list(37, c(10, 29), 42)) 2.02
verybigM1 <- matrix(sample(1:40, 8000000, replace=T), ncol=20)
verybigM2 <- matrix(sample(1:140, 8000000, replace=T), ncol=20)
benchmark(PatternMatcher(verybigM1, list(37, c(10, 29), 42)),
PatternMatcher(verybigM2, list(37, c(10, 29), 42)),
find.combo(verybigM1, convert.gui.input("37;10,29;42")),
find.combo(verybigM2, convert.gui.input("37;10,29;42")),
replications = 20,
columns = c("test", "elapsed"))
# test elapsed
# 3 find.combo(verybigM1, convert.gui.input("37;10,29;42")) 17.55
# 4 find.combo(verybigM2, convert.gui.input("37;10,29;42")) 18.72
# 1 PatternMatcher(verybigM1, list(37, c(10, 29), 42)) 15.84
# 2 PatternMatcher(verybigM2, list(37, c(10, 29), 42)) 19.62
Also now the pattern
argument should be like list(37, c(10, 29), 42)
instead of list(37, list(10, 29), 42)
. And finally:
fastPattern <- function(data, pattern)
PatternMatcher(data, lapply(strsplit(pattern, ";")[[1]],
function(i) as.numeric(unlist(strsplit(i, split = ",")))))
fastPattern(m, "37;10,29;42")
# [1] 57
fastPattern(m, "37;;42")
# [1] 57 4
fastPattern(m, "37;;;42")
# [1] 33 56 77
这篇关于在R中的矩阵中查找模式的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!