Matching of patterns in R

Question

I have a data.frame with two rows and 20 columns, where each column holds a single character. It looks roughly like this (columns are run together here for clarity):

        Cols 1-20
  row1  ghuytuthjilujshdftgu 
  row2  ghuytuthjilujshdftgu

I want a mechanism that compares these two strings character by character (column by column), starting from position 10 and scanning outwards, and returns the number of matching characters found before the first difference. In this case both rows are identical, so the answer would be 20. Importantly, even when the rows are completely identical, as above, there should be no error message; the count should simply be returned.

With this alternate example, the answer should be 12:

    Cols 1-20
row1  ghuytuthjilujshdftgu 
row2  XXXXXXXXjilujshdftgu

Here is some code to generate the data frames:

r1 <- "ghuytuthjilujshdftgu"
r2 <- "ghuytuthjilujshdftgu"
df1 <- as.data.frame(rbind(unlist(strsplit(r1, "")), unlist(strsplit(r2, ""))))

r1 <- "ghuytuthjilujshdftgu"
r2 <- "XXXXXXXXjilujshdftgu"
df1 <- as.data.frame(rbind(unlist(strsplit(r1, "")), unlist(strsplit(r2, ""))))

Edit:

The class of the object is data.frame and it is subsettable, with dim = 2, 20 (each column, i.e. each character, is accessible on its own).

Recommended answer

Here is an answer that splits the df into two pieces, left and right of the center, reversing the left piece so that it runs from the center back to the first column. It then counts the matching length in each piece using cumsum and NA: the cumulative sum turns to NA as soon as there is a mismatch, so the last index that is not NA marks the longest stretch starting from the center without a mismatch.

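sim_len <- function(df, center = floor(ncol(df) / 2)) {
  # Split at the center: left half reversed (center to column 1), right half as-is
  dfs <- list(df[, max(center, 1):1, drop = FALSE], df[, center:ncol(df), drop = FALSE])
  df.count <- lapply(dfs, function(df) {
    # Running count of matches; becomes NA from the first mismatch onward
    diff <- cumsum(ifelse(df[1, ] == df[2, ], 1, NA_integer_))
    # The last non-NA value is the length of the matching run
    diff[max(which(!is.na(diff)))]
  })
  # The center column is counted in both halves, so subtract 1
  max(0L, sum(unlist(df.count)) - 1L)
}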
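To see the cumsum-and-NA step in isolation, here is a minimal sketch on a hand-made logical vector. The names `eq`, `steps`, and `running` are illustrative and do not appear in the answer; `eq` stands in for the result of comparing the two rows, ordered from the center outward.

```r
# Illustrative comparison result, ordered from the center outward;
# the third position is a mismatch.
eq <- c(TRUE, TRUE, FALSE, TRUE)

# Matches become 1, mismatches become NA.
steps <- ifelse(eq, 1, NA_integer_)

# cumsum() yields NA from the first NA onward, so later matches are ignored.
running <- cumsum(steps)
running
# [1]  1  2 NA NA

# The last non-NA entry is the length of the run that starts at the center.
run_length <- running[max(which(!is.na(running)))]
run_length
# [1] 2
```

Note that the fourth position matches again but is not counted, which is the behavior the question asks for.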

Here are some examples of running it (the as.data.frame business just creates the data frame from the character strings). Note that the "center" column is counted twice, hence the -1L in the final line of the function.

r1 <- "ghuytuthjilujshdftgu"
r2 <- "ghuytuthjilujshdftgu"
df1 <- as.data.frame(rbind(unlist(strsplit(r1, "")), unlist(strsplit(r2, ""))))
sim_len(df1)
# [1] 20

r1 <- "ghuytut3jilujshdftgu"
r2 <- "ghuytuthjilujshdftgu"
df2 <- as.data.frame(rbind(unlist(strsplit(r1, "")), unlist(strsplit(r2, ""))))
sim_len(df2)
# [1] 12

r1 <- "ghuytut3jilujshdftgu"
r2 <- "ghuytuthjilujxhdftgu"
df3 <- as.data.frame(rbind(unlist(strsplit(r1, "")), unlist(strsplit(r2, ""))))
sim_len(df3)
# [1] 5

r1 <- "ghuytut3xilujshdftgu"
r2 <- "ghuytuthjixujxhdftgu"
df4 <- as.data.frame(rbind(unlist(strsplit(r1, "")), unlist(strsplit(r2, ""))))
sim_len(df4)
# [1] 1
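Since the data frame construction is repeated in every example, it could be wrapped in a small helper. The sketch below is an assumption for illustration: `make_df` is a hypothetical name, not part of the answer, and it assumes both strings have the same length. Passing `stringsAsFactors = FALSE` keeps the columns as character regardless of R version.

```r
# Hypothetical helper (not in the original answer): builds a two-row
# data frame with one column per character.
make_df <- function(s1, s2) {
  # Assumption: both strings must have the same number of characters.
  stopifnot(nchar(s1) == nchar(s2))
  as.data.frame(
    rbind(strsplit(s1, "")[[1]], strsplit(s2, "")[[1]]),
    stringsAsFactors = FALSE
  )
}

df_example <- make_df("ghuytut3jilujshdftgu", "ghuytuthjilujshdftgu")
dim(df_example)
# [1]  2 20
```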

A variation that reports both the left and right counts. Note that the "center" column is counted in both left and right, so the sum of left + right is 1 greater than the value reported by the original function:

sim_len2 <- function(df, center = floor(ncol(df) / 2)) {
  dfs <- list(left = df[, max(center, 1):1, drop = FALSE], right = df[, center:ncol(df), drop = FALSE])
  vapply(
    dfs,
    function(df) {
      diff <- cumsum(ifelse(df[1, ] == df[2, ], 1, NA_integer_))
      diff[max(which(!is.na(diff)))]
    },
    numeric(1L)
  )
}
sim_len2(df1)
# left right 
#   10    11
sim_len2(df4, 4)
# left right 
#    4     4 
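For comparison, the same left and right counts can be sketched directly on plain strings, without building a data frame. This is an alternative under stated assumptions, not the answer's method: `run_len` and `sim_len_vec` are hypothetical names, and the sketch assumes two non-empty strings of equal length and a center of at least 1. It locates the first mismatch with `which()` instead of cumsum, so a mismatch at the center itself gives counts of 0 rather than going through `max()` on an empty vector.

```r
# Hypothetical helper: length of the run of TRUEs at the start of `eq`,
# where `eq` is ordered from the center outward.
run_len <- function(eq) {
  first_miss <- which(!eq)[1]  # NA when there is no mismatch at all
  if (is.na(first_miss)) length(eq) else first_miss - 1L
}

# Hypothetical string-based variant reporting left, right, and total counts.
sim_len_vec <- function(s1, s2, center = floor(nchar(s1) / 2)) {
  # Assumption: s1 and s2 are non-empty and have the same length.
  eq <- strsplit(s1, "")[[1]] == strsplit(s2, "")[[1]]
  left  <- run_len(eq[center:1])           # center back to position 1
  right <- run_len(eq[center:length(eq)])  # center forward to the end
  # The center is counted on both sides, so subtract 1 for the total.
  c(left = left, right = right, total = max(0L, left + right - 1L))
}

sim_len_vec("ghuytuthjilujshdftgu", "ghuytuthjilujshdftgu")
#  left right total 
#    10    11    20 
sim_len_vec("ghuytut3jilujshdftgu", "ghuytuthjilujxhdftgu")
#  left right total 
#     2     4     5 
```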
