Find length of overlap in strings

Views: 66 Published: 2021/4/15 19:46:22 Tags: r string bioinformatics overlap dna-sequence


Question

Do you know of any ready-to-use method to obtain the overlap of two strings and also its length? It should use only R, maybe something from stringr? I have been searching, unfortunately without success.

str1 <- 'ABCDE'
str2 <- 'CDEFG'

str_overlap(str1, str2)
'CDE'

str_overlap_len(str1, str2)
3

Other examples:

str1 <- 'ATTAGACCTG'
str2 <- 'CCTGCCGGAA'

str_overlap(str1, str2)
'CCTG'

str_overlap_len(str1, str2)
4

///

str1 <- 'foobarandfoo'
str2 <- 'barand'

str_overlap(str1, str2)
'barand'

str_overlap_len(str1, str2)
6

/// Yes, there are two solutions here; always pick the longest overlap

str1 <- 'EFGABCDE'
str2 <- 'ABCDECDE'

str_overlap(str1, str2)
'ABCDE'

str_overlap_len(str1, str2)
5

I was wondering about a small homemade function for this, such as this one?

Recommended answer

It seems to me that you (OP) are not very concerned with the performance of the code but are more interested in a potential approach to solve it without ready-made functions. So here is an example I came up with to compute the longest common substring. I have to note that this only returns the first longest common substring found, even when there can be several of the same length. This is something you could modify to fit your needs. And please don't expect this to be super fast; it won't be.

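foo <- function(str1, str2, ignore.case = FALSE, verbose = FALSE) {

  # Optionally compare case-insensitively
  if(ignore.case) {
    str1 <- tolower(str1)
    str2 <- tolower(str2)
  }

  # Make sure str2 is the shorter string; candidates are taken from it
  if(nchar(str1) < nchar(str2)) {
    x <- str2
    str2 <- str1
    str1 <- x
  }

  # Build the index vectors 1, 1:2, ..., 1:n over the characters of str2
  x <- strsplit(str2, "")[[1L]]
  n <- length(x)
  s <- sequence(seq_len(n))
  s <- split(s, cumsum(s == 1L))
  s <- rep(list(s), n)

  # Shift the index vectors to every start position i, dropping indices past n
  for(i in seq_along(s)) {
    s[[i]] <- lapply(s[[i]], function(x) {
      x <- x + (i-1L)
      x[x <= n]
    })
    s[[i]] <- unique(s[[i]])
  }

  # Flatten, then order the candidate index vectors from longest to shortest
  s <- unlist(s, recursive = FALSE)
  s <- unique(s[order(-lengths(s))])

  # Test the candidates in turn; the first match is a longest common substring
  i <- 1L
  len_s <- length(s)
  while(i <= len_s) {  # <= (not <) so that the last candidate is tested too
    lcs <- paste(x[s[[i]]], collapse = "")
    if(verbose) cat("now checking:", lcs, "\n")
    check <- grepl(lcs, str1, fixed = TRUE)
    if(check) {
      cat("the (first) longest common substring is:", lcs, "of length", nchar(lcs), "\n")
      break
    } else {
      i <- i + 1L
    }
  }
}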

str1 <- 'ABCDE'
str2 <- 'CDEFG'
foo(str1, str2)
# the (first) longest common substring is: CDE of length 3 

str1 <- 'ATTAGACCTG'
str2 <- 'CCTGCCGGAA'
foo(str1, str2)
# the (first) longest common substring is: CCTG of length 4

str1 <- 'foobarandfoo'
str2 <- 'barand'
foo(str1, str2)
# the (first) longest common substring is: barand of length 6 

str1 <- 'EFGABCDE'
str2 <- 'ABCDECDE'
foo(str1, str2)
# the (first) longest common substring is: ABCDE of length 5 
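
Note that foo() prints its result with cat() rather than returning it, whereas the question asks for a string and a length. The following is a minimal sketch of how such return values might be obtained in base R. The names str_overlap and str_overlap_len are taken from the question and are hypothetical (they are not functions from stringr or any other package), and the search is a simpler rewrite of the same idea (try candidates from longest to shortest), not the code above.

```r
# Hypothetical helper (name taken from the question, not from any package):
# returns the first longest common substring, or "" if nothing is shared.
str_overlap <- function(str1, str2) {
  # Take candidates from the shorter string.
  if (nchar(str1) < nchar(str2)) {
    tmp <- str1
    str1 <- str2
    str2 <- tmp
  }
  n <- nchar(str2)
  # Try candidate lengths from longest to shortest.
  for (len in rev(seq_len(n))) {
    # Every start position that leaves room for a substring of this length.
    starts <- seq_len(n - len + 1L)
    candidates <- substring(str2, starts, starts + len - 1L)
    # fixed = TRUE matches literally instead of as a regular expression.
    hit <- vapply(candidates, grepl, logical(1), x = str1, fixed = TRUE,
                  USE.NAMES = FALSE)
    if (any(hit)) return(candidates[which(hit)[1L]])
  }
  ""
}

# Hypothetical helper: length of that overlap.
str_overlap_len <- function(str1, str2) nchar(str_overlap(str1, str2))

str_overlap("ABCDE", "CDEFG")
# [1] "CDE"
str_overlap_len("EFGABCDE", "ABCDECDE")
# [1] 5
```

Like foo(), this sketch is not meant to be fast: it may call grepl() once per candidate substring.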


set.seed(2018)
str1 <- paste(sample(c(LETTERS, letters), 500, TRUE), collapse = "")
str2 <- paste(sample(c(LETTERS, letters), 250, TRUE), collapse = "")

foo(str1, str2, ignore.case = TRUE)
# the (first) longest common substring is: oba of length 3 

foo(str1, str2, ignore.case = FALSE)
# the (first) longest common substring is: Vh of length 2
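
As noted in the answer, only the first longest common substring is reported even when several share the same length. One possible modification, sketched here under the assumption that all ties should be returned as a character vector, is shown below. The function name all_longest_common is hypothetical, and the code is a standalone base R rewrite rather than a patch to foo().

```r
# Hypothetical helper: returns every distinct longest common substring,
# or character(0) if the two strings share no character.
all_longest_common <- function(str1, str2) {
  # Take candidates from the shorter string.
  if (nchar(str1) < nchar(str2)) {
    tmp <- str1
    str1 <- str2
    str2 <- tmp
  }
  n <- nchar(str2)
  # Try candidate lengths from longest to shortest.
  for (len in rev(seq_len(n))) {
    starts <- seq_len(n - len + 1L)
    # unique() drops repeated candidates of the same length.
    candidates <- unique(substring(str2, starts, starts + len - 1L))
    hit <- vapply(candidates, grepl, logical(1), x = str1, fixed = TRUE,
                  USE.NAMES = FALSE)
    # The first length with any match is the maximum; keep all its matches.
    if (any(hit)) return(candidates[hit])
  }
  character(0)
}

all_longest_common("ABCxyDEF", "DEFzzABC")
# [1] "DEF" "ABC"
```

The results are ordered by their position in the shorter string (or in the second argument when both have the same length).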
