Identify a common pattern


Problem description

Is there an (easy) way to identify a common pattern that two strings share? Here is a small example to make clear what I mean:

I have two variables, each containing a string. Both include the same pattern ("ABC") as well as some "noise".

a <- "xxxxxxxxxxxABCxxxxxxxxxxxx"
b <- "yyyyyyyyyyyyyyyyyyyyyyyABC"

Let's say I don't know the common pattern and I want R to find out that both strings contain "ABC". How can I do this?

*Edit

The first example was maybe a bit too simplistic. Here is an example from my real data.

a <- "DUISBURG-HAMBORNS"
b <- "DUISBURG (-31.7.29)S"

Both strings contain "DUISBURG", which is what I want the function to identify.

*Edit

I tried the solution proposed in the link posted in the comments, but it still does not give me exactly what I want.

library(qualV)
LCS(strsplit(a[1], '')[[1]],strsplit(b[1], '')[[1]])$LCS

[1] "D" "U" "I" "S" "B" "U" "R" "G" "-" " " " " "S"

If the function is looking for the longest common subsequence of the two vectors, why does it not stop after "D" "U" "I" "S" "B" "U" "R" "G"?

Recommended answer

The LCS function from the qualV package (suggested in "Find common substrings between two character variables", which is not actually a duplicate of this question) does something other than what you need. It solves the longest common subsequence problem, in which a subsequence is not required to occupy consecutive positions within the original sequences. That is why the result continues past "DUISBURG": the characters after it also occur in both strings in the same order, just not adjacently.
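To make the distinction concrete, here is a minimal sketch in base R. The helper names is_subsequence and is_substring are hypothetical (they are not part of qualV or base R); they only illustrate that "DUISBURG-S" is a common subsequence of both example strings without being a contiguous substring of either.

```r
# Hypothetical helper: TRUE if the characters of `candidate` appear in `text`
# in the same order, not necessarily next to each other.
is_subsequence <- function(candidate, text) {
  cand  <- strsplit(candidate, "")[[1]]
  chars <- strsplit(text, "")[[1]]
  pos <- 0
  for (ch in cand) {
    hits <- which(chars == ch)
    hits <- hits[hits > pos]          # only look to the right of the last match
    if (length(hits) == 0) return(FALSE)
    pos <- hits[1]
  }
  TRUE
}

# Hypothetical helper: TRUE if `candidate` occurs contiguously in `text`.
is_substring <- function(candidate, text) {
  grepl(candidate, text, fixed = TRUE)
}

a <- "DUISBURG-HAMBORNS"
b <- "DUISBURG (-31.7.29)S"

is_subsequence("DUISBURG-S", a)  # TRUE
is_subsequence("DUISBURG-S", b)  # TRUE
is_substring("DUISBURG-S", a)    # FALSE
is_substring("DUISBURG", a)      # TRUE
is_substring("DUISBURG", b)      # TRUE
```

A subsequence search is therefore free to pick up stray matching characters such as "-" and "S" after the shared word, which a substring search cannot do.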

What you have instead is the longest common substring problem, which can be solved with a standard dynamic-programming algorithm. Here is the code, assuming that the longest common substring is unique (in terms of length):

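a <- "WWDUISBURG-HAMBORNS"
b <- "QQQQQQDUISBURG (-31.7.29)S"

# Split both strings into vectors of single characters
A <- strsplit(a, "")[[1]]
B <- strsplit(b, "")[[1]]

# L[i, j] holds the length of the common substring ending at A[i] and B[j]
L <- matrix(0, length(A), length(B))
# Positions (i, j) where A[i] == B[j], processed in increasing row order
ones <- which(outer(A, B, "=="), arr.ind = TRUE)
ones <- ones[order(ones[, 1]), , drop = FALSE]
for(i in seq_len(nrow(ones))) {
  v <- ones[i, , drop = FALSE]
  # A match in the first row or column starts a run; others extend the diagonal
  L[v] <- ifelse(any(v == 1), 1, L[v - 1] + 1)
}
# Walk back max(L) characters from the row where the longest run ends
paste0(A[(-max(L) + 1):0 + which(L == max(L), arr.ind = TRUE)[1]], collapse = "")
# [1] "DUISBURG"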
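If you need this for many pairs of strings, the same idea can be wrapped in a function. The following is only a sketch under a few assumptions: the name longest_common_substring is hypothetical, it uses plain nested loops instead of outer(), it pads the table with a zero row and column to avoid the first-row/column special case, and when several substrings tie for the longest length it returns the one that ends first in a. It returns "" when the strings share no character.

```r
# Hypothetical wrapper around the dynamic-programming idea shown above.
longest_common_substring <- function(a, b) {
  A <- strsplit(a, "")[[1]]
  B <- strsplit(b, "")[[1]]
  if (length(A) == 0 || length(B) == 0) return("")

  # L[i + 1, j + 1] = length of the common substring ending at A[i] and B[j];
  # the extra zero row and column let us read L[i, j] without bounds checks.
  L <- matrix(0L, length(A) + 1, length(B) + 1)
  best_len <- 0L   # length of the longest run found so far
  best_end <- 0L   # index in A where that run ends

  for (i in seq_along(A)) {
    for (j in seq_along(B)) {
      if (A[i] == B[j]) {
        L[i + 1, j + 1] <- L[i, j] + 1L
        if (L[i + 1, j + 1] > best_len) {
          best_len <- L[i + 1, j + 1]
          best_end <- i
        }
      }
    }
  }

  if (best_len == 0L) return("")
  paste0(A[(best_end - best_len + 1):best_end], collapse = "")
}

longest_common_substring("WWDUISBURG-HAMBORNS", "QQQQQQDUISBURG (-31.7.29)S")
# [1] "DUISBURG"
longest_common_substring("xxxxxxxxxxxABCxxxxxxxxxxxx", "yyyyyyyyyyyyyyyyyyyyyyyABC")
# [1] "ABC"
```

The nested loops take time proportional to nchar(a) * nchar(b), which is fine for short strings such as place names but may be slow for very long inputs.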
