Identifying substrings based on complex rules

Question

Assume I have text strings that look something like this:

A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3

Here I want to identify sequences of markers (A is a marker, I3 is a marker, etc.) that lead up to a subsequence consisting only of IX markers (i.e. I1, I2, or I3) that contains an I3. This subsequence can have a length of 1 (i.e. be a single I3 marker) or it can be of unlimited length, but it always needs to contain at least one I3 marker and can only contain IX markers. The subsequence that leads up to the IX subsequence may include I1 and I2, but never I3.

In the string above I need to identify:

A-B-C-I1-I2-D-E-F

which leads up to the I1-I3 subsequence that contains an I3, and:

D-D-D-D

which leads up to the I1-I1-I2-I1-I1-I3-I3 subsequence that contains at least 1 I3.

Here are a few additional examples:

A-B-I3-C-I3

From this string we should identify A-B, because it is followed by a subsequence of length 1 that contains I3, and also C, because it is likewise followed by a subsequence of length 1 that contains I3.

And:

I3-A-I3

Here A should be identified, because it is followed by a subsequence of length 1 that contains I3. The first I3 itself will not be identified, because we are only interested in subsequences that are followed by a subsequence of IX markers that contains I3.

How can I write a generic function/regex that accomplishes this task?

Recommended answer

Use strsplit, splitting on every run of IX markers that contains an I3:

> x <- "A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3"
> strsplit(x, "(?:-?I\\d+)*-?\\bI3\\b-?(?:I\\d+-?)*")
[[1]]
[1] "A-B-C-I1-I2-D-E-F" "D-D-D-D"

> strsplit("A-B-I3-C-I3", "(?:-?I\\d+)*-?\\bI3\\b-?(?:I\\d+-?)*")
[[1]]
[1] "A-B" "C" 

> strsplit("A-B-I3-C-I3", "(?:-?I\\d+)*-?\\bI3\\b-?(?:I3-?)*")
[[1]]
[1] "A-B" "C"
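
For comparison, here is a minimal token-based sketch of the same rules that avoids a single large regex. It is not part of the original answer: the function name `find_leadups` and its `sep` argument are hypothetical, and it assumes that an IX marker is the letter I followed by digits. It differs from the strsplit calls above in two edge cases that follow from the rules in the question: a string that starts with I3 produces no empty leading element, and markers after the last I3 run are not returned, since nothing containing I3 follows them.

```r
# Hypothetical helper (not from the original answer): returns the marker
# sequences that lead up to a run of IX markers containing at least one I3.
find_leadups <- function(x, sep = "-") {
  tokens <- strsplit(x, sep, fixed = TRUE)[[1]]
  if (length(tokens) == 0) return(character(0))

  # Assumption: an IX marker is "I" followed by one or more digits.
  is_ix <- grepl("^I[0-9]+$", tokens)

  # Label each maximal run of IX / non-IX tokens with a run id.
  runs <- rle(is_ix)
  run_id <- rep(seq_along(runs$lengths), runs$lengths)

  # A run is a "target" run if it contains an I3 (such a run is always IX-only).
  run_has_i3 <- unname(vapply(split(tokens == "I3", run_id), any, logical(1)))
  in_target <- run_has_i3[run_id]

  # Split the tokens into alternating target / non-target segments.
  seg <- rle(in_target)
  seg_end <- cumsum(seg$lengths)
  seg_start <- seg_end - seg$lengths + 1
  n_seg <- length(seg$values)

  # Keep non-target segments that are followed by a target segment;
  # because segments alternate, that is every non-target segment but the last.
  keep <- which(!seg$values & seq_len(n_seg) < n_seg)
  vapply(keep,
         function(i) paste(tokens[seg_start[i]:seg_end[i]], collapse = sep),
         character(1))
}

find_leadups("A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3")
# [1] "A-B-C-I1-I2-D-E-F" "D-D-D-D"
find_leadups("A-B-I3-C-I3")
# [1] "A-B" "C"
find_leadups("I3-A-I3")
# [1] "A"
```

Working on tokens keeps the I1/I2 runs that contain no I3 inside the lead-up, as the question requires, without having to encode that in the pattern.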
