Overlapping matches in R

Question

I searched and found this forum discussion on achieving the effect of overlapping matches.

I also found the following SO question about finding indexes to perform this task, but I could not find anything concise about grabbing overlapping matches in R.

In almost any language that supports PCRE, I can perform this task by using a positive lookahead assertion with a capturing group inside the lookahead to capture the overlapping matches.

However, when I do this in R with perl=T, the same way I would in other languages, only empty strings are returned.

> x <- 'ACCACCACCAC'
> regmatches(x, gregexpr('(?=([AC]C))', x, perl=T))[[1]]
[1] "" "" "" "" "" "" ""

The same goes for both the stringi and stringr packages.

> library(stringi)
> library(stringr)
> stri_extract_all_regex(x, '(?=([AC]C))')[[1]]
[1] "" "" "" "" "" "" ""
> str_extract_all(x, perl('(?=([AC]C))'))[[1]]
[1] "" "" "" "" "" "" ""

The correct results that should be returned are:

[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"

Edit

  1. I am well aware that regmatches does not work well with captured matches, but what exactly causes this behavior in regmatches, and why are no results returned? I am looking for a somewhat detailed answer.

  2. Are the stringi and stringr packages not capable of performing this in place of regmatches?

Please feel free to add to my answer or come up with a different workaround than the one I have found.

Recommended answer

The standard regmatches does not work well with captured matches (specifically, multiple captured matches in the same string). In this case, since you're "matching" a lookahead (ignoring the capture), the match itself is zero-length. There is also a regmatches()<- function that may illustrate this. Observe:

x <- 'ACCACCACCAC'
m <- gregexpr('(?=([AC]C))', x, perl=T)
regmatches(x, m) <- "~"
x
# [1] "~A~CC~A~CC~A~CC~AC"

Notice how all the letters are preserved; we've just replaced the locations of the zero-length matches with something we can observe.
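
The zero-length matches can also be inspected directly on the object returned by gregexpr. The following is a minimal sketch using only base R; the variable names (m, starts, widths) are chosen here for illustration. It shows that each match has a start position but a length of 0, which would explain why regmatches returns empty strings.

```r
x <- 'ACCACCACCAC'

# Take the match data for the first (and only) input string.
m <- gregexpr('(?=([AC]C))', x, perl = TRUE)[[1]]

# Start position of each lookahead match.
starts <- as.integer(m)

# Length of each match; a lookahead consumes no characters.
widths <- attr(m, "match.length")

starts  # 1 2 4 5 7 8 10
widths  # 0 0 0 0 0 0 0
```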

I've created a regcapturedmatches() function that I often use for such tasks. For example:

x <- 'ACCACCACCAC'
regcapturedmatches(x, gregexpr('(?=([AC]C))', x, perl=T))[[1]]

#      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] "AC" "CC" "AC" "CC" "AC" "CC" "AC"

gregexpr is grabbing all the data just fine, so you can extract it from that object any way you like if you prefer not to use this helper function.
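
As a rough sketch of extracting the data without the helper, the capture positions can be read from the "capture.start" and "capture.length" attributes that gregexpr attaches when perl=TRUE and the pattern contains capture groups. This assumes a single input string and a single capturing group; the variable names (cap_start, cap_len, captured) are illustrative, not part of the original answer.

```r
x <- 'ACCACCACCAC'
m <- gregexpr('(?=([AC]C))', x, perl = TRUE)[[1]]

# Each attribute is a matrix with one row per match and one column per
# capture group; column 1 is the only group in this pattern.
cap_start <- attr(m, "capture.start")[, 1]
cap_len   <- attr(m, "capture.length")[, 1]

# Cut each captured piece out of the original string.
captured <- substring(x, cap_start, cap_start + cap_len - 1)
captured
# [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
```

With several capture groups or several input strings, the same attributes would need to be handled per column and per list element.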
