Ignore part of a string when splitting using regular expression in R


Problem description

I'm trying to split a string in R (using strsplit) at specific points (dashes, -), but not when the dash is inside square brackets ([ ]).

Example:

xx <- c("Radio Stations-Listened to Past Week-Toronto [FM-CFXJ-93.5 (93.5 The Move)]","Total Internet-Time Spent Online-Past 7 Days")
xx
  [1] "Radio Stations-Listened to Past Week-Toronto [FM-CFXJ-93.5 (93.5 The Move)]"
  [2] "Total Internet-Time Spent Online-Past 7 Days" 

should give me something like:

list(c("Radio Stations","Listened to Past Week","Toronto [FM-CFXJ-93.5 (93.5 The Move)]"), c("Total Internet","Time Spent Online","Past 7 Days"))
  [[1]]
  [1] "Radio Stations"                         "Listened to Past Week"                 
  [3] "Toronto [FM-CFXJ-93.5 (93.5 The Move)]"

  [[2]]
  [1] "Total Internet"    "Time Spent Online" "Past 7 Days"  

Is there a way to do this with a regular expression? The position and number of dashes vary within each element of the vector, and there are not always brackets. However, when there are brackets, they are always at the end.

I've tried different things, but none of them work:

## Trying to match "-" before "[" in Perl
strsplit(xx, split = "-(?=\\[)", perl=T)
# does nothing

## trying to first extract what follow "[" then splitting what is preceding that
temp <- strsplit(xx, "[", fixed = T)
temp <- lapply(temp, function(yy) substr(head(yy, -1),"-"))
# doesn't work as there are some elements with no brackets...

Any help would be appreciated.

Recommended answer

To match a - that is not inside [ and ], you must match the part of the string that is enclosed in [ and ] and discard it, and match - in all other contexts. In abc-def], the - is not between a [ and a ], so according to the requirements it should be split on.

This is done with the following regex:

\[[^][]*](*SKIP)(*FAIL)|-

Here,

  • \[ - matches a [
  • [^][]* - zero or more chars other than [ and ] (if you use [^]] it will match any char but ])
  • ] - a literal ]
  • (*SKIP)(*FAIL) - PCRE verbs that discard the text matched so far and make the engine resume searching after the end of the discarded text
  • | - or
  • - - a hyphen in any other context.
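
To see which hyphens this pattern actually matches, it can help to list the match positions before splitting. The following is a minimal sketch, assuming the first sample string from the question; the variable names (x, hits, all_hyphens, bracket_pos) are illustrative only.

```r
# Sample string taken from the question.
x <- "Radio Stations-Listened to Past Week-Toronto [FM-CFXJ-93.5 (93.5 The Move)]"
pattern <- "\\[[^][]*](*SKIP)(*FAIL)|-"

# Start positions of every match of the skip/fail pattern (PCRE is required).
hits <- as.integer(gregexpr(pattern, x, perl = TRUE)[[1]])

# Start positions of every hyphen in the string, for comparison.
all_hyphens <- as.integer(gregexpr("-", x, fixed = TRUE)[[1]])

# Position of the opening bracket.
bracket_pos <- as.integer(regexpr("[", x, fixed = TRUE))

hits          # only the hyphens located before the "["
all_hyphens   # includes the hyphens inside [...]
```

The hyphens inside the brackets are consumed by the first alternative and then discarded, so they never show up as matches of the second alternative.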

Or, to also treat substrings like [...[...] as a single bracketed part:

\[[^]]*](*SKIP)(*FAIL)|-
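
The difference between the two character classes only shows up when an extra [ appears inside the brackets. A small sketch, assuming a made-up input string (x is hypothetical and not from the question):

```r
# Hypothetical input with an extra, unclosed "[" inside the bracketed part.
x <- "a-b [c-[d-e]-f"

strict  <- "\\[[^][]*](*SKIP)(*FAIL)|-"  # bracketed part may not contain "[" or "]"
lenient <- "\\[[^]]*](*SKIP)(*FAIL)|-"   # bracketed part may contain "["

res_strict  <- strsplit(x, strict,  perl = TRUE)[[1]]
res_lenient <- strsplit(x, lenient, perl = TRUE)[[1]]

res_strict   # "a" "b [c" "[d-e]" "f"
res_lenient  # "a" "b [c-[d-e]" "f"
```

With the strict class, the first [ cannot start a bracketed match because another [ follows before any ], so the hyphen after it is treated as a split point.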

Or, to account for nested square brackets:

(\[(?:[^][]++|(?1))*])(*SKIP)(*FAIL)|-

Here, (\[(?:[^][]++|(?1))*]) matches and captures a [, then zero or more repetitions of either one or more chars other than [ and ] (with [^][]++) or (|) a recursion into the whole capturing group 1 pattern with (?1) (the whole part between the outer parentheses), and finally a closing ].
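
To check what the recursive group captures on its own, it can be run without the (*SKIP)(*FAIL) part. This is a minimal sketch, assuming a nested-bracket string like the one used in the demo; the names balanced and flat are illustrative.

```r
# String with nested square brackets.
x <- "Toronto [[F[M]]-CFXJ-93.5 (93.5 The Move)]"

# Recursive pattern: matches a balanced [...] group, including nested groups.
recursive_group <- "(\\[(?:[^][]++|(?1))*])"
# Non-recursive pattern: matches only a [...] group with no brackets inside.
flat_group <- "\\[[^][]*]"

# regmatches() with regexpr() extracts the first match of each pattern.
balanced <- regmatches(x, regexpr(recursive_group, x, perl = TRUE))
flat     <- regmatches(x, regexpr(flat_group, x, perl = TRUE))

balanced  # "[[F[M]]-CFXJ-93.5 (93.5 The Move)]"
flat      # "[M]"
```

The recursive version covers the whole outer group, which is why hyphens anywhere inside nested brackets are protected from splitting.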

See the R demo:

xx <- c("abc-def]", "Radio Stations-Listened to Past Week-Toronto [FM-CFXJ-93.5 (93.5 The Move)]","Total Internet-Time Spent Online-Past 7 Days")
pattern <- "\\[[^][]*](*SKIP)(*FAIL)|-"
strsplit(xx, pattern, perl=TRUE)
# [[1]]
# [1] "abc"  "def]"
# [[2]]
# [1] "Radio Stations"                        
# [2] "Listened to Past Week"                 
# [3] "Toronto [FM-CFXJ-93.5 (93.5 The Move)]"
# [[3]]
# [1] "Total Internet"    "Time Spent Online" "Past 7 Days"      

pattern_recursive <- "(\\[(?:[^][]++|(?1))*])(*SKIP)(*FAIL)|-"
xx2 <- c("Radio Stations-Listened to Past Week-Toronto [[F[M]]-CFXJ-93.5 (93.5 The Move)]","Total Internet-Time Spent Online-Past 7 Days")
strsplit(xx2, pattern_recursive, perl=TRUE)
# [[1]]
# [1] "Radio Stations"                            
# [2] "Listened to Past Week"                     
# [3] "Toronto [[F[M]]-CFXJ-93.5 (93.5 The Move)]"

# [[2]]
# [1] "Total Internet"    "Time Spent Online" "Past 7 Days"   
