Extract & combine multiple substrings using multiple patterns from some but not all strings contained in list & return to list in R


Question

I'd like to find an elegant and easily manipulable way to:

  1. extract multiple substrings from some, but not all, strings that are contained as elements of a list (each list element consists of just one long string)
  2. replace the respective original long string with these multiple substrings
  3. collapse the substrings in each list element into 1 string
  4. return a list of same length containing the replacement substrings and the untouched long strings as appropriate.

This question is a follow-on (though different) from my earlier question: replace strings of some list elements with substring. Note, I don't want to run the regex patterns over all list elements, only those elements to which the regex applies.

I know the end result can be delivered by str_replace or sub by matching the entire strings to be changed and returning the text captured by capturing groups, as follows:

library(stringr)
myList <- as.list(c("OneTwoThreeFourFive", "mnopqrstuvwxyz", "ghijklmnopqrs", "TwentyTwoFortyFourSixty"))
fileNames <- c("AB1997R.txt", "BG2000S.txt", "MN1999R.txt", "DC1997S.txt")
names(myList) <- fileNames
is1997 <- str_detect(names(myList), "1997")

regexp <- ".*(Two).*(Four).*"
myListNew2 <- myList
myListNew2[is1997] <- lapply(myList[is1997], function(i) str_replace(i, regexp, "\\1££\\2"))

## This does return what I want:
myListNew2
$AB1997R.txt
[1] "Two££Four"

$BG2000S.txt
[1] "mnopqrstuvwxyz"

$MN1999R.txt
[1] "ghijklmnopqrs"

$DC1997S.txt
[1] "Two££Four"
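
For comparison, the same whole-string replacement can be written with base R's sub, so stringr is not needed for this step. This is only a sketch of the approach described above, rebuilt on the same sample data; myListSub is a name made up here for illustration.

```r
# Base R sketch of the whole-string approach (no packages required).
# Same sample data as above; `myListSub` is an illustrative name.
myList <- as.list(c("OneTwoThreeFourFive", "mnopqrstuvwxyz",
                    "ghijklmnopqrs", "TwentyTwoFortyFourSixty"))
names(myList) <- c("AB1997R.txt", "BG2000S.txt", "MN1999R.txt", "DC1997S.txt")

# Select the elements to change by file name, as str_detect() did
is1997 <- grepl("1997", names(myList), fixed = TRUE)

# The pattern must match the entire string so that only the
# two captured groups survive the replacement
regexp <- ".*(Two).*(Four).*"
myListSub <- myList
myListSub[is1997] <- lapply(myList[is1997],
                            function(s) sub(regexp, "\\1££\\2", s))
```

As with str_replace, elements whose names do not contain "1997" are never passed to the regex and stay untouched.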

But I would prefer to do it without having to match the entire original text (because of, e.g., the time required to match very long texts, and the complexity of multiple regex patterns and the difficulty of knitting them together so that they match entire strings successfully). I would like to use separate regex patterns to extract the substrings and then replace the original string with these extracts. I came up with the following, which works. But surely there is an easier, better way! llply?

patternA <- "Two"
patternB <- "Four"
x <- myList[is1997]
x2 <- unlist(x)
stringA <- str_extract(x2, patternA)
stringB <- str_extract(x2, patternB)
x3 <- mapply(FUN=c, stringA, stringB, SIMPLIFY=FALSE)
x4 <- lapply(x3, function(i) paste(i, collapse = "££"))
x5 <- relist(x4,x2)
myListNew1 <- replace(myList, is1997, x5)
myListNew1

$AB1997R.txt
[1] "Two££Four"

$BG2000S.txt
[1] "mnopqrstuvwxyz"

$MN1999R.txt
[1] "ghijklmnopqrs"

$DC1997S.txt
[1] "Two££Four"

Recommended answer

Something like this maybe, where I've extended the patterns you are looking for to show how it could become adaptable:

library(stringr)
patterns <- c("Two","Four","Three")
hits <- lapply(myList[is1997], function(x) {
  out <- sapply(patterns, str_extract, string=x)
  paste(out[!is.na(out)],collapse="££")
})
myList[is1997] <- hits

myList
#$AB1997R.txt
#[1] "Two££Four££Three"
#
#$BG2000S.txt
#[1] "mnopqrstuvwxyz"
#
#$MN1999R.txt
#[1] "ghijklmnopqrs"
#
#$DC1997S.txt
#[1] "Two££Four"
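
If stringr is not available, the same idea can be sketched in base R with regexpr and regmatches. This is an illustrative variant rather than the code from the answer: extract_and_collapse is a hypothetical helper name, and result is used instead of overwriting myList so the original list stays intact.

```r
# Hypothetical helper (not part of base R or stringr): take the first
# match of each pattern in one string and join the hits with `sep`.
extract_and_collapse <- function(string, patterns, sep = "££") {
  hits <- vapply(patterns, function(p) {
    m <- regmatches(string, regexpr(p, string))  # character(0) if no match
    if (length(m) == 0L) NA_character_ else m
  }, character(1), USE.NAMES = FALSE)
  paste(hits[!is.na(hits)], collapse = sep)
}

# Same sample data as in the question
myList <- as.list(c("OneTwoThreeFourFive", "mnopqrstuvwxyz",
                    "ghijklmnopqrs", "TwentyTwoFortyFourSixty"))
names(myList) <- c("AB1997R.txt", "BG2000S.txt", "MN1999R.txt", "DC1997S.txt")
is1997 <- grepl("1997", names(myList), fixed = TRUE)

patterns <- c("Two", "Four", "Three")
result <- myList
result[is1997] <- lapply(myList[is1997], extract_and_collapse,
                         patterns = patterns)
```

Two properties of this sketch are worth keeping in mind: the extracted pieces appear in the order of patterns, not in the order they occur in the string, and a selected element in which no pattern matches is replaced by an empty string.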
