Use lapply on a subset of list elements and return a list of the same length as the original in R


Question

I want to apply a regex operation to a subset of list elements (which are character strings) using lapply and return a list of the same length as the original. The list elements are long strings (derived from reading in long text files and collapsing the paragraphs into a single string). The regex operation is valid only for that subset of list elements. I want the list elements outside the subset to be returned in their original state.

The regex operation is str_extract from the stringr package, i.e. I want to extract a substring from a longer string. I select the subset of list elements based on a regex pattern in the filename.

Example with simplified data:

library(stringr)
texts <- as.list(c("abcdefghijkl", "mnopqrstuvwxyz", "ghijklmnopqrs", "uvwxyzabcdef"))
filenames <- c("AB1997R.txt", "BG2000S.txt", "MN1999R.txt", "DC1997S.txt")
names(texts) <- filenames
regexp <- "abcdef"

I know in advance which strings I want to apply the regex operation to, so I want to subset those strings. That is, I don't want to run the regex over every element in the list, because doing so would return some invalid results (which is not apparent in this simplified example).
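
The selection step described above can be sketched on its own. This is only an illustration: it rebuilds the example list, uses base R's `grepl` as a stand-in for stringr's `str_detect`, and the name `is1997` is an assumption chosen for readability.

```r
# Rebuild the simplified example data
texts <- as.list(c("abcdefghijkl", "mnopqrstuvwxyz", "ghijklmnopqrs", "uvwxyzabcdef"))
names(texts) <- c("AB1997R.txt", "BG2000S.txt", "MN1999R.txt", "DC1997S.txt")

# Logical index: TRUE where the filename contains "1997"
# (grepl is used here in place of stringr::str_detect)
is1997 <- grepl("1997", names(texts), fixed = TRUE)

# Subsetting with [ keeps only the selected elements,
# so anything computed from `selected` alone is shorter than `texts`
selected <- texts[is1997]
length(texts)
length(selected)
```

Keeping the index as a logical vector of the same length as the list makes it possible to reuse it later for writing results back into their original positions.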

I've made a few naive attempts, e.g.:

x <- lapply(texts[str_detect(names(texts), "1997")], str_extract, regexp)
> x
$AB1997R.txt
[1] "abcdef"

$DC1997S.txt
[1] "abcdef"

which returns a reduced-length list containing just the substrings found. But the result I want is:

> x
$AB1997R.txt
[1] "abcdef"

$BG2000S.txt
[1] "mnopqrstuvwxyz"

$MN1999R.txt
[1] "ghijklmnopqrs"

$DC1997S.txt
[1] "abcdef"

where the strings that were not selected are returned in their original state.

I have read up on stringr, lapply and llply (in the plyr package), but many operations are illustrated with data frames rather than lists, and don't involve regex operations on character strings. I can achieve my goal with a for loop, but I'm trying to get away from that, as is generally advised, and get better at using the apply family of functions.

Recommended answer

You can use the subset assignment operator [<-, which replaces only the selected elements and leaves the rest of the list unchanged:

x <- texts
is1997 <- str_detect(names(texts), "1997")
x[is1997] <- lapply(texts[is1997], str_extract, regexp)
x
# $AB1997R.txt
# [1] "abcdef"
#
# $BG2000S.txt
# [1] "mnopqrstuvwxyz"
#
# $MN1999R.txt
# [1] "ghijklmnopqrs"
#
# $DC1997S.txt
# [1] "abcdef"
#
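
If the same pattern is needed in several places, the idea in the answer can be wrapped in a small function. The following is a minimal sketch, not part of base R or stringr: `lapply_where` and `extract_first` are hypothetical names, and `extract_first` is a base-R stand-in for `str_extract` so that the block runs without extra packages.

```r
# Hypothetical helper: apply `f` only to the elements selected by `idx`
# and return a list of the same length as `x`.
lapply_where <- function(x, idx, f, ...) {
  x[idx] <- lapply(x[idx], f, ...)
  x
}

# Base-R stand-in for stringr::str_extract on a single string:
# returns the first match, or NA if the pattern is not found.
extract_first <- function(string, pattern) {
  m <- regmatches(string, regexpr(pattern, string))
  if (length(m) == 0L) NA_character_ else m
}

# Example data from the question
texts <- as.list(c("abcdefghijkl", "mnopqrstuvwxyz", "ghijklmnopqrs", "uvwxyzabcdef"))
names(texts) <- c("AB1997R.txt", "BG2000S.txt", "MN1999R.txt", "DC1997S.txt")

# Extract "abcdef" only from the files whose names contain "1997"
x <- lapply_where(texts, grepl("1997", names(texts)), extract_first,
                  pattern = "abcdef")
x
```

Because the replacement goes through `[<-`, the names and order of the original list are preserved. For readers who already use purrr, `map_if` accepts a logical vector as its predicate and behaves similarly.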
