Extract a substring according to a pattern


Question

Suppose I have a vector of strings:

string = c("G1:E001", "G2:E002", "G3:E003")

Now I would like to get a character vector that contains only the parts after the colon ":", i.e. substring = c("E001", "E002", "E003").

Is there a convenient way to do this in R? Perhaps using substr?

Recommended answer

Here are a few ways:

1) sub

sub(".*:", "", string)
## [1] "E001" "E002" "E003"
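One detail worth keeping in mind is that ".*" is greedy, so the pattern above removes everything up to the last colon. The sketch below uses a hypothetical input x (not from the question) with two colons in one element to show the difference from a pattern that stops at the first colon.

```r
# Hypothetical input: the second element contains two colons.
x <- c("G1:E001", "G2:E002:extra")

# Greedy ".*:" removes everything up to and including the LAST colon.
after_last <- sub(".*:", "", x)

# "^[^:]*:" removes everything up to and including the FIRST colon only.
after_first <- sub("^[^:]*:", "", x)

print(after_last)   # "E001" "extra"
print(after_first)  # "E001" "E002:extra"
```

For the input in the question, which has exactly one colon per element, both patterns give the same result.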

2) strsplit

sapply(strsplit(string, ":"), "[", 2)
## [1] "E001" "E002" "E003"
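As a rough illustration of what the one-liner does, strsplit returns a list with one character vector per input element, and "[" with index 2 picks the second piece of each. The variant below is a sketch that swaps sapply for vapply so the result type is checked, and uses fixed = TRUE on the assumption that the separator is a literal string rather than a regular expression.

```r
string <- c("G1:E001", "G2:E002", "G3:E003")

# One character vector per element, e.g. c("G1", "E001").
parts <- strsplit(string, ":", fixed = TRUE)

# Take the second piece of each; vapply guarantees a character vector.
# An element without a colon would yield NA here.
second <- vapply(parts, function(p) p[2], character(1))

print(second)  # "E001" "E002" "E003"
```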

3) read.table

read.table(text = string, sep = ":", as.is = TRUE)$V2
## [1] "E001" "E002" "E003"

4) substring

This assumes the second portion always starts at the 4th character (which is the case in the example in the question):

substring(string, 4)
## [1] "E001" "E002" "E003"

4a) substring/regex

If the colon were not always in a known position, we could modify (4) by searching for it:

substring(string, regexpr(":", string) + 1)
## [1] "E001" "E002" "E003"
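regexpr returns -1 when there is no match, in which case the expression above would return the whole string unchanged. The following is a minimal sketch, using a hypothetical input x with prefixes of different lengths and one element without a colon, that returns NA for such elements instead. Whether NA is the right fallback is an assumption; it depends on the data.

```r
# Hypothetical input: varying prefix lengths and one element with no colon.
x <- c("G1:E001", "LONGID:E002", "nocolon")

# Position of the first colon, or -1 if there is none.
# as.integer() drops the extra attributes that regexpr attaches.
pos <- as.integer(regexpr(":", x, fixed = TRUE))

# Keep the part after the colon; use NA where no colon was found.
result <- ifelse(pos > 0, substring(x, pos + 1), NA_character_)

print(pos)     # 3 7 -1
print(result)  # "E001" "E002" NA
```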

5) strapplyc

strapplyc returns the parenthesized portion of the match:

library(gsubfn)
strapplyc(string, ":(.*)", simplify = TRUE)
## [1] "E001" "E002" "E003"
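If installing gsubfn is not an option, a similar capture-group extraction can be sketched in base R with regexec and regmatches. This is a swapped-in alternative, not part of the gsubfn approach above.

```r
string <- c("G1:E001", "G2:E002", "G3:E003")

# regexec records the full match and each capture group;
# regmatches turns that into c(full_match, group1) per element.
m <- regmatches(string, regexec(":(.*)", string))

# The second entry of each vector is the first capture group.
captured <- vapply(m, function(g) g[2], character(1))

print(captured)  # "E001" "E002" "E003"
```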

6) read.dcf

This one only works if the substrings before the colon are unique (which they are in the example in the question). It also requires that the separator be a colon (which it is in the question). If a different separator were used, we could first use sub to replace it with a colon. For example, if the separator were _ then string <- sub("_", ":", string)

c(read.dcf(textConnection(string)))
## [1] "E001" "E002" "E003"
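To see why this works, it may help to look at the intermediate result: read.dcf treats each "name:value" line as a field of a single record and returns a one-row character matrix whose column names are the parts before the colon. The sketch below stores the connection in a variable (con is just an illustrative name) so that it can be closed explicitly.

```r
string <- c("G1:E001", "G2:E002", "G3:E003")

con <- textConnection(string)
m <- read.dcf(con)   # one record with fields G1, G2, G3
close(con)

print(dim(m))        # 1 3
print(colnames(m))   # "G1" "G2" "G3"

# c() drops the matrix dimensions and dimnames, leaving a plain vector.
values <- c(m)
print(values)        # "E001" "E002" "E003"
```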

7) separate

7a) Using tidyr::separate we create a data frame with two columns, one for the part before the colon and one for the part after, and then extract the latter.

library(dplyr)
library(tidyr)

DF <- data.frame(string)
DF %>% 
  separate(string, into = c("pre", "post")) %>% 
  pull("post")
## [1] "E001" "E002" "E003"

7b) Alternatively, separate can be used to create just the post column, after which we unlist and unname the resulting data frame:

library(dplyr)
library(tidyr)

DF %>% 
  separate(string, into = c(NA, "post")) %>% 
  unlist %>%
  unname
## [1] "E001" "E002" "E003"

8) trimws

We can use trimws to trim word characters off the left and then use it again to trim the colon. This relies on the whitespace argument of trimws, which is available in R 3.6.0 and later.

trimws(trimws(string, "left", "\\w"), "left", ":")
## [1] "E001" "E002" "E003"
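The nested call can be easier to follow when split into two steps. The sketch below spells out the argument names; step1 and step2 are illustrative variable names only.

```r
string <- c("G1:E001", "G2:E002", "G3:E003")

# Step 1: strip leading word characters (letters, digits, underscore).
step1 <- trimws(string, which = "left", whitespace = "\\w")
print(step1)  # ":E001" ":E002" ":E003"

# Step 2: strip the leading colon that is now at the start.
step2 <- trimws(step1, which = "left", whitespace = ":")
print(step2)  # "E001" "E002" "E003"
```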

Note

The input string is assumed to be:

string <- c("G1:E001", "G2:E002", "G3:E003")
