Extract a part of a string in R
I have 5 million sequences (probes, to be specific) like the ones below. I need to extract the name from each string.
The names here are 1007_s_at:123:381, 1007_s_at:128:385, and so on.
I am using the lapply function, but it is taking too much time, and I have several other similar files. Could you suggest a faster way to do this?
nm = c(
"probe:HG-Focus:1007_s_at:123:381; Interrogation_Position=3570; Antisense;",
"probe:HG-Focus:1007_s_at:128:385; Interrogation_Position=3615; Antisense;",
"probe:HG-Focus:1007_s_at:133:441; Interrogation_Position=3786; Antisense;",
"probe:HG-Focus:1007_s_at:142:13; Interrogation_Position=3878; Antisense;",
"probe:HG-Focus:1007_s_at:156:191; Interrogation_Position=3443; Antisense;",
"probe:HTABC:1007_s_at:244:391; Interrogation_Position=3793; Antisense;")
extractProbe <- function(x) sub("probe:", "", strsplit(x, ";", fixed=TRUE)[[1]][1], ignore.case=TRUE)
pr = lapply(nm, extractProbe)
Desired output
1007_s_at:123:381
1007_s_at:128:385
1007_s_at:133:441
1007_s_at:142:13
1007_s_at:156:191
1007_s_at:244:391
Using regular expressions:
sub("probe:(.*?):(.*?);.*$", "\\2", nm, perl = TRUE)
A bit of explanation:
- . means "any character".
- .* means "any number of characters".
- .*? means "any number of characters, but do not be greedy".
- Patterns within parentheses are captured and assigned to \\1, \\2, etc.
- $ means end of the line (or string).
So here, the pattern matches the whole line and captures two things via the two (.*?) groups: the HG-Focus (or other) thing you don't want as \\1, and your id as \\2. By setting the replacement to \\2, we are effectively replacing the whole string with your id.
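To see what each capture group holds, and why the non-greedy form matters, the pattern can be tried on a single string. This is only a minimal sketch: it uses one probe string from the question, and the variable names x, array_name, probe_id and greedy_id are made up for illustration.

```r
# One probe string taken from the question's data
x <- "probe:HG-Focus:1007_s_at:123:381; Interrogation_Position=3570; Antisense;"

# Lazy groups: the first (.*?) stops at the first ':' after "probe:",
# the second (.*?) stops at the first ';'
array_name <- sub("probe:(.*?):(.*?);.*$", "\\1", x, perl = TRUE)
probe_id   <- sub("probe:(.*?):(.*?);.*$", "\\2", x, perl = TRUE)
array_name  # "HG-Focus"
probe_id    # "1007_s_at:123:381"

# Greedy first group: (.*) takes as much as it can while the rest of the
# pattern still matches, so it runs up to the LAST ':' before the first ';'
greedy_id <- sub("probe:(.*):(.*?);.*$", "\\2", x, perl = TRUE)
greedy_id   # "381"
```

With the greedy first group, most of the id is swallowed by \\1, which is why the pattern uses .*? instead.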
I now realize it was not necessary to capture the first thing, so this would work just as well:
sub("probe:.*?:(.*?);.*$", "\\1", nm, perl = TRUE)
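Because sub() is vectorised over its input, the call handles the whole character vector at once and no lapply loop is needed. The following is a rough sketch for checking both the result and the speed on your own machine. The shortened sample, the replication factor and the names ids, big, ids_vec and ids_loop are assumptions for illustration, and the timings will vary.

```r
# Two of the probe strings from the question
nm <- c(
  "probe:HG-Focus:1007_s_at:123:381; Interrogation_Position=3570; Antisense;",
  "probe:HTABC:1007_s_at:244:391; Interrogation_Position=3793; Antisense;"
)

# Vectorised call: returns a character vector the same length as nm
ids <- sub("probe:.*?:(.*?);.*$", "\\1", nm, perl = TRUE)
ids

# Hypothetical larger input, built by repeating the sample
big <- rep(nm, 10000)

# Time the single vectorised call against an element-by-element loop
system.time(ids_vec <- sub("probe:.*?:(.*?);.*$", "\\1", big, perl = TRUE))
system.time(ids_loop <- unlist(lapply(big, function(x)
  sub("probe:.*?:(.*?);.*$", "\\1", x, perl = TRUE))))

# Both approaches should give the same ids
identical(ids_vec, ids_loop)
```

The single vectorised call avoids the per-element function-call overhead of lapply, which is usually where the time goes with millions of short strings.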