R extract a part of a string in R

Views: 141 | Published: 2021/5/28 20:22:26 | Tags: string, r, lapply


Problem description


I have 5 million sequences (probes to be specific) as below. I need to extract the name from each string.

The names here are 1007_s_at:123:381, 1007_s_at:128:385, and so on.

I am using the lapply function, but it is taking too much time. I have several other similar files. Could you suggest a faster way to do this?

nm <- c(
  "probe:HG-Focus:1007_s_at:123:381; Interrogation_Position=3570; Antisense;",
  "probe:HG-Focus:1007_s_at:128:385; Interrogation_Position=3615; Antisense;",
  "probe:HG-Focus:1007_s_at:133:441; Interrogation_Position=3786; Antisense;",
  "probe:HG-Focus:1007_s_at:142:13; Interrogation_Position=3878; Antisense;",
  "probe:HG-Focus:1007_s_at:156:191; Interrogation_Position=3443; Antisense;",
  "probe:HTABC:1007_s_at:244:391; Interrogation_Position=3793; Antisense;"
)

# Keep the text before the first ";" and strip the leading "probe:" prefix.
extractProbe <- function(x) sub("probe:", "", strsplit(x, ";", fixed = TRUE)[[1]][1], ignore.case = TRUE)
# Call the function once per string; the result is a list.
pr <- lapply(nm, extractProbe)

Output

1007_s_at:123:381
1007_s_at:128:385
1007_s_at:133:441
1007_s_at:142:13
1007_s_at:156:191
1007_s_at:244:391

Solution

Using regular expressions:

sub("probe:(.*?):(.*?);.*$", "\\2", nm, perl = TRUE)

A bit of explanation:

. means "any character".
.* means "any number of characters".
.*? means "any number of characters, but do not be greedy".
Patterns within parentheses are captured and assigned to \\1, \\2, etc.
$ means end of the line (or string).
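
To make the greedy versus lazy distinction above concrete, here is a minimal sketch on one of the sample strings. The variable names greedy and lazy are only illustrative; the only change between the two calls is the ? after the first .*.

```r
x <- "probe:HG-Focus:1007_s_at:123:381; Interrogation_Position=3570; Antisense;"

# Greedy: the first .* grabs as much as it can and only backs off to the
# LAST colon, so the capture group is left with the final field.
greedy <- sub("probe:.*:(.*?);.*$", "\\1", x, perl = TRUE)

# Lazy: the first .*? stops at the FIRST colon after "probe:",
# so the capture group gets the full id.
lazy <- sub("probe:.*?:(.*?);.*$", "\\1", x, perl = TRUE)

print(greedy)  # "381"
print(lazy)    # "1007_s_at:123:381"
```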

So here, the pattern matches the whole line, and captures two things via the two (.*?): the HG-Focus (or other) thing you don't want as \\1 and your id as \\2. By setting the replacement to \\2, we are effectively replacing the whole string with your id.
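
As a quick check of what each group holds, the same pattern can be run with different replacements. This is only an illustrative sketch; first_group, second_group and both are made-up names, not part of the original answer.

```r
x <- "probe:HTABC:1007_s_at:244:391; Interrogation_Position=3793; Antisense;"
pattern <- "probe:(.*?):(.*?);.*$"

# The replacement decides which captured piece survives.
first_group  <- sub(pattern, "\\1", x, perl = TRUE)  # the part you don't want
second_group <- sub(pattern, "\\2", x, perl = TRUE)  # the id

# Both groups can also be combined in a single replacement string.
both <- sub(pattern, "\\1 | \\2", x, perl = TRUE)

print(first_group)   # "HTABC"
print(second_group)  # "1007_s_at:244:391"
print(both)          # "HTABC | 1007_s_at:244:391"
```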

I now realize it was not necessary to capture the first thing, so this would work just as well:

sub("probe:.*?:(.*?);.*$", "\\1", nm, perl = TRUE)
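
Since sub() is vectorized over its input, no lapply is needed: one call handles the whole vector and returns a character vector rather than a list. The sketch below checks that on a few of the sample strings and adds a rough timing comparison. The repetition count is arbitrary, and the timings will vary by machine, so treat them as an indication only.

```r
nm <- c(
  "probe:HG-Focus:1007_s_at:123:381; Interrogation_Position=3570; Antisense;",
  "probe:HG-Focus:1007_s_at:142:13; Interrogation_Position=3878; Antisense;",
  "probe:HTABC:1007_s_at:244:391; Interrogation_Position=3793; Antisense;"
)
pattern <- "probe:.*?:(.*?);.*$"

# One vectorized call over the whole vector.
ids <- sub(pattern, "\\1", nm, perl = TRUE)

# The same pattern applied one string at a time, for comparison only.
ids_loop <- unlist(lapply(nm, function(x) sub(pattern, "\\1", x, perl = TRUE)))

stopifnot(identical(ids, ids_loop))
print(ids)

# Rough timing on an artificially enlarged input (size chosen arbitrarily).
big <- rep(nm, 20000)
print(system.time(sub(pattern, "\\1", big, perl = TRUE)))
print(system.time(lapply(big, function(x) sub(pattern, "\\1", x, perl = TRUE))))
```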
