R use variable within regex


Problem description


Okay, maybe this is a better example. I am looking for guidance or references on how to reference a variable within a regex, not on how to build a regex for this data.

How can you use the value of one variable inside the regex that extracts the next variable?

library(plyr)
library(tm)
library(stringr)
library(gsubfn)

Dataset of velocities

# d1 must exist before `$` assignment, so build it as a data frame
d1 <- data.frame(sub = c(
  "LEFT CAROTID STENOSIS: (50-69)APPROXIMATELY 50-55% (0-49)LESS THAN 50% COMMON:",
  "LEFT CAROTID STENOSIS: (50-69)APPROXIMATELY 60-70% (0-49)LESS THAN 50% COMMON:",
  "LEFT CAROTID STENOSIS: (40-60)APPROXIMATELY 40% INCOMPLETE SCAN SEE NOTES (40-50)LESS THAN 50% COMMON:"
), stringsAsFactors = FALSE)

d1$sub
[1] "LEFT CAROTID STENOSIS: (50-69)APPROXIMATELY 50-55% (0-49)LESS THAN 50% COMMON:"
[2] "LEFT CAROTID STENOSIS: (50-69)APPROXIMATELY 60-70% (0-49)LESS THAN 50% COMMON:"
[3] "LEFT CAROTID STENOSIS: (40-60)APPROXIMATELY 40% INCOMPLETE SCAN SEE NOTES (40-50)LESS THAN 50% COMMON:"

Extract sub1

d1$sub1 <- as.character(lapply((strapply(d1$sub,"((?<=LEFT CAROTID STENOSIS:).{5,}?(?=(\\(|COMMON)))", perl=TRUE)), unique))
d1$sub1
[1] " (50-69)APPROXIMATELY 50-55% "                       
[2] " (50-69)APPROXIMATELY 60-70% "                       
[3] " (40-60)APPROXIMATELY 40% INCOMPLETE SCAN SEE NOTES "

Now reference sub1 to get sub2 from the data

I want to return "(0-49)LESS THAN 50%", "(0-49)LESS THAN 50%", and "(40-50)LESS THAN 50%"

d1$sub2 <- as.character(lapply((strapply(d1$sub,"((?<=\\d1$sub1).*?(?=COMMON))", perl=TRUE)), unique))
d1$sub2
[1] "NULL" "NULL" "NULL"

** Original Post Below **

I am extracting medical information from text reports, and am attempting to use one variable ($sub1) as part of a regex to find the next variable ($sub2).

How can you use the value of one variable inside the regex that extracts the next variable?

library(plyr)
library(tm)
library(stringr)
library(gsubfn)

#Dataset of velocities
d1 <- c("CCA: 135 cm/sec ICA: 50 cm/sec", "CCA: 150 cm/sec ICA: 75 cm/sec")
d1
[1] "CCA: 135 cm/sec ICA: 50 cm/sec" "CCA: 150 cm/sec ICA: 75 cm/sec"

#Lookahead to get sub1
d1$sub1 <- as.character(lapply((strapply(d1,"(.*?(?=ICA:))", perl=TRUE)), unique))
Warning message:
In d1$sub1 <- as.character(lapply((strapply(d1, "(.*?(?=ICA:))",  :
 Coercing LHS to a list
d1
[[1]]
[1] "CCA: 135 cm/sec ICA: 50 cm/sec"

[[2]]
[1] "CCA: 150 cm/sec ICA: 75 cm/sec"

$sub1
[1] "CCA: 135 cm/sec " "CCA: 150 cm/sec "

#Now reference sub1 to get sub2 - does not work
#Want to return "ICA: 50 cm/sec" and "ICA: 75 cm/sec"
#Used paste(d1$sub1) to try getting the $sub1 variable into the regex, but it doesn't work
d1$sub2 <- as.character(lapply((strapply(d1,"((?<=paste(d1$sub1)).*?)", perl=TRUE)), unique))
d1$sub2
[1] "NULL" "NULL" "NULL"

The text has structure, but it varies a lot in length, content, etc. Defining the first variable ($sub1) is easy, and using it to locate the second variable would be the most precise approach.

Maybe I should have emphasized that the text is very variable, so a simple regex based on the text pattern will not work. I need to use the first variable to locate the second within the text. It is medical information, so I can't post actual data.

Solution

Try using the paste0() function. It concatenates your variables and any regular-expression fragments into a single pattern string, which you then pass to the regex function.

grep(paste0("^.*", variable, ".*$"), d1)
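
To make the difference concrete, here is a minimal sketch using the two report strings from the original post. The value assigned to `variable` is an assumption chosen for illustration. A name written inside a quoted pattern is treated as literal text, while paste0() splices the variable's value in before the regex engine ever sees the pattern.

```r
# The two report strings from the original post.
d1 <- c("CCA: 135 cm/sec ICA: 50 cm/sec", "CCA: 150 cm/sec ICA: 75 cm/sec")

# Inside quotes nothing is interpolated: this looks for the literal text "variable".
literal_hits <- grepl("^.*variable.*$", d1)
literal_hits
# [1] FALSE FALSE

# Illustrative value; any string held in a variable works the same way.
variable <- "ICA: 50"

# Build the pattern first, then hand the finished string to grep().
pattern <- paste0("^.*", variable, ".*$")
pattern
# [1] "^.*ICA: 50.*$"

grep(pattern, d1)
# [1] 1
```

Printing the pattern before using it is a cheap way to check that the variable's value ended up where you expected.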

You can also add the argument collapse = "" to paste0() if your variable could have more than one element; this joins the elements into a single string.
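
For the data in the question, each row has its own sub1, and sub1 contains regex metacharacters such as "(" and ")". The following is a sketch of one way to handle that, not necessarily what the answerer had in mind: it builds one pattern per row with paste0(), quotes the variable with \Q...\E (available with perl = TRUE), and applies the patterns row by row with mapply(). The helper name `extract_after` is hypothetical, and the sub1 values are copied from the output shown in the question.

```r
# Data from the question, built as a data frame so that `$` assignment works.
d1 <- data.frame(
  sub = c(
    "LEFT CAROTID STENOSIS: (50-69)APPROXIMATELY 50-55% (0-49)LESS THAN 50% COMMON:",
    "LEFT CAROTID STENOSIS: (50-69)APPROXIMATELY 60-70% (0-49)LESS THAN 50% COMMON:",
    "LEFT CAROTID STENOSIS: (40-60)APPROXIMATELY 40% INCOMPLETE SCAN SEE NOTES (40-50)LESS THAN 50% COMMON:"
  ),
  stringsAsFactors = FALSE
)

# sub1 values as printed in the question.
d1$sub1 <- c(" (50-69)APPROXIMATELY 50-55% ",
             " (50-69)APPROXIMATELY 60-70% ",
             " (40-60)APPROXIMATELY 40% INCOMPLETE SCAN SEE NOTES ")

# Hypothetical helper: return the text between `after` and the word COMMON.
extract_after <- function(text, after) {
  # \\Q...\\E makes PCRE treat `after` as literal text, so "(" and ")" need no escaping.
  pattern <- paste0("^.*?\\Q", after, "\\E(.*?)\\s*COMMON.*$")
  # Replace the whole string with the captured group.
  sub(pattern, "\\1", text, perl = TRUE)
}

# sub() takes a single pattern, so pair each text with its own sub1 via mapply().
d1$sub2 <- unname(mapply(extract_after, d1$sub, d1$sub1))
d1$sub2
# [1] "(0-49)LESS THAN 50%"  "(0-49)LESS THAN 50%"  "(40-50)LESS THAN 50%"
```

Two caveats: sub() returns the input unchanged when the pattern does not match, so unmatched rows need a separate check, and \Q...\E quoting breaks if the variable itself contains the sequence \E. If the goal were instead to match any one of several values with a single pattern, collapse = "|" would build an alternation.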
