Regex: How to extract text from last parenthesis


Question

What is the correct regular expression to extract the string "(procedure)" (or, in general, the text inside the last pair of parentheses) from the strings below?

An example input string is:

Positron emission tomography using flutemetamol (18F) with computed tomography of brain (procedure)

Another example:

Urinary tract infection prophylaxis (procedure)

Possible approaches are:

  • Go to the end of the text, look backwards for the first opening parenthesis, and take the substring from that position to the end of the text

  • From the beginning of the text, identify the last '(' character and take the substring from that position to the end
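The substring approach described above could be sketched in R roughly as follows. This is only an illustration: `last_paren_suffix` is a hypothetical helper name, not something from the original post, and returning `NA` for strings without a "(" is an assumption about the desired behaviour.

```r
# Hypothetical helper (not from the original post): return the text from the
# last "(" to the end of each string, or NA when a string contains no "(".
last_paren_suffix <- function(x) {
  # fixed = TRUE searches for the literal character "(" rather than a regex
  positions <- gregexpr("(", x, fixed = TRUE)
  # keep only the last match position for each string
  starts <- vapply(positions, function(p) as.integer(p[length(p)]), integer(1))
  # gregexpr() reports -1 when nothing matched
  ifelse(starts > 0, substring(x, starts), NA_character_)
}

last_paren_suffix(c(
  "Urinary tract infection prophylaxis (procedure)",
  "Positron emission tomography using flutemetamol (18F) with computed tomography of brain (procedure)"
))
```

Note that this keeps the surrounding parentheses, as in "(procedure)"; stripping them would need an extra step.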

Other strings can be (a different "tag" is extracted):

[1] "Xanthoma of eyelid (disorder)"                    "Ventricular tachyarrhythmia (disorder)"          
[3] "Abnormal urine odor (finding)"                    "Coloboma of iris (disorder)"                     
[5] "Macroencephaly (disorder)"                        "Right main coronary artery thrombosis (disorder)"

(A general regex is sought, but a solution in R would be even better.)

Recommended answer

sub can do that with the right regex:

Text = c("Positron emission tomography using flutemetamol (18F) with computed tomography of brain (procedure)",
    "Urinary tract infection prophylaxis (procedure)", 
    "Xanthoma of eyelid (disorder)",                    
    "Ventricular tachyarrhythmia (disorder)",          
    "Abnormal urine odor (finding)",                    
    "Coloboma of iris (disorder)",                   
    "Macroencephaly (disorder)",                        
    "Right main coronary artery thrombosis (disorder)")
sub(".*\\((.*)\\).*", "\\1", Text)
[1] "procedure" "procedure" "disorder"  "disorder"  "finding"   "disorder" 
[7] "disorder"  "disorder"

Addendum: Detailed explanation of the regex
The question asks for the content of the final set of parentheses in each string. This expression is slightly confusing because it uses parentheses in two different ways: one is to represent literal parentheses in the string being processed, and the other is to set up a "capturing group", which is how we specify what part of the match should be returned. The expression is made up of five basic units:

1. Initial .*   - matches everything up to the final open parenthesis. 
   Note that this is relying on "greedy matching"
2. \\(   ...    \\)   - matches the final set of parentheses. 
   Because ( by itself means something else, we need to "escape" the
   parentheses by preceding them with \. That is, we want the regular
   expression to say   \(  ...  \).  However, R also treats \ as an
   escape character inside string literals, and "\(" is not a valid
   string escape, so typing \( and \) directly is an error. So we
   escape the backslash. R will read   \\(  ... \\)   as \( ... \),
   which the regex engine takes to mean the literal characters ( and ).
3. ( ... )       Inside the pair in part 2
   This is making use of the special meaning of parentheses.  When we
   enclose an expression in parentheses, whatever value is inside them 
   will be stored in a variable for later use. That variable is called 
   \1,  which is what was used in the substitution pattern. Again, if
   we just wrote \1, R would treat the \ as the start of a string escape
   rather than as a literal backslash. Writing \\1 is interpreted as the
   character \ followed by 1, i.e. \1.
4. Central .*    Inside the pair in part 3
   This is what we are looking for,  all characters inside the parentheses.
5. Final   .*
   This is in the expression to match any characters that may follow the 
   final set of parentheses. 
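To make units 1 and 2 more concrete, here is a small sketch comparing the greedy leading `.*` with a lazy `.*?`. The lazy variant, the `[^)]*` capture group and the shortened test string are illustrative assumptions, not part of the original answer; `perl = TRUE` is used for the lazy quantifier.

```r
x <- "flutemetamol (18F) with computed tomography of brain (procedure)"

# The R string "\\(" holds two characters: a backslash followed by "(".
n_chars <- nchar("\\(")

# Greedy leading .* consumes as much as it can, so the capture group
# ends up on the last pair of parentheses.
greedy <- sub(".*\\((.*)\\).*", "\\1", x)

# Lazy leading .*? stops at the first "(" instead; [^)]* keeps the
# capture from running past the first closing parenthesis.
lazy <- sub(".*?\\(([^)]*)\\).*", "\\1", x, perl = TRUE)

print(c(greedy = greedy, lazy = lazy))
```

Here `greedy` picks out the last tag ("procedure"), while `lazy` picks out the first one ("18F"), which is why the original expression depends on greedy matching.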

The sub function uses this to replace the matched pattern (in this case, all characters in the string) with the substitution pattern \1, i.e. the contents of the variable holding whatever was in the first (in our case, only) capturing group: the text inside the final parentheses.
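One consequence worth keeping in mind: when a string contains no parentheses, the pattern does not match and sub returns that string unchanged. The sketch below shows this and one possible guard using grepl. The second input string and the choice of `NA` as the "no tag" value are assumptions made for illustration.

```r
Text <- c("Macroencephaly (disorder)", "No tag here")
pattern <- ".*\\((.*)\\).*"

# sub() leaves non-matching strings as they are
unguarded <- sub(pattern, "\\1", Text)

# Possible guard: only substitute where the pattern matches, NA otherwise
tags <- ifelse(grepl(pattern, Text), sub(pattern, "\\1", Text), NA_character_)

print(unguarded)
print(tags)
```

Without the guard, "No tag here" would come back looking like an extracted tag, so checking for a match first may be safer on real data.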
