Extract text between certain symbols using Regular Expression in R
Question
I have a series of expressions such as:
"<i>the text I need to extract</i></b></a></div>"
I need to extract the text between the <i> and </i> "symbols". That is, the result should be:
"the text I need to extract"
At the moment I am using gsub in R to manually remove all the symbols that are not text. However, I would like to use a regular expression to do the job. Does anyone know a regular expression to extract the text between <i> and </i>?
Thanks.
Recommended answer
If there is only one <i>...</i> as in the example, then match everything up to <i> and everything from </i> forward, and replace both with the empty string:
x <- "<i>the text I need to extract</i></b></a></div>"
gsub(".*<i>|</i>.*", "", x)
giving:
[1] "the text I need to extract"
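The "only one" condition matters because `.*<i>` is greedy and consumes everything up to the last <i> in the string. A minimal sketch of that behaviour follows; the second input, x2, is a hypothetical string invented for illustration and is not part of the original question.

```r
# Input from the question: exactly one <i>...</i> pair
x <- "<i>the text I need to extract</i></b></a></div>"

# Hypothetical input with two pairs (an assumption, for illustration only)
x2 <- "<i>first</i> and <i>second</i>"

# Same pattern as in the answer: drop everything up to <i> and from </i> onward
one <- gsub(".*<i>|</i>.*", "", x)
two <- gsub(".*<i>|</i>.*", "", x2)

print(one)  # the single enclosed text
print(two)  # only the last enclosed text survives; "first" is lost
```

With more than one pair, the earlier matches are swallowed by the greedy `.*`, so this pattern is only suitable when a single pair is guaranteed.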
If there could be multiple occurrences in the same string then try:
library(gsubfn)
strapplyc(x, "<i>(.*?)</i>", simplify = c)
giving the same result in this example.
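If installing gsubfn is not an option, a similar result can probably be had with base R alone. The sketch below is not part of the original answer: it swaps in gregexpr() and regmatches() for strapplyc(), and the input x2 is a hypothetical string with two pairs.

```r
# Hypothetical input with two <i>...</i> pairs (an assumption, for illustration)
x2 <- "<i>first</i> and <b><i>second</i></b>"

# Locate every non-greedy <i>...</i> match; perl = TRUE selects the PCRE engine
m <- gregexpr("<i>.*?</i>", x2, perl = TRUE)

# regmatches() returns a list with one character vector per input string;
# each element still includes the surrounding tags
tagged <- regmatches(x2, m)[[1]]

# Strip the leading <i> and trailing </i> to keep only the inner text
inner <- gsub("^<i>|</i>$", "", tagged)
print(inner)
```

The non-greedy `.*?` stops at the first </i> after each <i>, which is what keeps the two pairs separate. Like any regex approach, this assumes the tags are not nested; for arbitrary HTML a proper parser is the safer choice.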