How to get only a given captured group with <regex> in C++
Question
I want to extract a tag's inner content. From the following string:
<tag1 val=123>Hello</tag1>
I want to get only:
Hello
What I do:
string s = "<tag1 val=123>Hello</tag1>";
regex re("<tag1.*>(.*)</tag1>");
smatch matches;
bool b = regex_match(s, matches, re);
But it returns two matches:
<tag1 val=123>Hello</tag1>
Hello
And when I try to get only the 1st captured group like this:
"<tag1.*>(.*)</tag1>\1"
I get zero matches.
Please advise.
Recommended answer
regex_match returns only a single match, which holds all the capturing group submatches (how many there are depends on the number of groups in the pattern).
Here, you get only 1 match that contains two submatches: 1) the whole match, 2) the capture group 1 value.
To obtain the contents of the capturing group, access the second element of the matches object (an smatch): matches[1].str() or matches.str(1).
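As a minimal sketch of the point above, the helper below wraps the pattern from the question and returns only group 1. The name inner_content and its bool/out-parameter shape are assumptions made for illustration; they do not come from the original post.

```cpp
#include <regex>
#include <string>

// Hypothetical helper, not part of the original question: returns true and
// stores capture group 1 in `out` when the whole string matches the pattern.
bool inner_content(const std::string& s, std::string& out) {
    static const std::regex re("<tag1.*>(.*)</tag1>");
    std::smatch matches;
    if (!std::regex_match(s, matches, re)) {
        return false;  // regex_match requires the entire string to match
    }
    // matches.size() is 2 here: matches[0] is the whole match,
    // matches[1] is capture group 1.
    out = matches.str(1);  // same as matches[1].str()
    return true;
}
```

Called with the string from the question, inner_content should report success and leave Hello in out, without the surrounding tags.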
Note that when you write "<tag1.*>(.*)</tag1>\1", the \1 is not parsed as a backreference: inside a C++ string literal it is an octal escape, that is, a char with code 1. Even if you defined a real backreference (as "<tag1.*>(.*)</tag1>\\1"), the pattern would require the whole text captured by group 1 to be repeated after </tag1>, which is definitely not what you want. I also doubt this regex is robust: at the least, you need to replace ".*" with "[\\s\\S]*?" so that the match is lazy and can span line breaks. Even then, parsing HTML with regex remains a fragile approach.
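The two remarks above can be sketched as follows. Both helper names are hypothetical, and the use of regex_search (to find a tag anywhere in the input rather than requiring the whole string to match) is an assumption added for illustration, not something the answer prescribes.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true only if the text captured by group 1 is repeated
// right after </tag1>, because "\\1" in the literal yields the regex
// backreference \1.
bool matches_with_backreference(const std::string& s) {
    static const std::regex re("<tag1.*>(.*)</tag1>\\1");
    return std::regex_match(s, re);
}

// Hypothetical helper: finds the first <tag1 ...>...</tag1> pair anywhere in
// `s` and stores its inner content in `out`. [\s\S] also matches line breaks,
// and *? stops at the earliest closing tag.
bool first_inner_content(const std::string& s, std::string& out) {
    static const std::regex re("<tag1[\\s\\S]*?>([\\s\\S]*?)</tag1>");
    std::smatch matches;
    if (!std::regex_search(s, matches, re)) {
        return false;
    }
    out = matches.str(1);
    return true;
}
```

With the backreference pattern, the original string is rejected, while the same string followed by a second Hello is accepted. The lazy pattern picks the content of the first tag when several tags sit on one line, where the greedy ".*" version would run on to the last closing tag.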