Match text outside of HTML tags
Question
Before anyone says it: I know I should use a proper parser, but for my use case a regular expression is the better fit.
I have the following regex to try to match text outside of HTML tags:
(?<!<[^>]*)(?<Text>.+?)
However, this seems to match the opening bracket of the tag (the < character). How can I fix this?
Sample input:
<span style="color:blue">some <strong>bold</strong> text</span>
Expected:
some bold text
Actual:
<some <bold< text<
Recommended answer
The problem is that you are using the dot (.), which matches any character (other than a newline, by default), including < and >. Replace it with a negated character class such as [^<>], which matches any character except < and >, and apply a greedy quantifier: * (0 or more occurrences) or + (1 or more occurrences):
(?<!<[^>]*)(?<Text>[^<>]*)
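As a rough illustration of the fix, here is a minimal TypeScript sketch. It rests on several assumptions: the question does not state which regex flavor is in use, so the sketch assumes a JavaScript engine with ES2018 lookbehind support, which accepts this pattern's syntax. The function name textOutsideTags is hypothetical. The sketch uses the + variant rather than *, because under global matching the * variant also yields empty matches at positions such as the very start of the input.

```typescript
// Hypothetical helper: collects every run of text that is not inside a tag.
// Assumes an ES2018+ regex engine (variable-length lookbehind, named groups).
function textOutsideTags(html: string): string[] {
  // (?<!<[^>]*)  : not preceded by a "<" that has no ">" after it yet
  // (?<Text>...) : one or more characters that are neither "<" nor ">"
  const pattern = new RegExp("(?<!<[^>]*)(?<Text>[^<>]+)", "g");
  const pieces: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(html)) !== null) {
    // Group 1 is the named group "Text".
    pieces.push(m[1] ?? "");
  }
  return pieces;
}

const sample = '<span style="color:blue">some <strong>bold</strong> text</span>';
console.log(textOutsideTags(sample).join("")); // some bold text
```

Joining the captured pieces reproduces the expected output from the question. A stray < or > inside the text itself would still confuse the pattern, which is the usual caveat of handling HTML with a regex.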
As an aside, (?<Text>.+?) at the end of the pattern makes the regex engine match only 1 character: +? is a lazy quantifier that matches 1 or more occurrences but as few as possible, and since nothing follows it, 1 occurrence is always enough. A lazily quantified subpattern usually needs some other pattern after it; otherwise it will not capture the intended text.
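The lazy-quantifier behaviour can be checked with a small TypeScript sketch. The helper name allMatches and the sample strings are made up for illustration, and a JavaScript regex engine is assumed. It compares a lazy quantifier at the very end of a pattern with one that is followed by another token.

```typescript
// Hypothetical helper: returns the full text of every match of `source` in `input`.
function allMatches(source: string, input: string): string[] {
  const re = new RegExp(source, "g");
  const found: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(input)) !== null) {
    found.push(m[0]);
  }
  return found;
}

// Lazy quantifier at the end of the pattern: each match stops after 1 character.
console.log(allMatches(".+?", "abc"));       // [ 'a', 'b', 'c' ]

// Lazy quantifier followed by ";": each match extends up to the nearest ";".
console.log(allMatches(".+?;", "a=1;b=2;")); // [ 'a=1;', 'b=2;' ]
```

In the second call the trailing ; is what forces .+? to expand beyond a single character.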