Match text outside of HTML tags


Question

Before anyone says it: I know I should use a proper parser, but for my use case a regular expression is the better fit.

I have the following regex, which tries to match text outside of HTML tags:

(?<!<[^>]*)(?<Text>.+?)

However, it seems to match the opening bracket of each tag, i.e. <. How can I fix this?

Example input:

<span style="color:blue">some <strong>bold</strong> text</span>

Expected:

some bold text

Actual:

<some <bold< text<

Link to RegexStorm.

Recommended answer

The problem is that . matches any character, including < and >. Replace it with a negated character class such as [^<>], which matches any character except < and >, and apply a greedy quantifier: * (zero or more occurrences) or + (one or more occurrences). Note that * also allows empty matches, so prefer + if those are unwanted:

(?<!<[^>]*)(?<Text>[^<>]*)

Regex demo
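
For readers who want to try the negated-character-class idea outside a .NET-style engine, here is a minimal sketch in Python. It is not the pattern from the answer: Python's built-in re module rejects variable-width lookbehind such as (?<!<[^>]*), so this sketch swaps in a negative lookahead that rejects any run of text followed by a closing > before the next <. It also uses + rather than * to skip empty matches. The helper name text_outside_tags is hypothetical.

```python
import re
from typing import List

# Assumption: the lookbehind from the answer is replaced by a negative
# lookahead, because Python's `re` only supports fixed-width lookbehind.
# A run of non-bracket characters counts as text only when it is NOT
# followed by a '>' before the next '<' (which would mean it is inside a tag).
TEXT_OUTSIDE_TAGS = re.compile(r"(?P<Text>[^<>]+)(?![^<>]*>)")


def text_outside_tags(html: str) -> List[str]:
    """Return each run of text found between tags (hypothetical helper)."""
    return [m.group("Text") for m in TEXT_OUTSIDE_TAGS.finditer(html)]


sample = '<span style="color:blue">some <strong>bold</strong> text</span>'
pieces = text_outside_tags(sample)
print(pieces)           # ['some ', 'bold', ' text']
print("".join(pieces))  # some bold text
```

Like the original pattern, this assumes well-formed input in which < and > appear only as tag delimiters; a literal > inside an attribute value or in the text would confuse it.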

By the way, putting (?<Text>.+?) at the end of the pattern makes the regex engine match only one character at a time: +? is a lazy quantifier that matches one or more occurrences but as few as possible, and since one is enough, it always stops after a single character. A lazily quantified subpattern normally needs some other pattern after it; otherwise it usually does not capture the intended text.
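
The lazy-quantifier behaviour can be seen in isolation with a short sketch. It uses Python's re module as an assumption (the thread appears to target the .NET engine via RegexStorm), and the sample strings are made up for illustration; lazy and greedy quantifiers behave the same way in this respect.

```python
import re

# A lazy quantifier at the very end of a pattern stops as soon as it can,
# so every match is exactly one character long.
lazy_at_end = re.findall(r".+?", "bold")
print(lazy_at_end)  # ['b', 'o', 'l', 'd']

# With a pattern after it, the lazy quantifier expands only until that
# following pattern can match, which is usually what is intended.
lazy_then_bracket = re.findall(r"<(.+?)>", "<strong>bold</strong>")
print(lazy_then_bracket)  # ['strong', '/strong']

# The greedy form runs to the last possible '>' instead.
greedy_then_bracket = re.findall(r"<(.+)>", "<strong>bold</strong>")
print(greedy_then_bracket)  # ['strong>bold</strong']
```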
