Correctly matching an ending tag with its starting tag in HTML with a RegEx

Question

I'm using VB.Net in an ASP.Net 2.0 app to run some regular expressions that remove some unnecessary markup. One of the things that I'd like to do is remove span elements that don't have any attributes in them:

output = Regex.Replace(output, "<span\s*>(?<Text>.*?)</span>" & styleRegex, "${Text}", RegexOptions.Compiled Or RegexOptions.CultureInvariant Or RegexOptions.IgnoreCase Or RegexOptions.Singleline)

So, for this content:

<span>Lorem <span class="special">ipsum</span> dolor sit amet.</span>

I'd like to remove the outer span elements. Unfortunately, my regex above gives me this as a result, since the closing span matches the first one it comes across:

Lorem <span class="special">ipsum dolor sit amet.</span>
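
The early match can be reproduced in a small console program. This is only a minimal sketch: the module name `NaiveSpanDemo` and the helper `StripSpansNaive` are made up for illustration, and the `styleRegex` suffix from the call above is left out because its definition is not shown in the question.

```vb
Imports System
Imports System.Text.RegularExpressions

Module NaiveSpanDemo
    ' Hypothetical helper: applies the pattern from the question,
    ' without the styleRegex suffix (not defined in the question).
    Function StripSpansNaive(ByVal input As String) As String
        Dim options As RegexOptions = RegexOptions.CultureInvariant Or _
                                      RegexOptions.IgnoreCase Or _
                                      RegexOptions.Singleline
        Return Regex.Replace(input, "<span\s*>(?<Text>.*?)</span>", "${Text}", options)
    End Function

    Sub Main()
        Dim input As String = _
            "<span>Lorem <span class=""special"">ipsum</span> dolor sit amet.</span>"

        ' The lazy .*? stops at the first </span> it can reach,
        ' which is the closing tag of the inner span.
        Console.WriteLine(StripSpansNaive(input))
        ' Prints: Lorem <span class="special">ipsum dolor sit amet.</span>
    End Sub
End Module
```

The match covers `<span>Lorem <span class="special">ipsum</span>`, so the outer opening tag and the inner closing tag are the ones that get removed.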

Is this possible with a RegEx or will I have to implement something a bit more advanced?

Recommended answer

XSLT isn't an option since the input may not always be valid XML, and the HTML Agility Pack on CodePlex looks pretty sweet but is really overkill in this case. Here's the final RegEx I ended up using:

<span\s*>(?<Text>.*?(?:<span[^>]*>.*?</span>.*?)*)</span>

Replacing that with ${Text} effectively stripped the useless outer span tags in all cases I've tested.
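
For reference, here is a minimal sketch of that replacement applied to the sample from the question. The module name `OuterSpanDemo` and the helper `StripOuterSpan` are hypothetical, and the `RegexOptions` are assumed to be the same ones used in the question's call, since the answer does not state them.

```vb
Imports System
Imports System.Text.RegularExpressions

Module OuterSpanDemo
    ' Pattern quoted from the answer: an attribute-less <span>, then text that
    ' may contain complete inner <span ...>...</span> pairs, then </span>.
    Private Const OuterSpanPattern As String = _
        "<span\s*>(?<Text>.*?(?:<span[^>]*>.*?</span>.*?)*)</span>"

    ' Hypothetical helper; the options are assumed to match the question's call.
    Function StripOuterSpan(ByVal input As String) As String
        Dim options As RegexOptions = RegexOptions.CultureInvariant Or _
                                      RegexOptions.IgnoreCase Or _
                                      RegexOptions.Singleline
        Return Regex.Replace(input, OuterSpanPattern, "${Text}", options)
    End Function

    Sub Main()
        Dim input As String = _
            "<span>Lorem <span class=""special"">ipsum</span> dolor sit amet.</span>"

        ' The inner span is consumed as a whole pair, so the final </span>
        ' in the pattern lines up with the outer closing tag.
        Console.WriteLine(StripOuterSpan(input))
        ' Prints: Lorem <span class="special">ipsum</span> dolor sit amet.
    End Sub
End Module
```

Note that the optional group only describes inner spans that do not contain further spans themselves, so the pattern is written for one level of nesting. Markup nested more deeply than that is outside what it accounts for.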
