Remove all empty HTML tags?

Problem description

I am imagining a function which I figure would use Regex, and it would be recursive for instances like <p><strong></strong></p> to remove all empty HTML tags within a string. This would have to account for whitespace too, if possible. There would be no crazy instances where the < character was being used in an attribute value.

I am pretty terrible at regex but I imagine this is possible. How can you do it?

Here is the method I have so far:

Public Shared Function stripEmptyHtmlTags(ByVal html As String) As String
    Dim newHtml As String = Regex.Replace(html, "/(<.+?>\s*</.+?>)/Usi", "")

    If html <> newHtml Then
        newHtml = stripEmptyHtmlTags(newHtml)
    End If

    Return newHtml
End Function

However my current Regex is in PHP format, and it doesn't seem to be working. I am not familiar with .NET regex syntax.

To all those saying don't use regex: I am curious what the pattern would be regardless. Surely there is a pattern which could match all opening/closing start tags with any amount of white space (or none) in between the tags? I've seen regex that matches HTML tags with any number of attributes, one empty tag (such as just <p></p>) etc.

So far I have tried the following regex patterns in the above method to no avail (as in, I have a text string with empty paragraph tags that didn't even get removed):

Regex.Replace(html, "/(<.+?>\s*</.+?>)/Usi", "")

Regex.Replace(html, "(<.+?>\s*</.+?>)", "")

Regex.Replace(html, "%<(\w+)\b[^>]*>\s*</\1\s*>%", "")

Regex.Replace(html, "<\w+\s*>\s*</\1\s*>", "")

Recommended answer

First, note that empty HTML elements are, by definition, not nested.

Update: The solution below now applies the empty element regex recursively to remove "nested-empty-element" structures such as: <p><strong></strong></p> (subject to the caveats stated below).

This works pretty well (see caveats below) for HTML having no start tag attributes containing <> funny stuff, in the form of an (untested) VB.NET snippet:

Dim RegexObj As New Regex("<(\w+)\b[^>]*>\s*</\1\s*>")
Do While RegexObj.IsMatch(html)
    html = RegexObj.Replace(html, "")
Loop

Enhanced version

<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>

Here is the uncommented enhanced version in VB.NET (untested):

Dim RegexObj As New Regex("<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:""[^""]*""|'[^']*'|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>")
Do While RegexObj.IsMatch(html)
    html = RegexObj.Replace(html, "")
Loop

This more complex regex correctly matches a valid empty HTML 4.01 element even if it has angle brackets in its attribute values (subject once again, to the caveats below). In other words, this regex correctly handles all start tag attribute values which are quoted (which can have <>), unquoted (which can't) and empty. Here is a fully commented (and tested) PHP version:

function strip_empty_tags($text) {
    // Match empty elements (attribute values may have angle brackets).
    $re = '%
        # Regex to match an empty HTML 4.01 Transitional element.
        <                    # Opening tag opening "<" delimiter.
        (\w+)\b              # $1 Tag name.
        (?:                  # Non-capture group for optional attribute(s).
          \s+                # Attributes must be separated by whitespace.
          [\w\-.:]+          # Attribute name is required for attr=value pair.
          (?:                # Non-capture group for optional attribute value.
            \s*=\s*          # Name and value separated by "=" and optional ws.
            (?:              # Non-capture group for attrib value alternatives.
              "[^"]*"        # Double quoted string.
            | \'[^\']*\'     # Single quoted string.
            | [\w\-.:]+      # Non-quoted attrib value can be A-Z0-9-._:
            )                # End of attribute value alternatives.
          )?                 # Attribute value is optional.
        )*                   # Allow zero or more attribute=value pairs
        \s*                  # Whitespace is allowed before closing delimiter.
        >                    # Opening tag closing ">" delimiter.
        \s*                  # Content is zero or more whitespace.
        </\1\s*>             # Element closing tag.
        %x';
    while (preg_match($re, $text)) {
        // Repeatedly remove innermost empty elements until none remain.
        $text = preg_replace($re, '', $text);
    }
    // Hand the cleaned text back to the caller.
    return $text;
}
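
To see why the pattern has to be applied in a loop, here is a minimal sketch using a compact, uncommented form of the same pattern. The helper name strip_empty_tags_compact and the sample input are made up for illustration. Each pass can only remove the innermost empty elements, so a wrapper such as <p> becomes removable only on the following pass.

```php
<?php
// Hypothetical helper for illustration; same pattern as the commented version,
// written on one line without the x flag.
function strip_empty_tags_compact(string $text): string {
    $re = '%<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|\'[^\']*\'|[\w\-.:]+))?)*\s*>\s*</\1\s*>%';
    do {
        // $count receives the number of replacements made in this pass.
        $result = preg_replace($re, '', $text, -1, $count);
        if ($result === null) {
            break; // PCRE error: keep the text as it stands
        }
        $text = $result;
    } while ($count > 0);
    return $text;
}

// Pass 1 removes <strong> </strong>, pass 2 removes the now-empty <p ...></p>.
echo strip_empty_tags_compact('<div><p class="a > b"><strong> </strong></p>kept</div>'), "\n";
// prints: <div>kept</div>
```

Using the replacement count as the loop condition avoids running the regex twice per pass (once to test, once to replace).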

Caveats: This function does not parse HTML. It simply matches and removes any text pattern sequence corresponding to a valid empty HTML 4.01 element (which, by definition, is not nested). Note that this also erroneously matches and removes the same text pattern which may occur outside normal HTML markup, such as within SCRIPT and STYLE tags, HTML comments, and the attributes of other start tags. This regex does not work with short tags. To any bobenc fan about to give this answer an automatic down vote, please show me one valid HTML 4.01 empty element that this regex fails to correctly match. This regex follows the W3C spec and really does work.

Update: This regex solution also does not work (and will erroneously remove valid markup) if you do something insanely unlikely (but perfectly valid) like this:

<div att="<p att='">stuff</div><div att="'></p>'">stuff</div>

On second thought, just use an HTML parser!
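
For comparison, here is a rough sketch of what the parser route could look like in PHP with the bundled DOM extension. The function name, the wrapping <div>, and the list of void elements to keep are all assumptions made for this example, not something from the answer above. It also does not mirror the regex exactly: the parser may normalise or re-serialise markup, so treat it as a starting point rather than a drop-in replacement.

```php
<?php
// Sketch only: the function name, wrapper element, and void-element list are
// assumptions made for this example.
function strip_empty_elements_dom(string $html): string {
    // Elements that are legitimately empty and must be kept.
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'param', 'source', 'track', 'wbr'];

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    // Wrap the fragment so its nodes can be found again after parsing.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $xpath = new DOMXPath($doc);
    // Leaf elements holding only whitespace (no child elements, no comments).
    $query = './/*[not(*) and not(comment()) and normalize-space(.) = ""]';

    do {
        $removed = 0;
        foreach (iterator_to_array($xpath->query($query, $wrapper)) as $node) {
            if (!in_array(strtolower($node->nodeName), $void, true)) {
                $node->parentNode->removeChild($node);
                $removed++;
            }
        }
    } while ($removed > 0); // a parent can become empty once its child is gone

    // Serialise the wrapper's children, not the wrapper itself.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo strip_empty_elements_dom('<div><p class="x"><strong> </strong></p>kept<br></div>'), "\n";
```

Because the input is actually parsed, angle brackets inside attribute values, comments, and script content are not mistaken for tags, which is exactly where the regex approach breaks down.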
