Regex to split HTML tags


Question

I have an HTML string like so:

<img src="http://foo"><img src="http://bar">

What regex pattern would split this into two separate img tags?

Recommended answer

How sure are you that your string is exactly that? What about input like this:

<img alt=">"          src="http://foo"  >
<img src='http://bar' alt='<'           >

What programming language is this? Is there some reason you're not using a standard HTML-parsing class to handle this? Regexes are a good approach only when you have an extremely well-known set of inputs. They don't work for real HTML, only for rigged demos.
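To make the problem with the tricky input concrete, here is a minimal Perl sketch that compares a naive pattern with a quote-aware one. Both patterns and the variable names `@naive` and `@aware` are illustrative assumptions, not part of the original answer, and the quote-aware pattern is still far from a full HTML tokenizer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The tricky input: a quoted ">" and a quoted "<" inside attribute values.
my $html = <<'END';
<img alt=">"          src="http://foo"  >
<img src='http://bar' alt='<'           >
END

# Naive pattern (assumption for illustration): "<img", then any run of
# characters that are not ">", then ">".
my @naive = $html =~ m{ <img \b [^>]* > }gix;

# Quote-aware pattern (also an assumption): quoted strings are consumed
# whole, so a ">" inside quotes no longer ends the tag.
my @aware = $html =~ m{ <img \b (?: "[^"]*" | '[^']*' | [^>"'] )* > }gix;

print "naive: $_\n" for @naive;
print "aware: $_\n" for @aware;
```

The naive pattern cuts the first tag short at the quoted `>` and returns `<img alt=">`, while the quote-aware pattern returns both tags whole. Every further edge case (comments, CDATA, script bodies) needs yet another special rule, which is the argument for a real parser.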

Even if you must use a regex, you should use a proper grammatical one. This is quite easy. I've tested the following little program on a zillion web pages. It takes care of the cases I outline above, and one or two others, too.

#!/usr/bin/perl
use 5.10.0;
use strict;
use warnings;

my $img_rx = qr{

    # save capture in $+{TAG} variable
    (?<TAG> (?&image_tag) )

    # remainder is pure declaration
    (?(DEFINE)

        (?<image_tag>
            (?&start_tag)
            (?&might_white) 
            (?&attributes) 
            (?&might_white) 
            (?&end_tag)
        )

        (?<attributes>
            (?: 
                (?&might_white) 
                (?&one_attribute) 
            ) *
        )

        (?<one_attribute>
            \b
            (?&legal_attribute)
            (?&might_white) = (?&might_white) 
            (?:
                (?&quoted_value)
              | (?&unquoted_value)
            )
        )

        (?<legal_attribute> 
            (?: (?&required_attribute)
              | (?&optional_attribute)
              | (?&standard_attribute)
              | (?&event_attribute)
              # for LEGAL parse only, comment out next line 
              | (?&illegal_attribute)
            )
        )

        (?<illegal_attribute> \b \w+ \b )

        (?<required_attribute>
            alt
          | src
        )

        (?<optional_attribute>
            (?&permitted_attribute)
          | (?&deprecated_attribute)
        )

        # NB: The white space in string literals 
        #     below DOES NOT COUNT!   It's just 
        #     there for legibility.

        (?<permitted_attribute>
            height
          | is map
          | long desc
          | use map
          | width
        )

        (?<deprecated_attribute>
             align
           | border
           | hspace
           | vspace
        )

        (?<standard_attribute>
            class
          | dir
          | id
          | style
          | title
          | xml:lang
        )

        (?<event_attribute>
            on abort
          | on click
          | on dbl click
          | on mouse down
          | on mouse out
          | on key down
          | on key press
          | on key up
        )

        (?<unquoted_value> 
            (?&unwhite_chunk) 
        )

        (?<quoted_value>
            (?<quote>   ["']      )
            (?: (?! \k<quote> ) . ) *
            \k<quote> 
        )

        (?<unwhite_chunk>   
            (?:
                # (?! [<>'"] ) 
                (?! > ) 
                \S
            ) +   
        )

        (?<might_white>     \s *   )

        (?<start_tag>  
            < (?&might_white) 
            img 
            \b       
        )

        (?<end_tag>          
            (?&html_end_tag)
          | (?&xhtml_end_tag)
        )

        (?<html_end_tag>       >  )
        (?<xhtml_end_tag>    / >  )

    )

}six;

$/ = undef;
$_ = <>;   # read all input

# strip stuff we aren't supposed to look at
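s{ <!    DOCTYPE  .*?         > }{}isx;   # DOCTYPE is case-insensitive in HTML
s{ <! \[ CDATA \[ .*?    \]\] > }{}gsx; 

# also match <script ...> tags that carry attributes
s{ <script \b [^>]* > .*?  </script \s* > }{}gsix; 
s{ <!--     .*?        --> }{}gsx;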

my $count = 0;

while (/$img_rx/g) {
    printf "Match %d at %d: %s\n", 
            ++$count, pos(), $+{TAG};
} 

There you go. Nothing to it!

Gee, why would you ever want to use an HTML-parsing class, given how easily HTML can be dealt with in a regex? ☺
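For readers new to the `(?(DEFINE)...)` construct, a much-reduced sketch of the same grammar style, applied to the string from the question, might look like this. The rule names `tag`, `attr`, and `quoted` and the variable `$mini_rx` are hypothetical, and this version handles only quoted attribute values, so it covers far fewer cases than the full program.

```perl
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;

# Hypothetical reduced grammar: an img tag is "<img", zero or more
# attributes with quoted values, optional whitespace, and ">" or "/>".
my $mini_rx = qr{
    # save capture in $+{TAG}
    (?<TAG> (?&tag) )

    # remainder is pure declaration
    (?(DEFINE)
        (?<tag>    < img \b (?&attr)* \s* /? > )
        (?<attr>   \s+ [\w:-]+ \s* = \s* (?&quoted) )
        (?<quoted> " [^"]* " | ' [^']* ' )
    )
}six;

# The input from the question.
my $html = '<img src="http://foo"><img src="http://bar">';

# Collect each complete tag in order of appearance.
my @tags;
while ($html =~ /$mini_rx/g) {
    push @tags, $+{TAG};
}

say for @tags;
```

Run on the question's input, this collects the two img tags as separate elements of `@tags`. Unquoted values, attribute whitelists, and stripping of comments and scripts are what the full program adds.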
