Using a regular expression to match a div block having a specific ID


Problem description

I'm trying to match a div block that has a particular id. Here's my regex:

<div\s+[^>]*\s*id\s*=\s*["|']content["|']\s*>[^/div]+

I want the regex to match the whole div block. So I put [^/div]+ in my regex, assuming it would match the remaining characters until it reached the end of the div. It stops short, though, because the [^...] expression treats /, d, i and v as individual characters to exclude rather than as a sequence. What I want is for the whole thing to be treated as a unit. Putting in a [^()] doesn't help either.

So please tell me how I should code this.

<div id="content">
    <noscript></noscript>
    <a href="blabla.com">
    <h1>
       <a href="blablac.com">Blablabla</a>
    </h1>
</div>

Solution

DISCLAIMER: First, I agree that, in general, regex is not the best tool for parsing HTML. However, in the right hands (and with a few caveats), Philip Hazel's powerful (and most assuredly non-REGULAR) PCRE library, used by PHP's preg_*() family of functions, does allow solving non-trivial data scraping problems such as this one (with some limitations and caveats - see below). The problem stated above is particularly complex to solve using regex alone, and regex solutions such as the one presented below are not for everyone and should never be attempted by a regex novice. Properly understanding the answer below requires a fairly deep comprehension of several advanced regex constructs and techniques.

Won't someone please think of the Children! Yes, I have read bobince's legendary answer and I know this is a touchy subject around here (to say the least). But please, if you are tempted to immediately click the down-vote arrow because I am '/(?:actual|brave|stupid)ly/' using the words REGEX and HTML in the same breath (and on a non-trivial problem, no less), I would humbly ask you to refrain long enough to read this entire post and to actually try this solution out for yourself.

With that in mind, if you would like to see how an advanced regex can be crafted to solve this problem (for all but a few unlikely special cases - see below for examples), read on...

AN ADVANCED RECURSIVE REGEX SOLUTION: As Wes Hardaker correctly points out, DIVs can be (and frequently are) nested. However, he is not 100% correct when he says "you can't construct one that will match up until the correct </div>". The truth is, with PHP, you can! (with some limitations - see below). Like Perl (5.10 and later), the PCRE regex engine in PHP provides recursive expressions (i.e. (?R), (?1), (?2), etc.) which allow matching nested structures to an arbitrary depth (limited in practice by memory and by PCRE's configured backtrack and recursion limits). .NET has no (?R) construct; it handles nesting with balancing groups instead. For example, you can easily match balanced nested parentheses with this expression: '/\((?:[^()]++|(?R))*+\)/'. Run this simple test if you have any doubts:

$text = 'zero(one(two)one(two(three)two)one)zero';
if (preg_match('/\((?:[^()]++|(?R))*+\)/', $text, $matches)) {
    print_r($matches);
}
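A variation on the test above, offered as a sketch rather than as part of the original answer: when only the outermost pair has to carry a prefix, (?R) is the wrong tool, because it re-runs the whole pattern (prefix included) at every nesting level. Recursing into a numbered group with (?1) re-runs only that group. The "fn" prefix and the sample string are made up for illustration.

```php
<?php
// Toy case: only the OUTERMOST parentheses must be preceded by "fn".
$text = 'x(fn(a(b)c)d)';

// (?R) recurses into the whole pattern, so the nested "(b)" would also
// have to be preceded by "fn". It is not, so this pattern cannot match.
$whole = '/\bfn\((?:[^()]++|(?R))*+\)/';

// (?1) recurses into capture group 1 only (the parenthesised part),
// so nested pairs are matched without requiring the prefix.
$group = '/\bfn(\((?:[^()]++|(?1))*+\))/';

$wholeFound = preg_match($whole, $text, $m1);
$groupFound = preg_match($group, $text, $m2);

echo "whole-pattern recursion matched: ", $wholeFound, "\n"; // 0
echo "group recursion matched: ", $m2[0], "\n";              // fn(a(b)c)
```

The DIV problem has the same shape: the outer start tag has an extra requirement that the nested start tags do not share.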

So if we can all agree that a PHP regex can, indeed, match nested structures, let's move on to the problem at hand. This particular problem is complicated by the fact that the outermost DIV must have the id="content" attribute, but any nested DIVs may or may not. Thus, we can't use the (?R) recursively-match-the-whole-expression construct, because the subexpression that matches the outer DIV is not the same as the one needed to match the inner DIVs. Instead, we need a capture group (here, group 2) that serves as a "recursive subroutine" and matches the inner, nested DIVs. So here is a tested PHP code snippet, sporting an advanced not-for-the-faint-of-heart-but-fully-commented-so-that-you-might-actually-be-able-to-make-some-sense-out-of-it regex, which correctly matches (in most cases - see below) a DIV having id="content", which may itself contain nested DIVs:

$re = '% # Match a DIV element having id="content".
    <div\b             # Start of outer DIV start tag.
    [^>]*?             # Lazily match up to id attrib.
    \bid\s*+=\s*+      # id attribute name and =
    ([\'"]?+)          # $1: Optional quote delimiter.
    \bcontent\b        # specific ID to be matched.
    (?(1)\1)           # If open quote, match same closing quote
    [^>]*+>            # remaining outer DIV start tag.
    (                  # $2: DIV contents. (may be called recursively!)
      (?:              # Non-capture group for DIV contents alternatives.
      # DIV contents option 1: All non-DIV, non-comment stuff...
        [^<]++         # One or more non-tag, non-comment characters.
      # DIV contents option 2: Start of a non-DIV tag...
      | <              # Match a "<", but only if it
        (?!            # is not the beginning of either
          /?div\b      # a DIV start or end tag,
        | !--          # or an HTML comment.
        )              # Ok, that < was not a DIV or comment.
      # DIV contents Option 3: an HTML comment.
      | <!--.*?-->     # A non-SGML compliant HTML comment.
      # DIV contents Option 4: a nested DIV element!
      | <div\b[^>]*+>  # Inner DIV element start tag.
        (?2)           # Recurse group 2 as a nested subroutine.
        </div\s*>      # Inner DIV element end tag.
      )*+              # Zero or more of these contents alternatives.
    )                  # End $2: DIV contents.
    </div\s*>          # Outer DIV end tag.
    %isx';
if (preg_match($re, $text, $matches)) {
    printf("Match found:\n%s\n", $matches[0]);
}

As I said, this regex is quite complex, but rest assured, it does work, with the exception of some unlikely cases noted below (and probably a few more, which I would be very grateful to hear about if you find them). Try it out and see for yourself!

Should I use this? Would it be appropriate to use this regex solution in a production environment where hundreds or thousands of documents must be parsed with 100% reliability and accuracy? Of course not. Could it be useful for a limited one-time run over some HTML files (e.g. possibly for the person who asked this question)? Possibly. It depends on how comfortable one is with advanced regexes. If the regex above looks like it was written in a foreign language (it is), and/or scares the dickens out of you, the answer is probably no.
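For the production case ruled out above, the usual alternative is an HTML parser. The following is a minimal sketch of that swapped-in technique, not part of the regex answer. It assumes PHP's DOM extension (ext-dom) is enabled, and the sample markup is made up for illustration.

```php
<?php
// Sketch only: assumes the DOM extension (ext-dom) is available.
$html = <<<'HTML'
<html><body>
<div id="outer">
  <div id="content">
    <h1><a href="blablac.com">Blablabla</a></h1>
    <div class="nested"><p>stuff</p></div>
  </div>
</div>
</body></html>
HTML;

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select the DIV by its id attribute; nesting is handled by the parser.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@id="content"]');

$block = null;
if ($nodes !== false && $nodes->length > 0) {
    // saveHTML($node) serialises just that element, nested DIVs included.
    $block = $doc->saveHTML($nodes->item(0));
    echo $block, "\n";
}
```

Note that the parser re-serialises the markup, so the output may differ from the input in whitespace or quoting, whereas the regex returns the original bytes unchanged.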

Does it work? Yes. For example, given the following test data, the regex above correctly picks out the DIV having id="content" (or id='content' or id=content, for that matter):

<!DOCTYPE HTML SYSTEM>
<html>
<head><title>Test Page</title></head>
<body>
<div id="non-content-div">
    <h1>PCRE does recursion!</h1>
    <div id='content'>
        <h2>First level matched</h2>
        <!-- this comment </div> is tricky -->
        <div id="one-deep">
            <h3>Second level matched</h3>
            <div id=two-deep>
                <h4>Third level matched</h4>
                <div id=three-deep>
                    <h4>Fourth level matched</h4>
                </div>
                <p>stuff</p>
            </div>
            <!-- this comment <div> is tricky -->
            <p>stuff</p>
        </div>
        <p>stuff</p>
    </div>
    <p>stuff</p>
</div>
<p>stuff</p>
</body></html>

CAVEATS: So what are some scenarios where this solution does not work? Well, DIV start tags may NOT have any angle brackets in any of their attributes (it is possible to remove this limitation, but this adds quite a bit more to the code). And the following CDATA spans, which contain the specific DIV start tag we are looking for (highly unlikely), will cause the regex to fail:

<style type="text/css">
p:before {
    content: 'Unlikely CSS string with <div id=content> in it.';
}
</style>
<p title="Unlikely attribute with a <div id=content> in it">stuff</p>
<script type="text/javascript">
    alert("evil script with <div id=content> in it");
</script>
<!-- Comment with <div id="content"> in it -->
<![CDATA[ a CDATA section with <div id="content"> in it ]]>

I would very much like to know of any others.
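To make one of these caveats concrete, here is a small sketch (the sample markup is made up) in which a comment containing a look-alike start tag sits inside a wrapper DIV. The pattern is the one from the answer with its inline comments trimmed.

```php
<?php
// Same pattern as in the answer, with the inline comments condensed.
$re = '%
    <div\b [^>]*? \bid\s*+=\s*+ ([\'"]?+) \bcontent\b (?(1)\1) [^>]*+>
    (                                  # $2: DIV contents
      (?: [^<]++                       # plain text
      | < (?! /?div\b | !-- )          # start of a non-DIV, non-comment tag
      | <!--.*?-->                     # HTML comment
      | <div\b[^>]*+> (?2) </div\s*>   # nested DIV
      )*+
    )
    </div\s*>
    %isx';

// The commented-out start tag comes BEFORE the real one, inside a wrapper.
$html = '<div id="wrap"><!-- <div id="content"> -->'
      . '<div id="content"><p>real</p></div></div>';

$found = preg_match($re, $html, $m);

// The match begins at the start tag inside the comment, treats the real
// DIV as a nested one, and ends at the wrapper closing tag.
echo $m[0], "\n";
```

The regex scans left to right and has no notion that it is inside a comment when it looks for the outer start tag, so the first look-alike tag that can be balanced wins.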

GO READ MRE3 As I said before, to truly grasp what is going on here requires a pretty deep understanding of several advanced techniques. These techniques are not obvious or intuitive. There is only one way that I know of to gain these skills and that is to sit down and study: Mastering Regular Expressions (3rd Edition) by Jeffrey Friedl (MRE3). (You will be glad you did!)

I can honestly say that this is the most useful book I have read in my entire life!

Cheers!

EDIT 2013-04-30 Fixed Regex. It previously disallowed a non-DIV tag which immediately followed the DIV start tag.
