How can I extract certain HTML tags, e.g. <ul>, using regex with preg_match_all in PHP?


Problem description


    I am new to regular expressions. I want to fetch some data from a web page source. I used file_get_contents("url") to get the page's HTML source. Now I want to capture a portion within some special tags.

    I found preg_match_all() works for this. Now I want some help to solve my problem and if possible help me to find out how to solve similar problems like this.

    In the example below, how would I get the data within the <ul>? (I hope this sample HTML makes the question easier to understand.)

    <div class="a_a">qqqqq<span>www</span> </div>
    <ul>
    <li>
        <div class="a_a"><h3>aaaa</h3> aaaa aaaaa</div>
    </li>
    <li>
        <div class="b_b">bbbbb <span class="s-s">bbbb</span> bbbb</div>
    </li>
    <li>
        <div class="c_c d-d">cccc cccc ccccc</div>
    </li>
    </ul>
    <table>
    <tr>
        <td>sdsdf</td>
        <td>hjhjhj</td>
    </tr>
    <tr>
        <td>yuyuy</td>
        <td>ertre</td>
    </tr>   
    </table>
    

Solution

    As the comments already stated, parsing HTML with regex is generally not recommended. In my opinion, it depends on what exactly you're going to do.


    If you want to use regex and know that there are no nested tags of the same kind, the simplest pattern for getting everything between <ul> and the closest </ul> would be:

    $pattern = '~<ul>(.*?)</ul>~s';
    

    It matches <ul> followed by as few characters of any kind as possible until </ul> is met. The dot is a metacharacter that matches any single character except newlines (\n). To make it match newlines too, I put the s modifier after the ending delimiter ~. The quantifier * means zero or more times.

    By default quantifiers are greedy, which means they eat up as much as possible. A question mark ? after the * makes the quantifier non-greedy (or lazy), so it matches as few characters as possible before reaching </ul>. As the pattern delimiter I chose the tilde ~.
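
    To make the difference between greedy and lazy matching concrete, here is a minimal sketch. The input string is a made-up example, not the markup from the question, and the variable names are only illustrative.

```php
<?php
// Hypothetical sample input: two separate lists, each spanning several lines.
$html = "<ul>\n<li>one</li>\n</ul>\n<p>gap</p>\n<ul>\n<li>two</li>\n</ul>";

// Greedy: .* runs to the LAST </ul>, so both lists end up in one match.
$greedyCount = preg_match_all('~<ul>(.*)</ul>~s', $html, $greedy);

// Lazy: .*? stops at the CLOSEST </ul>, giving one match per list.
$lazyCount = preg_match_all('~<ul>(.*?)</ul>~s', $html, $lazy);

// Without the s modifier the dot cannot cross "\n", so nothing matches here.
$noDotallCount = preg_match_all('~<ul>(.*?)</ul>~', $html, $noDotall);

printf("greedy: %d, lazy: %d, without s: %d\n", $greedyCount, $lazyCount, $noDotallCount);
// prints "greedy: 1, lazy: 2, without s: 0"
```

    The return value of preg_match_all is the number of full-pattern matches, which makes it a convenient way to compare the three variants.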

    preg_match_all($pattern, $html, $out);
    

    Matches are captured and can be found in the output variable that you pass to preg_match or preg_match_all, where [0] contains everything that matches the whole pattern, [1] the first captured parenthesized subpattern, and so on.
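
    As a rough illustration of how the output array is laid out, the sketch below runs the simple pattern against a shortened version of the question's markup. The second regex for the <li> items is an addition for illustration and assumes the list items carry no attributes and are not nested.

```php
<?php
// Shortened version of the question's sample markup.
$html = implode("\n", [
    '<div class="a_a">qqqqq<span>www</span> </div>',
    '<ul>',
    '<li><div class="a_a"><h3>aaaa</h3> aaaa aaaaa</div></li>',
    '<li><div class="c_c d-d">cccc cccc ccccc</div></li>',
    '</ul>',
    '<table><tr><td>sdsdf</td></tr></table>',
]);

$pattern = '~<ul>(.*?)</ul>~s';
$items = [[], []];

if (preg_match_all($pattern, $html, $out)) {
    // $out[0][n]: the n-th full match, including <ul> and </ul>.
    // $out[1][n]: what the first (...) group captured, i.e. the inner markup.
    $inner = trim($out[1][0]);

    // A second, equally simple regex pulls the list items out of the inner markup.
    preg_match_all('~<li>(.*?)</li>~s', $inner, $items);

    // Escape the markup so it is displayed as text rather than rendered.
    print_r(array_map('htmlspecialchars', $items[1]));
}
```

    Working in two passes (first the list, then its items) keeps each pattern short and easy to check.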


    If the tag you are searching for can contain attributes (e.g. <ul class="my_list"...), this extended pattern adds [^>]* after <ul, which matches any number of characters that are not > before the closing > is met:

    $pattern = '~<ul[^>]*>\K.*(?=</ul>)~Uis';
    

    Instead of the question mark, here I use the U modifier to make all quantifiers lazy. To get only the desired part, i.e. what is inside <ul>...</ul>, \K is used to reset the beginning of the reported match. Instead of capturing the ending </ul>, a lookahead (?= is used, as we don't want that part in the output either.

    This is basically the same as '~<ul[^>]*>(.*)</ul>~Uis' which would capture whole-pattern matches to [0] and first parenthesized group to [1].
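
    The equivalence of the two variants can be checked with a small sketch. The input string, with its attributes and upper-case closing tag, is a made-up example.

```php
<?php
// Hypothetical input: a list with attributes and an upper-case closing tag.
$html = '<ul class="my_list" id="nav"><li>home</li><li>about</li></UL>';

// The attribute-less pattern from before does not match this opening tag.
$plainCount = preg_match_all('~<ul>(.*?)</ul>~s', $html, $plain);

// Variant 1: \K drops the opening tag from the reported match and the
// lookahead leaves </ul> unconsumed, so $a[0] holds only the inner markup.
preg_match_all('~<ul[^>]*>\K.*(?=</ul>)~Uis', $html, $a);

// Variant 2: the same match expressed with a capturing group.
preg_match_all('~<ul[^>]*>(.*)</ul>~Uis', $html, $b);

var_dump($plainCount);              // int(0)
var_dump($a[0][0] === $b[1][0]);    // bool(true)
echo $a[0][0], "\n";                // <li>home</li><li>about</li>
```

    Which variant to prefer is mostly a matter of taste: \K gives a flat result in [0], while the capturing group keeps the full tag available in [0] as well.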


    But if your HTML contains nested tags of the same kind, the following pattern aims to catch only the innermost ones: at each character inside <ul>...</ul> it checks that no opening <ul starts there.

    $pattern = '~<ul[^>]*>\K(?:(?!<ul).)*(?=</ul>)~Uis';
    

    Get matches using preg_match_all

    $html = '<div><ul><li><ul><li>.1.</li></ul>...</li></ul></div>
             <ul><li>.2.</li></ul>';
    
    if (preg_match_all($pattern, $html, $out)) {
        // Escape the matches so the markup is shown as text in the browser.
        echo "<pre>";
        print_r(array_map('htmlspecialchars', $out[0]));
        echo "</pre>";
    } else {
        echo "FAIL";
    }
    

    The text matched between \K and (?= ends up in $out[0].

    • \K resets the beginning of the reported match (supported in PHP since 5.2.4)
    • the nested-tag pattern, once <ul[^>]*> has matched, checks with a negative lookahead (?!<ul) at each character that no opening <ul starts there; if one does, the attempt fails and matching starts over at the next <ul, until </ul> is ahead (?=</ul>)
    • [^>]* matches any number of characters that are not > (negated character class)
    • (?: starts a non-capturing group
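
    To see why the extra lookahead matters, this sketch runs the attribute pattern and the nested-tag pattern against the same nested input. The input string is a made-up example.

```php
<?php
// Hypothetical nested input: an outer list containing an inner list.
$html = '<ul><li>outer<ul><li>inner</li></ul></li></ul>';

// Lazy dot only: the match starts at the OUTER <ul> but stops at the first
// </ul>, which closes the INNER list, so the result is unbalanced.
preg_match_all('~<ul[^>]*>\K.*(?=</ul>)~Uis', $html, $naive);

// With (?!<ul) in front of the dot, the match cannot step onto another
// opening <ul, so only the innermost list can match.
preg_match_all('~<ul[^>]*>\K(?:(?!<ul).)*(?=</ul>)~Uis', $html, $innermost);

echo $naive[0][0], "\n";      // <li>outer<ul><li>inner</li>
echo $innermost[0][0], "\n";  // <li>inner</li>
```

    Note that the nested-tag pattern only returns the innermost lists; it does not balance tags, so outer lists are skipped rather than matched as a whole.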

    Used Modifiers: Uis (part after the ending delimiter ~)

    U (PCRE_UNGREEDY), i (PCRE_CASELESS), s (PCRE_DOTALL)
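
    The effect of the i and U modifiers can be seen in isolation with a short sketch. The input is a made-up example with upper-case tags. One detail worth knowing is that under U the meaning of ? after a quantifier is inverted.

```php
<?php
// Hypothetical input: two lists written with upper-case tags.
$html = "<UL>\n<li>x</li>\n</UL><UL>\n<li>y</li>\n</UL>";

// s only: the pattern is case-sensitive, so <ul> never matches <UL>.
$caseSensitive = preg_match_all('~<ul>.*</ul>~s', $html);

// i + s: tags match regardless of case; .* is greedy, so both lists form one match.
$greedy = preg_match_all('~<ul>.*</ul>~is', $html);

// U + i + s: .* becomes lazy, giving one match per list.
$lazy = preg_match_all('~<ul>.*</ul>~Uis', $html);

// Under U the ? flips the quantifier back: .*? is greedy again.
$inverted = preg_match_all('~<ul>.*?</ul>~Uis', $html);

printf("%d %d %d %d\n", $caseSensitive, $greedy, $lazy, $inverted);
// prints "0 1 2 1"
```

    Because of that inversion, mixing U with explicit ? quantifiers in one pattern is easy to get wrong; picking one style per pattern avoids surprises.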
