Getting the inner text of HTML tags using regular expressions


Question

I'm having trouble capturing this data:

              <tr>
                <td><span class="bodytext"><b>Contact:</b><b></b></span><span style='font-size:10.0pt;font-family:Verdana;
  mso-bidi-font-family:Arial'><b> </b> 
                      <span class="bodytext">John Doe</span> 
                     </span></td>
              </tr>
              <tr>
                <td><span class="bodytext">PO Box 2112</span></td>
              </tr>
              <tr>
                <td><span class="bodytext"></span></td>
              </tr>

              <!--*********************************************************


              -->
              <tr>
                <td><span class="bodytext"></span></td>
              </tr>



              <tr>
                <td><span class="bodytext">JOHAN</span> NSW 9700</td>
              </tr>
              <tr>
                <td><strong>Phone:</strong> 
                02 9999 9999
                    </td>
              </tr>

Basically, I want to grab everything after "Contact:" and before "Phone:", minus the HTML. However, these two labels may not always exist, so what I really need is everything between the two colons (:) that isn't inside an HTML tag. The number of <span class="bodytext">***data***</span> elements may vary, so I need some sort of loop to match them.

I'd prefer to use regular expressions, although I could probably do this with loops and string matching.

Also, I'd like to know the syntax for non-capturing ("non-matching") groups in PHP regex.

Any help would be greatly appreciated!

Recommended answer

If I understand you correctly, you're only interested in the text between the HTML tags. To ignore the HTML tags, simply strip them first:

$text = preg_replace('/<[^<>]+>/', '', $html);
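As a rough illustration of what the stripping step leaves behind, here is a minimal sketch run on a shortened, hypothetical version of the markup from the question. The `$html` sample is an assumption made for the example. PHP's built-in `strip_tags()` is shown next to the regex as an alternative for simple markup like this.

```php
<?php
// Hypothetical, shortened sample modelled on the markup in the question.
$html = <<<'HTML'
<tr>
  <td><span class="bodytext"><b>Contact:</b></span> <span class="bodytext">John Doe</span></td>
</tr>
<tr>
  <td><span class="bodytext">PO Box 2112</span></td>
</tr>
HTML;

// Remove anything shaped like a tag: "<", one or more characters
// that are not angle brackets, then ">". Newlines inside a tag are
// covered too, because [^<>] also matches line breaks.
$text = preg_replace('/<[^<>]+>/', '', $html);

// Built-in alternative that also removes tags.
$builtin = strip_tags($html);

echo $text, PHP_EOL;
```

Only the tags are removed. The whitespace and line breaks that sat between them stay in `$text`, so the result usually needs trimming afterwards.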

To grab everything between "Contact:" and "Phone:", use:

if (preg_match('/Contact:(.*?)Phone:/s', $text, $regs)) {
  $result = $regs[1];
} else {
  $result = "";
}
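The captured block still contains the blank lines and indentation left over from the removed tags. The question asks for the individual `bodytext` values, so one possible follow-up is to split the capture into trimmed, non-empty lines. This is a sketch under assumptions: `$text` is a hypothetical tag-stripped string, and `$lines` is a name introduced only for this example.

```php
<?php
// Hypothetical tag-stripped text, similar in shape to what the
// question's markup leaves behind after the tags are removed.
$text = "Contact:  \n   John Doe \n\n  PO Box 2112\n\n\n  JOHAN NSW 9700\n  Phone: \n  02 9999 9999\n";

$lines = [];
if (preg_match('/Contact:(.*?)Phone:/s', $text, $regs)) {
    // Split the capture on any line-break sequence, trim each piece,
    // and drop the pieces that end up empty.
    $pieces = preg_split('/\R/', $regs[1]);
    $lines  = array_values(array_filter(array_map('trim', $pieces), 'strlen'));
}

print_r($lines);
```

With this sample input, `$lines` holds the three address lines between the two labels, which can then be processed in an ordinary `foreach` loop.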

To grab everything between two colons, use:

if (preg_match('/:([^:]*):/', $text, $regs)) {
  $result = $regs[1];
} else {
  $result = "";
}
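On the question's last point: PHP's `preg_*` functions use PCRE, where a non-capturing group is written `(?:...)`. The sketch below uses one to accept alternative labels without shifting the capture index. The labels `Attn` and `Fax` and the sample text are hypothetical and chosen only for illustration.

```php
<?php
// Hypothetical tag-stripped text that uses different labels.
$text = "Attn: Jane Roe\nPO Box 1\nFax: 02 0000 0000";

// (?:...) groups the alternatives without creating a capture,
// so the text between the labels is still group 1.
if (preg_match('/(?:Contact|Attn):(.*?)(?:Phone|Fax):/s', $text, $regs)) {
    $result = trim($regs[1]);
} else {
    $result = "";
}

echo $result, PHP_EOL;
```

Because neither label group captures, `$regs` has only two entries: the full match at index 0 and the text between the labels at index 1.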
