Can't separate cells properly with simplehtmldom


Problem description


I am trying to write a web scraper. I want to get all the cells in a row. The row before the one I want has THOROUGHBRED MEETINGS as its plain text value. I can successfully get this row. But I can't figure out how to get the next row's children which are the cells or <td> tags.

if ($foundTag = FindTagByText("THOROUGHBRED MEETINGS", $html))
{
    $cell = $foundTag->parent();
    $row = $cell->parent();
    $nextRow = $row->next_sibling();
    echo "Row: ".$row->plaintext."<br />\n";
    echo "Next Row: ".$nextRow->plaintext."<br />\n";
    $cells = $nextRow->children();

    foreach ($cells as $cell)
    {
        echo "Cell: ".$cell->plaintext."<br />\n";
    }
}

function FindTagByText($text, $html)
{
    // Use Simple_HTML_DOM special selector 'text'
    // to retrieve all text nodes from the document
    $textNodes = $html->find('text');
    $foundTag = null;

    foreach($textNodes as $textNode) 
    {
        if($textNode->plaintext == $text) 
        {
            // Get the parent of the text node
            // (A text node is always a child of
            //  its container)
            $foundTag = $textNode->parent();
            break;
        }
    }

    return $foundTag;
}

Here is the HTML I am trying to parse:

<tr valign=top>
<td colspan=16 bgcolor=#999999><b>THOROUGHBRED MEETINGS</b></td>

</tr>
<tr valign=top bgcolor="#ffffff">
<td><b>BR</b> <a href="meeting?mtg=br&day=today&curtype=0">SUNSHINE COAST</a></td>
<td>FINE/DEAD</b></td>
<td><font color=#cc0000><b>R1</b></font>@<b>12:30pm</b></td>
<td align=center bgcolor=#cc0000><a href="odds?mting=BR01000"><b><font color=#ffffff>1</a></font></td>
<td align=center><a href="odds?mting=BR02000"><b><font color=black>2</b></font></a></td>
<td align=center><a href="odds?mting=BR03000"><b><font color=black>3</b></font></a></td>

<td align=center><a href="odds?mting=BR04000"><b><font color=black>4</b></font></a></td>
<td align=center><a href="odds?mting=BR05000"><b><font color=black>5</b></font></a></td>
<td align=center><a href="odds?mting=BR06000"><b><font color=black>6</b></font></a></td>
<td align=center><a href="odds?mting=BR07000"><b><font color=black>7</b></font></a></td>
<td align=center><a href="odds?mting=BR08000"><b><font color=black>8</b></font></a></td>
<td bgcolor="#ffffff" colspan=4>&nbsp;</td>
</tr>

Here is my output:

Row: THOROUGHBRED MEETINGS
Next Row: BR SUNSHINE COAST FINE/DEAD R1@12:30pm 1 2 3 4 5 6 7 8   CR NEW ZEALAND FINE/DEAD R3@11:10am 1 2 3 4 5 6 7 8 9   DR HOBART OCAST/HVY R1@12:15pm 1 2 3 4 5 6 7   MR CRANBOURNE OCAST/SLOW R1@12:20pm 1 2 3 4 5 6 7 8   NR COFFS HARBOUR OCAST/SLOW R1@12:45pm 1 2 3 4 5 6 7 8   SR MORUYA FINE/GOOD R1@12:25pm 1 2 3 4 5 6 7 8   VR BENALLA OCAST/SLOW R1@12:35pm 1 2 3 4 5 6 7 8   XR KALGOORLIE FINE/GOOD R1@ 3:00pm 1 2 3 4 5 6 7     HARNESS MEETINGS DT LAUNCESTON SHWRY/GOOD R1@ 4:57pm 1 2 3 4 5 6 7 8 9 10   MT CRANBOURNE OCAST/GOOD R1@ 5:05pm 1 2 3 4 5 6 7 8     GREYHOUND MEETINGS AD GAWLER OCAST/GOOD R1@ 5:10pm 1 2 3 4 5 6 7 8 9 10 11   CD CANBERRA OCAST/GOOD R1@ 5:02pm 1 2 3 4 5 6 7 8 9 10 11   MD SALE FINE/GOOD R1@ 4:54pm 1 2 3 4 5 6 7 8 9 10 11 12
Cell: BR SUNSHINE COAST
Cell: FINE/DEAD
Cell: R1@12:30pm
Cell: 1 2 3 4 5 6 7 8   CR NEW ZEALAND FINE/DEAD R3@11:10am 1 2 3 4 5 6 7 8 9   DR HOBART OCAST/HVY R1@12:15pm 1 2 3 4 5 6 7   MR CRANBOURNE OCAST/SLOW R1@12:20pm 1 2 3 4 5 6 7 8   NR COFFS HARBOUR OCAST/SLOW R1@12:45pm 1 2 3 4 5 6 7 8   SR MORUYA FINE/GOOD R1@12:25pm 1 2 3 4 5 6 7 8   VR BENALLA OCAST/SLOW R1@12:35pm 1 2 3 4 5 6 7 8   XR KALGOORLIE FINE/GOOD R1@ 3:00pm 1 2 3 4 5 6 7     HARNESS MEETINGS DT LAUNCESTON SHWRY/GOOD R1@ 4:57pm 1 2 3 4 5 6 7 8 9 10   MT CRANBOURNE OCAST/GOOD R1@ 5:05pm 1 2 3 4 5 6 7 8     GREYHOUND MEETINGS AD GAWLER OCAST/GOOD R1@ 5:10pm 1 2 3 4 5 6 7 8 9 10 11   CD CANBERRA OCAST/GOOD R1@ 5:02pm 1 2 3 4 5 6 7 8 9 10 11   MD SALE FINE/GOOD R1@ 4:54pm 1 2 3 4 5 6 7 8 9 10 11 12 

Solution

You will not like my answer.

Unfortunately, it seems that mismatched closing tags in the HTML you are parsing are confusing Simple_HTML_DOM. Take a look at this snippet:

<td align=center><a href="odds?mting=BR02000"><b><font color=black>2</b></font></a></td>

If you follow the order of tags of this snippet:

  • <td> is opened
  • <a> is opened
  • <b> is opened
  • <font> is opened

Technically, tags should be closed in the opposite order, but this is how they are closed:

  • </b> is closed
  • </font> is closed
  • </a> is closed
  • </td> is closed

The HTML you are trying to scrape is full of those mistakes, as well as closing tags for tags which are never opened. Simple_HTML_DOM doesn't parse those files properly.
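
To see where the nesting goes wrong before feeding a page to any parser, a small stack-based check can help. The following is only a rough sketch: findNestingErrors is a hypothetical helper (not part of Simple_HTML_DOM), it uses PHP 7.1+ syntax, and it assumes that attribute values contain no ">" and that the fragment has no void elements such as <br>. Tags that are opened but never closed are not reported.

```php
<?php
// Hypothetical helper: lists closing tags that are out of order or that
// have no matching opening tag. Not part of Simple_HTML_DOM.
function findNestingErrors(string $html): array
{
    $errors = [];
    $stack = []; // names of the tags that are currently open

    // Capture the optional leading slash and the name of every tag.
    preg_match_all('~<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>~', $html, $tags, PREG_SET_ORDER);

    foreach ($tags as [$whole, $slash, $name]) {
        $name = strtolower($name);

        if ($slash === '') {
            $stack[] = $name; // opening tag
            continue;
        }

        // Position of the innermost open tag with this name, if any.
        $index = array_search($name, array_reverse($stack, true), true);

        if ($index === false) {
            $errors[] = "</$name> has no matching opening tag";
            continue;
        }

        if ($index !== count($stack) - 1) {
            $errors[] = "</$name> closed while <" . end($stack) . "> is still open";
        }

        // Treat the tag as closed either way so that checking can continue.
        array_splice($stack, $index, 1);
    }

    return $errors;
}
```

Running it over each row of the page should flag the kind of mistakes listed above, which gives you a list of places to correct by hand or with a clean-up script.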

I'm afraid that if you don't have the possibility of modifying the HTML, you'll have to parse the file manually, correcting any errors.


As a note, I've tested your code against the following corrected HTML, and Simple_HTML_DOM parsed it successfully, and your code worked just fine.

<tr valign=top>
<td colspan=16 bgcolor=#999999><b>THOROUGHBRED MEETINGS</b></td>

</tr>
<tr valign=top bgcolor="#ffffff">
<td><b>BR</b> <a href="meeting?mtg=br&day=today&curtype=0">SUNSHINE COAST</a></td>
<td><b>FINE/DEAD</b></td>
<td><font color=#cc0000><b>R1</b></font>@<b>12:30pm</b></td>
<td align=center bgcolor=#cc0000><a href="odds?mting=BR01000"><b><font color=#ffffff>1</font></b></a></td>
<td align=center><a href="odds?mting=BR02000"><b><font color=black>2</font></b></a></td>
<td align=center><a href="odds?mting=BR03000"><b><font color=black>3</font></b></a></td>

<td align=center><a href="odds?mting=BR04000"><b><font color=black>4</font></b></a></td>
<td align=center><a href="odds?mting=BR05000"><b><font color=black>5</font></b></a></td>
<td align=center><a href="odds?mting=BR06000"><b><font color=black>6</font></b></a></td>
<td align=center><a href="odds?mting=BR07000"><b><font color=black>7</font></b></a></td>
<td align=center><a href="odds?mting=BR08000"><b><font color=black>8</font></b></a></td>
<td bgcolor="#ffffff" colspan=4> </td>
</tr>


Edit: As an alternative, you might want to check whether DOMDocument::loadHTML gives better results. It is available in PHP 5 without external libraries. Check the official documentation.
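
A minimal sketch of that DOMDocument approach might look like the following. cellsAfterHeading is a hypothetical helper name, the type hints need PHP 7 or later, and I have not checked how libxml repairs the malformed page above, so treat this as a starting point rather than a guaranteed fix.

```php
<?php
// Hypothetical helper: returns the text of each <td> in the row that
// follows the row whose trimmed text equals $heading.
function cellsAfterHeading(string $html, string $heading): array
{
    // Collect libxml warnings about malformed markup instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $cells = [];

    foreach ($xpath->query('//tr') as $row) {
        if (trim($row->textContent) !== $heading) {
            continue;
        }

        // First <tr> sibling after the heading row, then its direct <td> children.
        foreach ($xpath->query('following-sibling::tr[1]/td', $row) as $cell) {
            $cells[] = trim($cell->textContent);
        }
        break;
    }

    return $cells;
}
```

You would call it as cellsAfterHeading($pageSource, "THOROUGHBRED MEETINGS"), where $pageSource is the downloaded page as a string, and loop over the returned array the same way your code loops over $cells.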
