Parsing HTML tables via DOM


Problem description

I believe the markup of the page is part of the issue I am having, so I think I need to post the source: a JSFiddle and the original GIS page.

I am trying to get info such as Name: and Address: from the table at the bottom.

Attempt at a solution:

I wrote the following code hoping to see all the table data, yet the table I'm looking to get data from returns nothing.

 <?php
 $k=0;
 $num=1000;
 var_dump(libxml_use_internal_errors(true));
 $domOb = new DOMDocument();
 $html = @$domOb->loadHTMLFile('http://www.gis.catawba.nc.us/website/Parcel/parcel_main.asp?Cmd=query&key=372215634301&type=P');
 $domOb->preserveWhiteSpace = false; 
 $items = $domOb->getElementsByTagName('td'); 
 while ($k<(int)$num){
 echo $items->item($k++)->nodeValue.'<br>'; 
 };
 ?>

All that was returned was:

 bool(false)
 Real Estate Search - Legacy
 Map Layers
 visible
 FAQ's
 Help
 GIS Home

Can someone tell me what I'm doing wrong that causes me to miss all the data I'm looking for? How can I pull just the name and address as simply as possible?

I also attempted the following using XPath, but I get lots of warnings...

 $dom = new DOMDocument;
 $dom->load('http://www.gis.catawba.nc.us/website/Parcel/parcel_main.asp?Cmd=query&key=372215634301&type=P');
 $s = simplexml_import_dom($dom);

 echo $name = $s->xpath('//table[@class="words13]/td[contains(text(), "Name:")]');
 echo $add = $s->xpath('//table[@class="words13]/td[contains(text(), Address:)]');

Using the code by user2518542 combined with hakre's code, I get the following:

 $ch = curl_init();  
 curl_setopt($ch, CURLOPT_URL,"http://www.gis.catawba.nc.us/website/Parcel/parcel_main.asp?Cmd=QUERY&key=372215634301&type=P&width=1280&height=923");
 curl_setopt($ch, CURLOPT_TIMEOUT, 30); //timeout after 30 seconds
 curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
 $result=curl_exec ($ch);
 curl_close ($ch);
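 $doc = new DOMDocument(); // the parser object must exist before loadHTML() is called
 libxml_use_internal_errors(true); // collect markup warnings instead of printing them
 $doc->loadHTML($result);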

 $tds = $doc->getElementsByTagname('td');
 foreach($tds as $td) {
 printf(" * %s\n", $td->textContent);
 echo '<br>';
 }

The code above successfully prints out all the tags.

Solution

The table cells you are looking for are not part of that HTML document. First of all, you need to understand the basics of web scraping. I suggest you borrow some books about the topic and read through them.

Time for the library ;)
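
One way to check whether the cells are present in what PHP actually receives is to count the `td` elements in the downloaded string and look for the wanted label. This is only a rough sketch: `inspectCells` is a hypothetical helper, not part of the original answer, and the inline `$html` string stands in for whatever `curl_exec()` returned.

```php
<?php
// Hypothetical helper: report how many <td> cells an HTML string contains
// and whether any of them mentions the label we are after.
function inspectCells(string $html, string $label): array
{
    libxml_use_internal_errors(true);   // collect markup warnings quietly
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $cells = $doc->getElementsByTagName('td');
    $found = false;
    foreach ($cells as $td) {
        if (strpos($td->textContent, $label) !== false) {
            $found = true;
            break;
        }
    }
    return ['cells' => $cells->length, 'found' => $found];
}

// Stand-in for the downloaded page (assumption: only the layer list arrived).
$html = '<table><tr><td>Map Layers</td><td>visible</td></tr></table>';
var_dump(inspectCells($html, 'Name:'));
```

If `found` is false for the live page, the parcel data was never in the downloaded markup, so no amount of DOM or XPath work on that string will produce it.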


In case the table cells are in the document (this seems to vary; sometimes they are, sometimes they are not), the following example, run against a saved copy of the page, prints them. It also demonstrates how to iterate over a DOMNodeList:

$doc = new DOMDocument();

libxml_use_internal_errors(true);
$doc->loadHTMLFile('Catawba County Legacy Map Server.html');

$tds = $doc->getElementsByTagName('td');
foreach($tds as $td) {
    printf(" * %s\n", $td->textContent);
}

Example output:

php "test.php" (in directory: /home/hakre/php/test)
 *
 * Real Estate Search - Legacy
 *
 *
 *
 *
 *
 *
 *
 *
 *
 * Map Layers
 * visible
 *
 *
 * Parcels
 *
 * Parcel Annotation
 *
 * Address Points
 *
 * Misc. Lines
 *
 * Structures
 *
 * Contour Lines
 *
 * Soils
 *
 * Townships
 *
 * Water Features
 *
 * Tiles
 *
 * Flood Zone
 *
 * Agricultural District
 *
 * Aerial 2009
 *
 * Aerial 2005
 *
 * Aerial 2002
 *
 * Cities
 *
 * Print the Map  
 * Print Map and Parcel Report  
 * Print the Parcel Report  
 * Assessment Report  
 * List all Owners  
 * Deed History Report
 * Parcel Information:
 * Owner Information:
 * Parcel ID: 372215634301
 * Name: PENLEY TREASURE B
 * Parcel Address: 3152 7TH AV SE 
 * Name2:  
 * City: CONOVER 28613
 * Address: 5508 SWINGING BRIDGE RD
 * LRK(REID): 57186
 * Address2:  
 * Deed Book/Page: 1906/0741 Deed Image
 * City: CONOVER
 * Subdivision: FOREST HGTS
 * State/Zip: NC 28613-7415
 * Lots: 1-4
 *
 * Block: C
 *
 * Last Sale:
 * School Information:
 * Plat Book/Page: 8/119 Plat Image
 * School District: COUNTY
 * Calculated Acreage: 0.31
 * Elementary School: WEBB A MURRAY
 * Tax Map: 167H  04006A
 * Middle School: ARNDT
 * State Road:  
 * High School: ST STEPHENS
 * Township: HICKORY
 * School Map
 *  
 *  
 * Tax/Value Information:  Tax Rates(pdf)
 * Zoning Information:
 * Municipal Tax District:  
 * Zoning District: HICKORY
 * Fire District: HICKORY RURAL
 * Zoning1: OI
 * Tax Account Number:  
 * Zoning2:  
 * Market Building(s) Value: $55,400
 * Zoning3:  
 * Market Land Value: $20,300
 * Zoning Overlay:  
 * Market Total Value: $75,700
 * Small Area:  
 * Year Built/Remodeled: 1959  
 * Split Zoning District 1/2: 0/0
 * Current Tax Bill
 * Zoning Agency Phone Numbers
 * Miscellaneous:
 *  
 * Voter Precinct:P35
 * Firm Panel Date: 9/5/2007
 * Building Permits for this parcel
 * Firm Panel #: 3710372200J
 * WaterShed:  
 * 2010 Census Tract: 011000
 * WaterShed Split:  
 * 2010 Census Block: 3035
 * Parcel Report Data Descriptions
 * Agricultural District:  
 * FAQ's
 * Help
 * GIS Home
Compilation finished successfully.
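
Since each value in the output above sits in the same cell as its label (for example `Name: PENLEY TREASURE B`), pulling out just the name and address could be done by matching the start of each cell's text. The following is a minimal sketch under that assumption; `extractFields` is a hypothetical helper, and the inline sample stands in for the real page.

```php
<?php
// Hypothetical helper (not from the original answer): collect the values of
// <td> cells whose text starts with "<label>:" for each wanted label.
function extractFields(string $html, array $labels): array
{
    libxml_use_internal_errors(true);   // collect markup warnings quietly
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $fields = [];
    foreach ($doc->getElementsByTagName('td') as $td) {
        // Collapse runs of whitespace, including non-breaking spaces.
        $text = trim(preg_replace('/[\s\x{00A0}]+/u', ' ', $td->textContent));
        foreach ($labels as $label) {
            $prefix = $label . ':';
            // Keep only the first matching cell for each label.
            if (!isset($fields[$label]) && strpos($text, $prefix) === 0) {
                $fields[$label] = trim(substr($text, strlen($prefix)));
            }
        }
    }
    return $fields;
}

// Inline sample standing in for the downloaded page.
$sample = '<table><tr><td>Name: PENLEY TREASURE B</td>'
        . '<td>Parcel Address: 3152 7TH AV SE</td></tr>'
        . '<tr><td>Address: 5508 SWINGING BRIDGE RD</td></tr></table>';

print_r(extractFields($sample, ['Name', 'Address']));
```

Matching on the start of the text means `Parcel Address:` and `Name2:` are not mistaken for `Address:` and `Name:`. Note that the page has two `City:` cells, so a label like that would need extra disambiguation.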
