Need help scraping webpage -- getting specific content...


Problem description


I have a table whose number of columns can change depending on the configuration of the scraped page (I have no control over it). I want to get only the information from a specific column, identified by its column heading.

Here is a simplified table:

<table>
<tbody>
<tr class='header'>
    <td>Image</td>
    <td>Name</td>
    <td>Time</td>
</tr>
<tr>
    <td><img src='someimage.png' /></td>
    <td>Name 1</td>
    <td>13:02</td>
</tr>
<tr>
    <td><img src='someimage.png' /></td>
    <td>Name 2</td>
    <td>13:43</td>
</tr>
<tr>
    <td><img src='someimage.png' /></td>
    <td>Name 3</td>
    <td>14:53</td>
</tr>
</tbody>
</table>

I want to extract only the names (column 2) from the table. However, as stated above, the column order cannot be known in advance. The Image column might not be there, for example, in which case the column I want would be the first one.

I was wondering if there's any way to do this with DomDocument/DomXPath. Perhaps search for the string "Name" in the first tr, find out which column index it has, and then use that index to get the info. A less elegant solution would be to check whether the first column has an img tag, in which case the image column is first, so we can throw that away and use the next one.

I've been looking at it for about an hour and a half, but I'm not familiar with DomDocument functions and manipulation. I'm having a lot of trouble with this one.

Solution

Simple HTML DOM Parser may be useful here; its manual covers the details. Basically, you could use something like this:

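$url = "file url";
$html = file_get_html($url);   // file_get_html() is provided by Simple HTML DOM Parser
$header = $html->find('tr.header td');
$num = null;                   // stays null if there is no Image column
$i = 0;
foreach ($header as $element) {
    // plaintext is the cell's text with tags stripped
    if (trim($element->plaintext) == 'Image') { $num = $i; }
    $i++;
}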

This finds which column ($num, zero-based) is the Image column. You can add further code to build on it, for example by matching 'Name' instead.
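
Since the question asks about DomDocument/DomXPath specifically, the same header-lookup idea can be sketched with PHP's built-in DOM extension. This is a minimal sketch, not a definitive implementation: it assumes the header row carries class='header' and the wanted heading is exactly "Name", as in the simplified table from the question. The inline $html string stands in for the fetched page, and the variable names ($position, $names) are only illustrative.

```php
<?php
// Stand-in for the scraped page: the simplified table from the question.
$html = <<<HTML
<table>
<tbody>
<tr class='header'><td>Image</td><td>Name</td><td>Time</td></tr>
<tr><td><img src='someimage.png' /></td><td>Name 1</td><td>13:02</td></tr>
<tr><td><img src='someimage.png' /></td><td>Name 2</td><td>13:43</td></tr>
<tr><td><img src='someimage.png' /></td><td>Name 3</td><td>14:53</td></tr>
</tbody>
</table>
HTML;

// Scraped HTML is rarely valid, so collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Step 1: find the position of the header cell whose text is "Name".
$position = null;
$index = 1; // XPath positions are 1-based
foreach ($xpath->query("//tr[@class='header']/td") as $cell) {
    if (trim($cell->textContent) === 'Name') {
        $position = $index;
        break;
    }
    $index++;
}

// Step 2: take the cell at that position from every non-header row.
$names = [];
if ($position !== null) {
    $query = sprintf("//tr[not(@class='header')]/td[%d]", $position);
    foreach ($xpath->query($query) as $cell) {
        $names[] = trim($cell->textContent);
    }
}

print_r($names); // Name 1, Name 2, Name 3 for the sample table
```

Because the position is looked up from the header on every run, the extraction keeps working if the Image column is absent and Name becomes the first column.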

PS: An easy way to find all image sources:

$images = $html->find('tr td img');
$imageUrl = array();   // collect every img src found in the table
foreach ($images as $image) {
    $imageUrl[] = $image->src;
}
