Selective data extraction from forum site using DOM PHP web crawler

Views: 162 Posted: 2017/6/28 19:01:09 Tags: php, dom, web-crawler

Problem description

I have a PHP DOM web crawler that works fine. It extracts the specified tags, along with their links, from an external forum site to my page.

But recently I ran into a problem.

This is the HTML of the forum data:

<tbody>
<tr>
    <td width="1%" height="25">&nbsp;</td>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837880.php" target="_top" class="Links2">Hispanic Study Partner</a> - dreamer1984</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/28/17 01:42</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="8%" height="25" align="Center" class="FootNotes2">0</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="9%" height="25" align="Center" class="FootNotes2">200</td>
</tr>
<tr>
    <td width="1%" height="25">&nbsp;</td>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837879.php" target="_top" class="Links2">nbme</a> - monariyadh</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/27/17 23:12</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="8%" height="25" align="Center" class="FootNotes2">0</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="9%" height="25" align="Center" class="FootNotes2">108</td>
</tr>
</tbody>

Now suppose the table markup above is the only content on that site, and I try to extract it with a web crawler like this:

<?php
require_once('dom/simple_html_dom.php');

// Load the remote page and print every <td> that has the class FootNotes2.
$html = file_get_html('http://www.sitename.com/');
foreach ($html->find('td.FootNotes2') as $element) {
    echo $element;
}
?>

It extracts all the data inside every cell with the class name "FootNotes2".

Now, what if I want to extract only specific data from a cell, for example the names "dreamer1984" and "monariyadh" from the first FootNotes2 cell in each row?

And what if I want to extract data only from the 3rd cell (skipping the rest), given that all of these cells share the same class name?

I hope I have made the problem clear.

Any help is appreciated.

Solution

I suggest using a regular expression.

Here is an example of what you need:

$subject = <<<EOF
<tbody>
<tr>
    <td width="1%" height="25">&nbsp;</td>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837880.php" target="_top" class="Links2">Hispanic Study Partner</a> - dreamer1984</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/28/17 01:42</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="8%" height="25" align="Center" class="FootNotes2">0</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="9%" height="25" align="Center" class="FootNotes2">200</td>
</tr>
<tr>
    <td width="1%" height="25">&nbsp;</td>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837879.php" target="_top" class="Links2">nbme</a> - monariyadh</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/27/17 23:12</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="8%" height="25" align="Center" class="FootNotes2">0</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="9%" height="25" align="Center" class="FootNotes2">108</td>
</tr>
</tbody>
EOF;

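// Capture the user name after the link in the first FootNotes2 cell,
// then the date in the next FootNotes2 cell of the same row.
preg_match_all('/<td.+?FootNotes2.+?<a.+?<\/a> - (?P<name>.*?)<\/td>.+?<td.+?FootNotes2.+?(?P<date>\d{2}\/\d{2}\/\d{2} \d{2}:\d{2})/siu', $subject, $matches);

// The named groups are parallel arrays, so index $k pairs each name with its date.
foreach ($matches['name'] as $k => $v) {
    var_dump('name: ' . $v, 'relative date: ' . $matches['date'][$k]);
}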
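
If you would rather stay with a DOM approach, as in the question, the same fields can probably be picked out by position within each row. The following is only a minimal sketch using PHP's built-in DOMDocument and DOMXPath rather than simple_html_dom. The function name extractForumRows, the variable $sample, and the array key third_cell are made up for illustration. It assumes every data row has at least three td.FootNotes2 cells laid out as in the question, and that the snippet sits inside a <table> element on the real page.

```php
<?php
// Sketch only: extractForumRows() and $sample are hypothetical names for illustration.
// Assumes each data row has at least three td.FootNotes2 cells, as in the question.

function extractForumRows(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml parse warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match cells whose class attribute contains the token FootNotes2.
    $cellQuery = './td[contains(concat(" ", normalize-space(@class), " "), " FootNotes2 ")]';

    $rows = [];
    foreach ($xpath->query('//tr') as $tr) {
        $cells = $xpath->query($cellQuery, $tr);
        if ($cells->length < 3) {
            continue; // not a data row in the assumed layout
        }

        // The user name is the loose text after the link, e.g. " - dreamer1984".
        $tail = '';
        foreach ($xpath->query('./text()', $cells->item(0)) as $textNode) {
            $tail .= $textNode->nodeValue;
        }

        $link = $xpath->query('./a', $cells->item(0))->item(0);

        $rows[] = [
            'title'      => $link ? trim($link->textContent) : '',
            'name'       => trim(preg_replace('/^\s*-\s*/', '', $tail)),
            'date'       => trim($cells->item(1)->textContent),
            // Third FootNotes2 cell in the row; the other cells are skipped.
            'third_cell' => trim($cells->item(2)->textContent),
        ];
    }

    return $rows;
}

// Sample markup from the question, wrapped in <table> (an assumption about the real page).
$sample = <<<'SAMPLE'
<table><tbody>
<tr>
    <td width="1%" height="25">&nbsp;</td>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837880.php" target="_top" class="Links2">Hispanic Study Partner</a> - dreamer1984</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/28/17 01:42</td>
    <td width="8%" height="25" align="Center" class="FootNotes2">0</td>
    <td width="9%" height="25" align="Center" class="FootNotes2">200</td>
</tr>
<tr>
    <td width="1%" height="25">&nbsp;</td>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837879.php" target="_top" class="Links2">nbme</a> - monariyadh</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/27/17 23:12</td>
    <td width="8%" height="25" align="Center" class="FootNotes2">0</td>
    <td width="9%" height="25" align="Center" class="FootNotes2">108</td>
</tr>
</tbody></table>
SAMPLE;

foreach (extractForumRows($sample) as $row) {
    echo $row['name'], ' | ', $row['date'], ' | ', $row['third_cell'], PHP_EOL;
}
```

Selecting cells by their position inside each row is what lets you take the first or the third FootNotes2 cell even though they all share one class name. For a live page you would pass the downloaded HTML to the function instead of $sample, and the cell indexes would need adjusting if the real layout differs from the snippet.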
