Selective data extraction from forum site using DOM PHP web crawler

Problem description

I have a PHP DOM web crawler that works fine. It extracts the specified tags, along with their links, from an external forum site and shows them on my page.

But recently I ran into a problem.

This is the HTML of the forum data:

<tbody>
<tr>
    <td width="1%" height="25">&nbsp;</td>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837880.php" target="_top" class="Links2">Hispanic Study Partner</a> - dreamer1984</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/28/17 01:42</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="8%" height="25" align="Center" class="FootNotes2">0</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="9%" height="25" align="Center" class="FootNotes2">200</td>
</tr>
<tr>
    <td width="1%" height="25">&nbsp;</td>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837879.php" target="_top" class="Links2">nbme</a> - monariyadh</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/27/17 23:12</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="8%" height="25" align="Center" class="FootNotes2">0</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="9%" height="25" align="Center" class="FootNotes2">108</td>
</tr>
</tbody>

Now suppose the table data above is the only content on that site, and I try to extract it with a web crawler like this:

<?php
require_once('dom/simple_html_dom.php');
$html = file_get_html('http://www.sitename.com/');
foreach ($html->find('td.FootNotes2') as $element) {
    echo $element;
}
?>

It extracts all the data inside every td whose class name is "FootNotes2".

Now, what if I want to extract only specific data from within a tag, for example the names "dreamer1984" and "monariyadh" from the first tag in each row?

And what if I want to extract data only from the 3rd cell (skipping the rest), given that all of the cells have the same class name?

I hope I have made the problem clear.

Any help is appreciated.

Solution

I suggest using a regular expression.

Here is an example of what you need:

$subject = <<<EOF
<tbody>
<tr>
    <td width="1%" height="25">&nbsp;</td>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837880.php" target="_top" class="Links2">Hispanic Study Partner</a> - dreamer1984</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/28/17 01:42</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="8%" height="25" align="Center" class="FootNotes2">0</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="9%" height="25" align="Center" class="FootNotes2">200</td>
</tr>
<tr>
    <td width="1%" height="25">&nbsp;</td>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837879.php" target="_top" class="Links2">nbme</a> - monariyadh</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/27/17 23:12</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="8%" height="25" align="Center" class="FootNotes2">0</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="9%" height="25" align="Center" class="FootNotes2">108</td>
</tr>
</tbody>
EOF;

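// Capture the poster name that follows "</a> - " in the first FootNotes2 cell,
// then the date and time (for example 02/28/17 01:42) in the next FootNotes2 cell.
preg_match_all('/<td.+?FootNotes2.+?<a.+?<\/a> - (?P<name>.*?)<\/td>.+?<td.+?FootNotes2.+?(?P<date>\d{2}\/\d{2}\/\d{2} \d{2}:\d{2})/siu', $subject, $matches);

// The named groups are parallel arrays, so the same index pairs a name with its date.
foreach ($matches['name'] as $k => $v) {
    var_dump('name: ' . $v, 'relative date: ' . $matches['date'][$k]);
}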
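
If the regex proves brittle when the markup shifts, the same two goals (the poster name from the first cell, and the Nth cell among those sharing a class) can also be approached with PHP's built-in DOMDocument and DOMXPath. The following is only a sketch under some assumptions: extractForumRows and $sampleHtml are hypothetical names, the sample rows are wrapped in a table element so the parser sees a complete table, and the class attribute is assumed to be exactly "FootNotes2" with no other classes on the cell.

```php
<?php
// Sample markup from the question, wrapped in <table> (an assumption) so it parses as a table.
$sampleHtml = <<<'HTML'
<table><tbody>
<tr>
    <td width="1%" height="25">&nbsp;</td>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837880.php" target="_top" class="Links2">Hispanic Study Partner</a> - dreamer1984</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/28/17 01:42</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="8%" height="25" align="Center" class="FootNotes2">0</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="9%" height="25" align="Center" class="FootNotes2">200</td>
</tr>
<tr>
    <td width="1%" height="25">&nbsp;</td>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837879.php" target="_top" class="Links2">nbme</a> - monariyadh</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/27/17 23:12</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="8%" height="25" align="Center" class="FootNotes2">0</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="9%" height="25" align="Center" class="FootNotes2">108</td>
</tr>
</tbody></table>
HTML;

// Hypothetical helper: returns one associative array per row that has
// at least three FootNotes2 cells.
function extractForumRows(string $html): array
{
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query('//tr') as $tr) {
        // Only the cells that carry the class, in document order.
        $cells = $xpath->query('./td[@class="FootNotes2"]', $tr);
        if ($cells->length < 3) {
            continue; // not a forum row
        }
        $first = $cells->item(0);
        $link = $xpath->query('./a', $first)->item(0);

        // The poster name is the bare text after the link, e.g. " - dreamer1984".
        $tail = '';
        foreach ($first->childNodes as $child) {
            if ($child->nodeType === XML_TEXT_NODE) {
                $tail .= $child->nodeValue;
            }
        }
        $rows[] = [
            'title' => $link ? trim($link->textContent) : '',
            'href'  => $link ? $link->getAttribute('href') : '',
            'name'  => trim(preg_replace('/^\s*-\s*/', '', $tail)),
            'date'  => trim($cells->item(1)->textContent),
            'third' => trim($cells->item(2)->textContent), // 3rd cell with the class
        ];
    }
    return $rows;
}

foreach (extractForumRows($sampleHtml) as $row) {
    printf("%s | %s | %s\n", $row['name'], $row['date'], $row['third']);
}
```

Selecting the class-bearing cells per row and then indexing into them is what makes "the 3rd one, skipping the rest" straightforward. For a live page, the fetched HTML string would be passed in place of $sampleHtml.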
