PHP web scraping of JavaScript-generated content


Question




I am stuck with a scraping task in my project.

I want to grab the data from the page loaded into $html, namely all the table content in its tr and td elements. Here I am trying to grab the links, but the output only shows javascript:self.close()

<?php
include("simple_html_dom.php");

$html = file_get_html('http://www.areacodelocations.info/allcities.php?ac=201');

foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}
?>

Solution

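Usually, this kind of page loads a bunch of JavaScript (jQuery, etc.), which then builds the interface and retrieves the data to be displayed from a data source. Because file_get_html() only sees the initial HTML and does not run that JavaScript, the table never shows up in what you scrape.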

So what you need to do is open that page in Firefox or a similar browser, with a tool such as Firebug, in order to see which requests are actually being made. If you're lucky, you will find the data source directly in the list of XHR requests, as in this case:

http://www.govliquidation.com/json/buyer_ux/salescalendar.js

Note that this course of action may infringe on some license or terms of use. Clear this with the webmaster/data source/copyright owner before proceeding: detecting and forbidding this kind of scraping is very easy, and identifying you is probably only slightly less so.

Anyway, if you issue the same call in PHP, you can directly scrape the data (provided there is no session/authentication issue, as seems to be the case here) with very simple code:

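<?php

// Endpoint found in the browser's list of XHR requests.
$url = "http://www.govliquidation.com/json/buyer_ux/salescalendar.js";

// Fetch the raw JSON text, then decode it into nested stdClass objects.
$json = file_get_contents($url);
$data = json_decode($json);

?>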
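
The call above assumes that the request succeeds and that the body is valid JSON. A slightly more defensive variant might look like the sketch below. The function names fetch_body() and decode_json(), the 10-second timeout, and the User-Agent string are illustrative assumptions, not part of the original answer.

```php
<?php

/**
 * Fetch a URL and return the response body, or throw on failure.
 * The timeout and User-Agent values are illustrative assumptions.
 */
function fetch_body(string $url): string
{
    $context = stream_context_create([
        'http' => [
            'timeout'    => 10,
            'user_agent' => 'example-scraper/0.1',
        ],
    ]);

    // file_get_contents() returns false on failure; @ silences the warning
    // because the failure is reported through the exception instead.
    $body = @file_get_contents($url, false, $context);
    if ($body === false) {
        throw new RuntimeException("Request failed: $url");
    }
    return $body;
}

/**
 * Decode a JSON string, throwing if it is not valid JSON.
 */
function decode_json(string $json)
{
    $data = json_decode($json);
    if (json_last_error() !== JSON_ERROR_NONE) {
        throw new RuntimeException('Invalid JSON: ' . json_last_error_msg());
    }
    return $data;
}
```

With these two helpers, the fetch becomes $data = decode_json(fetch_body($url)); and a network error or an HTML error page raises an exception rather than silently leaving $data empty.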

This yields a data object that you can inspect and convert to CSV by simple looping.

stdClass Object
(
    [result] => stdClass Object
        (
            [events] => Array
                (
                    [0] => stdClass Object
                        (
                            [yahoo_dur] => 11300
                            [closing_today] => 0
                            [language_code] => en
                            [mixed_id] => 9297
                            [event_id] => 9297
                            [close_meridian] => PM
                            [commercial_sale_flag] => 0
                            [close_time] => 01/06/2014
                            [award_time_unixtime] => 1389070800
                            [category] => Tires, Parts & Components
                            [open_time_unixtime] => 1388638800
                            [yahoo_date] => 20140102T000000Z
                            [open_time] => 01/02/2014
                            [event_close_time] => 2014-01-06 17:00:00
                            [display_event_id] => 9297
                            [type_code] => X3
                            [title] => Truck Drive Axles @ Killeen, TX
                            [special_flag] => 1
                            [demil_flag] => 0
                            [google_close] => 20140106
                            [event_open_time] => 2014-01-02 00:00:00
                            [google_open] => 20140102
                            [third_party_url] =>
                            [bid_package_flag] => 0
                            [is_open] => 1
                            [fda_count] => 0
                            [close_time_unixtime] => 1389045600

You retrieve $data->result->events, call fputcsv() on each of its items after casting it to an array, and Bob's your uncle.
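
That last step could be sketched roughly as follows. The helper name events_to_csv() is hypothetical, and the sketch assumes that every event has the same scalar fields in the same order, as the dump above suggests; nested objects or arrays would need flattening first.

```php
<?php

/**
 * Write a list of event objects (as produced by json_decode()) to a CSV stream.
 * The header row is taken from the property names of the first event.
 * Returns the number of data rows written.
 */
function events_to_csv(array $events, $stream): int
{
    if (count($events) === 0) {
        return 0;
    }

    // Casting a stdClass object to array gives a name => value map of its properties.
    // Delimiter, enclosure and escape are passed explicitly (these are the
    // traditional defaults) so the call behaves the same across PHP versions.
    $header = array_keys((array) $events[0]);
    fputcsv($stream, $header, ',', '"', '\\');

    $rows = 0;
    foreach ($events as $event) {
        fputcsv($stream, array_values((array) $event), ',', '"', '\\');
        $rows++;
    }
    return $rows;
}
```

To write a file, open a handle with fopen('events.csv', 'w'), pass it together with $data->result->events to events_to_csv(), then fclose() the handle. The file name events.csv is only an example.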
