PHP Web scraping of Javascript generated contents

Question

I am stuck on a scraping task in my project.

I want to grab the data from the page loaded into $html, that is, all the table content in its tr and td elements. Here I am trying to grab the links first, but the only one it shows is javascript:self.close()

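<?php
// Simple HTML DOM parser library
require 'simple_html_dom.php';

$html = file_get_html('http://www.areacodelocations.info/allcities.php?ac=201');
if ($html === false) {
    die('Could not load the page');
}

// Print the href of every link found in the static HTML
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}
?>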

Recommended answer

Pages of this kind usually load a bunch of Javascript (jQuery, etc.), which then builds the interface and retrieves the data to be displayed from a data source.
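To illustrate why a server-side parser finds so little on such a page, here is a minimal sketch. It uses PHP's built-in DOMDocument rather than simple_html_dom so that it is self-contained, and the markup in $source is a made-up stand-in, not the real page. The parser reads only the static markup and never runs the script, so the row the script would add is not there:

```php
<?php
// Hypothetical page source: the table row is added by a script at run time.
$source = <<<'HTML'
<html><body>
<a href="javascript:self.close()">Close</a>
<table id="cities"></table>
<script>
var cell = document.getElementById('cities').insertRow().insertCell();
cell.textContent = 'Newark';
</script>
</body></html>
HTML;

$doc = new DOMDocument();
// Collect parser warnings quietly; the parser never executes the script.
libxml_use_internal_errors(true);
$doc->loadHTML($source);
libxml_clear_errors();

// Gather every link present in the static markup.
$hrefs = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $hrefs[] = $a->getAttribute('href');
}

// Count table rows present in the static markup.
$rowCount = $doc->getElementsByTagName('tr')->length;

print_r($hrefs);
echo "rows: $rowCount\n";
```

Only the static link is found and no table rows at all, which matches the symptom described in the question.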

So what you need to do is open that page in Firefox or a similar browser with a tool such as Firebug, and look at which requests are actually made. If you're lucky, you will find the data directly in the list of XHR requests, as in this case:

http://www.govliquidation.com/json/buyer_ux/salescalendar.js

Note that this course of action may infringe on a license or terms of use. Clear this with the webmaster, data source, or copyright owner before proceeding: detecting and blocking this kind of scraping is very easy, and identifying you is probably only slightly less so.

Anyway, if you issue the same call from PHP, you can scrape the data directly (provided there is no session or authentication issue, as seems to be the case here) with very simple code:

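<?php

    $url = "http://www.govliquidation.com/json/buyer_ux/salescalendar.js";

    // Fetch the raw JSON that the page itself requests via XHR
    $json = file_get_contents($url);
    if ($json === false) {
        die('Request failed');
    }

    // Decode into nested stdClass objects; null means invalid JSON
    $data = json_decode($json);
    if ($data === null) {
        die('Invalid JSON: ' . json_last_error_msg());
    }

?>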

This yields a data object that you can inspect and convert to CSV with a simple loop. Its print_r() dump begins:

stdClass Object
(
    [result] => stdClass Object
        (
            [events] => Array
                (
                    [0] => stdClass Object
                        (
                            [yahoo_dur] => 11300
                            [closing_today] => 0
                            [language_code] => en
                            [mixed_id] => 9297
                            [event_id] => 9297
                            [close_meridian] => PM
                            [commercial_sale_flag] => 0
                            [close_time] => 01/06/2014
                            [award_time_unixtime] => 1389070800
                            [category] => Tires, Parts & Components
                            [open_time_unixtime] => 1388638800
                            [yahoo_date] => 20140102T000000Z
                            [open_time] => 01/02/2014
                            [event_close_time] => 2014-01-06 17:00:00
                            [display_event_id] => 9297
                            [type_code] => X3
                            [title] => Truck Drive Axles @ Killeen, TX
                            [special_flag] => 1
                            [demil_flag] => 0
                            [google_close] => 20140106
                            [event_open_time] => 2014-01-02 00:00:00
                            [google_open] => 20140102
                            [third_party_url] =>
                            [bid_package_flag] => 0
                            [is_open] => 1
                            [fda_count] => 0
                            [close_time_unixtime] => 1389045600

You retrieve $data->result->events, call fputcsv() on each of its items converted to array form, and Bob's your uncle.
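The loop could look roughly like the sketch below. To keep it runnable offline it decodes a small sample payload, cut down to four of the fields from the dump above, instead of fetching the URL. The choice of fields, the php://memory target, and the header row are illustrative assumptions, not part of the original answer:

```php
<?php
// Sample payload shaped like the dump above, cut down to a few fields.
// In real use $json would come from file_get_contents($url).
$json = '{"result":{"events":[
    {"event_id":"9297","title":"Truck Drive Axles @ Killeen, TX",
     "open_time":"01/02/2014","close_time":"01/06/2014"}
]}}';

$data = json_decode($json);
if ($data === null || !isset($data->result->events)) {
    exit("Unexpected JSON structure\n");
}

// Write to an in-memory stream; a file path would work the same way.
$out = fopen('php://memory', 'w+');
$headerWritten = false;

foreach ($data->result->events as $event) {
    $row = (array) $event; // stdClass object -> associative array

    // Use the keys of the first event as the CSV header row.
    if (!$headerWritten) {
        fputcsv($out, array_keys($row), ',', '"', '\\');
        $headerWritten = true;
    }
    fputcsv($out, array_values($row), ',', '"', '\\');
}

// Read the finished CSV back out of the stream.
rewind($out);
$csv = stream_get_contents($out);
fclose($out);

echo $csv;
```

This assumes every event is a flat object with the same keys in the same order. If some fields turn out to hold nested arrays or objects, they would need to be flattened or dropped before being passed to fputcsv().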
