How can a webpage with infinite scroll be scraped using PHP cURL?


Question

I'd like to know how to scrape, in a loop (page 1, page 2, etc.), a webpage that uses infinite scrolling (imgur, for example).

I tried the code below, but it returns only the first page. How can I trigger loading of the next page when the site uses an infinite-scroll template?

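<?php
// Execute a cURL handle and follow redirects, doing so manually when
// CURLOPT_FOLLOWLOCATION is unavailable (open_basedir or safe_mode set).
function curl_exec_follow($ch, &$maxredirect = null) {
    $mr = $maxredirect === null ? 10 : intval($maxredirect);
    // safe_mode no longer exists in current PHP; ini_get() then returns false.
    if (ini_get('open_basedir') == '' && !filter_var(ini_get('safe_mode'), FILTER_VALIDATE_BOOLEAN)) {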
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, $mr > 0);
        curl_setopt($ch, CURLOPT_MAXREDIRS, $mr);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    } else {
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);

        if ($mr > 0) {
            $original_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
            $newurl = $original_url;
            $rch = curl_copy_handle($ch);

            curl_setopt($rch, CURLOPT_HEADER, true);
            curl_setopt($rch, CURLOPT_NOBODY, true);
            curl_setopt($rch, CURLOPT_FORBID_REUSE, false);
            do {
                curl_setopt($rch, CURLOPT_URL, $newurl);
                $header = curl_exec($rch);
                if (curl_errno($rch)) {
                    $code = 0;
                } else {
                    $code = curl_getinfo($rch, CURLINFO_HTTP_CODE);
                    if ($code == 301 || $code == 302) {
                        preg_match('/Location:(.*?)\n/', $header, $matches);
                        $newurl = trim(array_pop($matches));

                        // if no scheme is present then the new url is a
                        // relative path and thus needs some extra care
                        if(!preg_match("/^https?:/i", $newurl)){
                            $newurl = $original_url . $newurl;
                        }
                    } else {
                        $code = 0;
                    }
                }
            } while ($code && --$mr);
            curl_close($rch);
            if (!$mr) {
                if ($maxredirect === null)
                    trigger_error('Too many redirects.', E_USER_WARNING);
                else
                    $maxredirect = 0;
                return false;
            }
            curl_setopt($ch, CURLOPT_URL, $newurl);
        }
    }
    return curl_exec($ch);
}

$ch = curl_init('http://www.imgur.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec_follow($ch);
curl_close($ch);

echo $data;
?>


Recommended answer

cURL works by fetching the source code of a webpage. Your code gathers only the HTML of the original page. In the case of imgur, that includes about 40 images plus the rest of the page layout.
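To see what that static HTML actually contains, the response body can be parsed for image elements. The following is a minimal sketch, assuming the DOM extension is enabled; `extract_image_sources` is a hypothetical helper name, and the sample markup is made up to stand in for a fetched page.

```php
<?php
// Hypothetical helper (not from the original post): list the src attributes
// of all <img> elements in an HTML string, e.g. the body returned by curl_exec().
function extract_image_sources(string $html): array
{
    // DOMDocument::loadHTML() rejects empty input, so handle it up front.
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Real-world markup is rarely valid; buffer parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if ($src !== '') {
            $sources[] = $src;
        }
    }
    return $sources;
}

// Made-up markup standing in for a fetched page.
$sample = '<div><img src="//i.example.com/a.jpg"><img src="//i.example.com/b.jpg"></div>';
print_r(extract_image_sources($sample));
```

Running this against the body returned for the homepage would list only the images present in the initial HTML, not the ones added later while scrolling.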

This original source code doesn't change when you scroll down. The HTML inside your browser does, however. This is done with AJAX: the page you are looking at requests additional content from a second URL.

If you use Firebug (for Firefox) or Google Chrome's page inspector, you can monitor these requests in the Net or Network tab, respectively. When you scroll down, the page makes roughly 45 more requests (mostly for images). You'll also see that it requests this page:

http://imgur.com/gallery/hot/viral/day/page/0?scrolled&set=1

The JavaScript on the imgur homepage appends the HTML returned by this URL to the bottom of the page. If you want a list of images, you would probably want to query this URL directly (or the API, as the other Chris said). You can vary the numbers at the end of the URL to get more images.
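A minimal sketch of that idea follows. It assumes the `page/N` segment of the URL above is the page counter and that the endpoint still responds as it did when this answer was written; neither is guaranteed. The function names (`build_gallery_page_url`, `fetch_with_curl`, `fetch_gallery_pages`) are hypothetical, and the fetcher is passed in as a parameter so the loop can be tried without network access.

```php
<?php
// Build the URL of one "scrolled" gallery page. Path and query string are
// copied from the request observed above; the site may have changed since.
function build_gallery_page_url(int $page): string
{
    return 'http://imgur.com/gallery/hot/viral/day/page/' . $page . '?scrolled&set=1';
}

// Fetch a URL with cURL; returns the body, or null on any failure.
function fetch_with_curl(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return ($body === false || $status !== 200) ? null : $body;
}

// Collect the HTML fragments for pages 0 .. $maxPages - 1, stopping at the
// first failed or empty page. $fetch defaults to the cURL fetcher above.
function fetch_gallery_pages(int $maxPages, ?callable $fetch = null): array
{
    $fetch = $fetch ?? 'fetch_with_curl';
    $pages = [];
    for ($page = 0; $page < $maxPages; $page++) {
        $html = $fetch(build_gallery_page_url($page));
        if ($html === null || trim($html) === '') {
            break;
        }
        $pages[$page] = $html;
    }
    return $pages;
}

// Demonstration with a stub fetcher (no network): pretend only pages 0 and 1 exist.
$stub = function (string $url): ?string {
    return preg_match('#/page/[01]\?#', $url) ? "<div>fragment for $url</div>" : null;
};
print_r(array_keys(fetch_gallery_pages(5, $stub)));
```

Calling `fetch_gallery_pages(3)` without a second argument would issue real requests for pages 0 to 2. Capping the number of pages keeps the loop from running indefinitely on a feed that never ends.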
