如何从需要cookie登录的网站中抓取PHP中的网站内容？ [英] How can I scrape website content in PHP from a website that requires a cookie login?

查看：140 发布时间：2017/1/6 9:53:36 php cookies scraper snoopy goutte

本文介绍了如何从需要cookie登录的网站中抓取PHP中的网站内容？的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我的问题是，它不只需要一个基本的cookie，而是要求一个会话cookie和随机生成的ID。我想这意味着我需要使用带有cookie jar的web浏览器模拟器？

My problem is that it doesn't just require a basic cookie, but rather asks for a session cookie, and for randomly generated IDs. I think this means I need to use a web browser emulator with a cookie jar?

我试图使用Snoopy，Goutte和其他几个网络浏览器模拟器，但是因为我还没有能够找到如何接收cookies的教程。

I have tried to use Snoopy, Goutte and a couple of other web browser emulators, but as of yet I have not been able to find tutorials on how to receive cookies. I am getting a little desperate!

任何人都可以给我一个如何接受Snoopy或Goutte中的Cookie的示例？

Can anyone give me an example of how to accept cookies in Snoopy or Goutte?

提前感谢！

面向对象的回答

我们在一个名为 Browser 的类中实现尽可能多的上述答案，该类应提供正常的导航功能。

Object-Oriented answer

We implement as much as possible of the above answer in one class called Browser that should supply the normal navigation features.

然后，我们应该能够把一个非常简单的形式的站点特定的代码，放在一个新的派生类，我们称之为 FooBrowser ，执行刮 Foo 。

Then we should be able to put the site-specific code, in very simple form, in a new derived class that we call, say, FooBrowser, that performs scraping of the site Foo.

导出浏览器的类必须提供一些特定于站点的函数，例如 path（）函数允许存储站点特定的信息，例如

The class deriving Browser must supply some site-specific function such as a path() function allowing to store site-specific information, for example

function path($basename) {
    return '/var/tmp/www.foo.bar/' . $basename;
}

abstract class Browser
{
    private $options = [];
    private $state   = [];
    protected $cookies;

    abstract protected function path($basename);

    public function __construct($site, $options = []) {
        $this->cookies   = $this->path('cookies');
        $this->options  = array_merge(
            [
                'site'      => $site,
                'userAgent' => 'Mozilla/5.0 (Windows NT 5.1; rv:16.0) Gecko/20100101 Firefox/16.0 - LeoScraper',
                'waitTime'  => 250000,
            ],
            $options
        );
        $this->state = [
            'referer' => '/',
            'url'     => '',
            'curl'    => '',
        ];
        $this->__wakeup();
    }

    /**
     * Reactivates after sleep (e.g. in session) or creation
     */
    public function __wakeup() {
        $this->state['curl'] = curl_init();
        $this->config([
            CURLOPT_USERAGENT       => $this->options['userAgent'],
            CURLOPT_ENCODING        => '',
            CURLOPT_NOBODY          => false,
            // ...retrieving the body...
            CURLOPT_BINARYTRANSFER  => true,
            // ...as binary...
            CURLOPT_RETURNTRANSFER  => true,
            // ...into $ret...
            CURLOPT_FOLLOWLOCATION  => true,
            // ...following redirections...
            CURLOPT_MAXREDIRS       => 5,
            // ...reasonably...
            CURLOPT_COOKIEFILE      => $this->cookies,
            // Save these cookies
            CURLOPT_COOKIEJAR       => $this->cookies,
            // (already set above)
            CURLOPT_CONNECTTIMEOUT  => 30,
            // Seconds
            CURLOPT_TIMEOUT         => 300,
            // Seconds
            CURLOPT_LOW_SPEED_LIMIT => 16384,
            // 16 Kb/s
            CURLOPT_LOW_SPEED_TIME  => 15,
        ]);
    }

    /**
     * Imports an options array.
     *
     * @param array $opts
     * @throws DetailedError
     */
    private function config(array $opts = []) {
        foreach ($opts as $key => $value) {
            if (true !== curl_setopt($this->state['curl'], $key, $value)) {
                throw new \Exception('Could not set cURL option');
            }
        }
    }

    private function perform($url) {
        $this->state['referer'] = $this->state['url'];
        $this->state['url'] = $url;
        $this->config([
            CURLOPT_URL     => $this->options['site'] . $this->state['url'],
            CURLOPT_REFERER => $this->options['site'] . $this->state['referer'],
        ]);
        $response = curl_exec($this->state['curl']);
        // Should we ever want to randomize waitTime, do so here.
        usleep($this->options['waitTime']);

        return $response;
    }

    /**
     * Returns a configuration option.
     * @param string $key       configuration key name
     * @param string $value     value to set
     * @return mixed
     */
    protected function option($key, $value = '__DEFAULT__') {
        $curr   = $this->options[$key];
        if ('__DEFAULT__' !== $value) {
            $this->options[$key]    = $value;
        }
        return $curr;
    }

    /**
     * Performs a POST.
     *
     * @param $url
     * @param $fields
     * @return mixed
     */
    public function post($url, array $fields) {
        $this->config([
            CURLOPT_POST       => true,
            CURLOPT_POSTFIELDS => http_build_query($fields),
        ]);
        return $this->perform($url);
    }

    /**
     * Performs a GET.
     *
     * @param       $url
     * @param array $fields
     * @return mixed
     */
    public function get($url, array $fields = []) {
        $this->config([ CURLOPT_POST => false ]);
        if (empty($fields)) {
            $query = '';
        } else {
            $query = '?' . http_build_query($fields);
        }
        return $this->perform($url . $query);
    }
}

现在要抓取FooSite：

Now to scrape FooSite:

/* WWW_FOO_COM requires username and password to construct */

class WWW_FOO_COM_Browser extends Browser
{
    private $loggedIn   = false;

    public function __construct($username, $password) {
        parent::__construct('http://www.foo.bar.baz', [
            'username'  => $username,
            'password'  => $password,
            'waitTime'  => 250000,
            'userAgent' => 'FooScraper',
            'cache'     => true
        ]);
        // Open the session
        $this->get('/');
        // Navigate to the login page
        $this->get('/login.do');
    }

    /**
     * Perform login.
     */
    public function login() {
        $response = $this->post(
            '/ajax/loginPerform',
            [
                'j_un'    => $this->option('username'),
                'j_pw'    => $this->option('password'),
            ]
        );
        // TODO: verify that response is OK.
        // if (!strstr($response, "Welcome " . $this->option('username'))
        //     throw new \Exception("Bad username or password")
        $this->loggedIn = true;
        return true;
    }

    public function scrape($entry) {
        // We could implement caching to avoid scraping the same entry
        // too often. Save $data into path("entry-" . md5($entry))
        // and verify the filemtime of said file, is it newer than time()
        // minus, say, 86400 seconds? If yes, return file_get_content and
        // leave remote site alone.
        $data = $this->get(
            '/foobars/baz.do',
            [
                'ticker' => $entry
            ]
        );
        return $data;
    }

现在实际的抓取代码是：

Now the actual scraping code would be:

    $scraper = new WWW_FOO_COM_Browser('lserni', 'mypassword');
    if (!$scraper->login()) {
        throw new \Exception("bad user or pass");
    }
    foreach ($entries as $entry) {
        $html = $scraper->scrape($entry);
        // Parse HTML
    }

强制通知：使用合适的解析器从原始HTML获取数据。

这篇关于如何从需要cookie登录的网站中抓取PHP中的网站内容？的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

如何从需要cookie登录的网站中抓取PHP中的网站内容？ [英] How can I scrape website content in PHP from a website that requires a cookie login?

问题描述

推荐答案

面向对象的回答

Object-Oriented answer

相关文章

PHP最新文章

热门教程

热门工具

登录关闭

如何从需要cookie登录的网站中抓取PHP中的网站内容？ [英] How can I scrape website content in PHP from a website that requires a cookie login?

问题描述

推荐答案

面向对象的回答

Object-Oriented answer

相关文章

PHP最新文章

热门教程

热门工具

登录 关闭

登录关闭