如何在PHP中实现网页抓取工具? [英] How to implement a web scraper in PHP?

查看:104
本文介绍了如何在PHP中实现网页抓取工具?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

哪些内置PHP函数对Web抓取有用?有什么好的资源(网络或印刷)可帮助您快速掌握PHP的网络抓取功能?

What built-in PHP functions are useful for web scraping? What are some good resources (web or print) for getting up to speed on web scraping with PHP?

推荐答案

抓取通常包括3个步骤:

Scraping generally encompasses 3 steps:

  • 首先您要获取或发布您的请求 到指定的URL
  • 接下来您会收到 作为返回的html 回应
  • 最后,您解析出 该html您想要的文本 刮擦.
  • first you GET or POST your request to a specified URL
  • next you receive the html that is returned as the response
  • finally you parse out of that html the text you'd like to scrape.

要完成第1步和第2步,下面是一个简单的php类,该类使用Curl通过GET或POST获取网页.取回HTML后,您只需使用正则表达式通过解析您要抓取的文本来完成第3步.

To accomplish steps 1 and 2, below is a simple php class which uses Curl to fetch webpages using either GET or POST. After you get the HTML back, you just use Regular Expressions to accomplish step 3 by parsing out the text you'd like to scrape.

对于正则表达式,我最喜欢的教程站点如下: 正则表达式教程

For regular expressions, my favorite tutorial site is the following: Regular Expressions Tutorial

我最喜欢使用RegExs的程序是 Regex Buddy .即使您无意购买该产品,我也建议您尝试该产品的演示.它是一种无价的工具,甚至可以为您使用选择的语言(包括php)制作的正则表达式生成代码.

My Favorite program for working with RegExs is Regex Buddy. I would advise you to try the demo of that product even if you have no intention of buying it. It is an invaluable tool and will even generate code for your regexs you make in your language of choice (including php).

用法:

$curl = new Curl(); $html = $curl->get("http://www.google.com");

$curl = new Curl(); $html = $curl->get("http://www.google.com");

// now, do your regex work against $html

// now, do your regex work against $html

PHP类:



<?php

class Curl
{       

    public $cookieJar = "";

    public function __construct($cookieJarFile = 'cookies.txt') {
        $this->cookieJar = $cookieJarFile;
    }

    function setup()
    {


        $header = array();
        $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
        $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
        $header[] =  "Cache-Control: max-age=0";
        $header[] =  "Connection: keep-alive";
        $header[] = "Keep-Alive: 300";
        $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
        $header[] = "Accept-Language: en-us,en;q=0.5";
        $header[] = "Pragma: "; // browsers keep this blank.


        curl_setopt($this->curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7');
        curl_setopt($this->curl, CURLOPT_HTTPHEADER, $header);
        curl_setopt($this->curl,CURLOPT_COOKIEJAR, $this->cookieJar); 
        curl_setopt($this->curl,CURLOPT_COOKIEFILE, $this->cookieJar);
        curl_setopt($this->curl,CURLOPT_AUTOREFERER, true);
        curl_setopt($this->curl,CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($this->curl,CURLOPT_RETURNTRANSFER, true);  
    }


    function get($url)
    { 
        $this->curl = curl_init($url);
        $this->setup();

        return $this->request();
    }

    function getAll($reg,$str)
    {
        preg_match_all($reg,$str,$matches);
        return $matches[1];
    }

    function postForm($url, $fields, $referer='')
    {
        $this->curl = curl_init($url);
        $this->setup();
        curl_setopt($this->curl, CURLOPT_URL, $url);
        curl_setopt($this->curl, CURLOPT_POST, 1);
        curl_setopt($this->curl, CURLOPT_REFERER, $referer);
        curl_setopt($this->curl, CURLOPT_POSTFIELDS, $fields);
        return $this->request();
    }

    function getInfo($info)
    {
        $info = ($info == 'lasturl') ? curl_getinfo($this->curl, CURLINFO_EFFECTIVE_URL) : curl_getinfo($this->curl, $info);
        return $info;
    }

    function request()
    {
        return curl_exec($this->curl);
    }
}

?>

这篇关于如何在PHP中实现网页抓取工具?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆