HTML scraping and CSS queries


Question

What are the advantages and disadvantages of the following libraries?

  • PHP Simple HTML DOM Parser

  • QP

  • phpQuery




    Of the above, I've used QP, which failed to parse invalid HTML, and simpleDomParser, which does a good job but leaks some memory because of its object model. You can keep that under control by calling $object->clear(); unset($object); when you no longer need an object.

    Are there any more scrapers? What are your experiences with them? I'm going to make this a community wiki so that we can build a useful list of libraries for scraping.


    I did some tests based on Byron's answer:

        <?php
        include("lib/simplehtmldom/simple_html_dom.php");
        include("lib/phpQuery/phpQuery/phpQuery.php");

        echo "<pre>";

        // Fetch the page once so that all three parsers work on identical input.
        $html = file_get_contents("http://stackoverflow.com/search?q=favorite+programmer+cartoon");
        $data['pq'] = $data['dom'] = $data['simple_dom'] = array();

        // 1. Built-in DOM extension with XPath.
        $timer_start = microtime(true);

        $dom = new DOMDocument();
        @$dom->loadHTML($html); // @ silences the warnings raised by invalid markup
        $x = new DOMXPath($dom);

        foreach ($x->query("//a") as $node) {
            $data['dom'][] = $node->getAttribute("href");
        }

        foreach ($x->query("//img") as $node) {
            $data['dom'][] = $node->getAttribute("src");
        }

        foreach ($x->query("//input") as $node) {
            $data['dom'][] = $node->getAttribute("name");
        }

        $dom_time = microtime(true) - $timer_start;
        echo "dom: \t\t $dom_time . Got " . count($data['dom']) . " items \n";

        // 2. phpQuery. Iterating a result set yields plain DOMElement nodes,
        //    so attributes are read with getAttribute().
        $timer_start = microtime(true);
        $doc = phpQuery::newDocument($html);

        foreach ($doc->find("a") as $node) {
            $data['pq'][] = $node->getAttribute("href");
        }

        foreach ($doc->find("img") as $node) {
            $data['pq'][] = $node->getAttribute("src");
        }

        foreach ($doc->find("input") as $node) {
            $data['pq'][] = $node->getAttribute("name");
        }

        $time = microtime(true) - $timer_start;
        echo "PQ: \t\t $time . Got " . count($data['pq']) . " items \n";

        // 3. Simple HTML DOM, which exposes attributes as object properties.
        $timer_start = microtime(true);
        $simple_dom = new simple_html_dom();
        $simple_dom->load($html);

        foreach ($simple_dom->find("a") as $node) {
            $data['simple_dom'][] = $node->href;
        }

        foreach ($simple_dom->find("img") as $node) {
            $data['simple_dom'][] = $node->src;
        }

        foreach ($simple_dom->find("input") as $node) {
            $data['simple_dom'][] = $node->name;
        }

        $simple_dom_time = microtime(true) - $timer_start;
        echo "simple_dom: \t $simple_dom_time . Got " . count($data['simple_dom']) . " items \n";

        echo "</pre>";
    

    and got

    dom:         0.00359296798706 . Got 115 items 
    PQ:          0.010568857193 . Got 115 items 
    simple_dom:  0.0770139694214 . Got 115 items 
    

    Answer

    I used to use Simple HTML DOM exclusively, until some bright SO'ers showed me the light. Hallelujah.

    Just use the built-in DOM functions. They are written in C and are part of the PHP core, and they are faster and more efficient than any third-party solution. With Firebug, getting an XPath query is very simple. This simple change has made my PHP-based scrapers run faster, while saving my precious time.
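
    To make the built-in approach concrete, here is a minimal sketch that runs an XPath query over an inline snippet. The markup, the "item" class name and the variable names are invented for illustration; the XPath expression is a rough equivalent of the CSS selector "div.item a", and libxml_use_internal_errors() is used as an alternative to the @ operator for quieting warnings about invalid markup.

```php
<?php
// Minimal sketch: query an inline HTML snippet with PHP's built-in DOM extension.
// The markup and the "item" class are made up for this example.

$html = <<<HTML
<html><body>
  <div class="item"><a href="/first">First</a></div>
  <div class="item"><a href="/second">Second</a><img src="/logo.png"></div>
  <p><a href="/outside">Outside</a>
</body></html>
HTML;

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Rough XPath equivalent of the CSS selector "div.item a".
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " item ")]//a';

$links = [];
foreach ($xpath->query($query) as $node) {
    $links[] = $node->getAttribute('href');
}

print_r($links);
```

    The unclosed p element is deliberate: the parser recovers from it, and the query still returns only the two links inside the div.item elements.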

    My scrapers used to take about 60 megabytes to scrape 10 sites asynchronously with curl. That was even with the Simple HTML DOM memory fix you mentioned.

    Now my php processes never go above 8 megabytes.
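
    If you want to check figures like these on your own pages, something along the following lines may help. It is only a sketch: the synthetic markup and the helper name format_mib() are assumptions made for the example, and memory_get_usage() only counts memory allocated through PHP's own allocator, so it may not include everything an extension allocates. Treat the output as a rough indication and compare it with the process size reported by your operating system.

```php
<?php
// Sketch: report how much memory a built-in DOM parse adds to the PHP process.
// format_mib() and the generated markup are illustrative, not from any library.

function format_mib(int $bytes): string
{
    return number_format($bytes / 1048576, 2) . ' MiB';
}

// Build a synthetic page with 2,000 links so the example needs no network access.
$rows = [];
for ($i = 0; $i < 2000; $i++) {
    $rows[] = '<li><a href="/page/' . $i . '">Page ' . $i . '</a></li>';
}
$html = '<html><body><ul>' . implode('', $rows) . '</ul></body></html>';

$before = memory_get_usage();

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$hrefs = [];
foreach ($xpath->query('//a') as $node) {
    $hrefs[] = $node->getAttribute('href');
}

$after = memory_get_usage();

// Drop the references so PHP can free the document.
unset($xpath, $dom);

echo 'Links found: ' . count($hrefs) . "\n";
echo 'Memory added while parsing: ' . format_mib($after - $before) . "\n";
echo 'Peak for the whole script: ' . format_mib(memory_get_peak_usage()) . "\n";
```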

    Highly recommended.

    EDIT

    Okay, I did some benchmarks. The built-in DOM is at least an order of magnitude faster.

    Built in php DOM: 0.007061
    Simple html  DOM: 0.117781
    
        <?php
        include("../lib/simple_html_dom.php");

        // Fetch the page once so that both parsers work on identical input.
        $html = file_get_contents("http://stackoverflow.com/search?q=favorite+programmer+cartoon");
        $data['dom'] = $data['simple_dom'] = array();

        // Built-in DOM extension with XPath.
        $timer_start = microtime(true);

        $dom = new DOMDocument();
        @$dom->loadHTML($html); // @ silences the warnings raised by invalid markup
        $x = new DOMXPath($dom);

        foreach ($x->query("//a") as $node) {
            $data['dom'][] = $node->getAttribute("href");
        }

        foreach ($x->query("//img") as $node) {
            $data['dom'][] = $node->getAttribute("src");
        }

        foreach ($x->query("//input") as $node) {
            $data['dom'][] = $node->getAttribute("name");
        }

        $dom_time = microtime(true) - $timer_start;

        echo "built in php DOM : $dom_time\n";

        // Simple HTML DOM, which exposes attributes as object properties.
        $timer_start = microtime(true);
        $simple_dom = new simple_html_dom();
        $simple_dom->load($html);

        foreach ($simple_dom->find("a") as $node) {
            $data['simple_dom'][] = $node->href;
        }

        foreach ($simple_dom->find("img") as $node) {
            $data['simple_dom'][] = $node->src;
        }

        foreach ($simple_dom->find("input") as $node) {
            $data['simple_dom'][] = $node->name;
        }

        $simple_dom_time = microtime(true) - $timer_start;

        echo "simple html  DOM : $simple_dom_time\n";
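
    A single timed run can be noisy, so if you repeat this comparison it may be worth timing each parser several times and keeping the best result. The helper below is a sketch of that idea: benchmark() and its parameters are hypothetical names, not part of any library mentioned here, and the input is synthetic so the example runs without network access. Only the built-in DOM is timed; a third-party parser could be wrapped in a closure the same way.

```php
<?php
// Sketch of a repeatable timing helper. benchmark() is a hypothetical name
// introduced for this example.

/**
 * Run $task $runs times and return the fastest wall-clock time in seconds.
 */
function benchmark(callable $task, int $runs = 5): float
{
    $best = INF;
    for ($i = 0; $i < $runs; $i++) {
        $start = microtime(true);
        $task();
        $best = min($best, microtime(true) - $start);
    }
    return $best;
}

// Synthetic input: 500 links and 500 images.
$html = '<html><body>'
    . str_repeat('<a href="/x">x</a><img src="/y.png">', 500)
    . '</body></html>';

$count = 0;
$seconds = benchmark(function () use ($html, &$count) {
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    // Count every link and image in one query.
    $count = $xpath->query('//a | //img')->length;
});

printf("built-in DOM: %.6f s for %d nodes\n", $seconds, $count);
```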
    
