Scrapy recursive link crawler

Problem description

It starts with a url on the web (ex: http://python.org), fetches the web page corresponding to that url, and parses all the links on that page into a repository of links. Next, it fetches the contents of any url from the repository just created, parses the links from this new content into the repository, and continues this process for all links in the repository until it is stopped or a given number of links has been fetched.

How can I do that using Python and Scrapy? I am able to scrape all the links on a webpage, but how do I perform it recursively in depth?

Recommended answer

A few remarks:

  • you don't need Scrapy for such a simple task; urllib (or Requests) and an HTML parser (Beautiful Soup, etc.) can do the job
  • I don't recall where I heard it, but I think it's better to crawl using a BFS algorithm: you can easily avoid circular references (a minimal sketch follows this list)
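
A minimal BFS sketch along those lines, using Requests and Beautiful Soup (bs4) on Python 3; the starting URL, page limit and function name are only illustrative, not part of the original answer:

from collections import deque
from urllib.parse import urljoin

import requests                    # pip install requests
from bs4 import BeautifulSoup      # pip install beautifulsoup4


def bfs_crawl(start_url, max_pages=10):
    """Breadth-first crawl: a FIFO queue plus a visited set avoids circular references."""
    queue = deque([start_url])
    visited = set()

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)

        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue                               # Skip pages that fail to download

        soup = BeautifulSoup(html, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])         # Resolve relative links to absolute ones
            if link.startswith("http") and link not in visited:
                queue.append(link)

    return visited


if __name__ == "__main__":
    for page in bfs_crawl("http://www.python.org/", max_pages=5):
        print(page)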

Below is a simple implementation: it does not fetch internal links (only hyperlinks given in absolute form), nor does it have any error handling (403, 404, no links, ...), and it is abysmally slow (the multiprocessing module can help a lot in this case; a parallel-fetching sketch appears after the output below).

from bs4 import BeautifulSoup    # pip install beautifulsoup4
import urllib.request
import random


class Crawler(object):
    """Naive crawler: fetches pages and collects the absolute links they contain."""

    def __init__(self):

        self.soup = None                                        # BeautifulSoup object for the current page
        self.current_page   = "http://www.python.org/"          # Current page's address
        self.links          = set()                             # Every link found so far
        self.visited_links  = set()                             # Links whose pages have already been fetched

        self.counter = 0 # Simple counter for debug purpose

    def open(self):

        # Open url
        print(self.counter, ":", self.current_page)
        res = urllib.request.urlopen(self.current_page)
        html_code = res.read()
        self.visited_links.add(self.current_page)

        # Fetch every link
        self.soup = BeautifulSoup(html_code, "html.parser")

        page_links = []
        try:
            # Only deal with absolute links; skip <a> tags that have no href
            page_links = [href for href in (a.get('href') for a in self.soup.find_all('a'))
                          if href and 'http://' in href]
        except Exception: # Magnificent exception handling
            pass

        # Update links
        self.links = self.links.union(set(page_links))

        # Choose a random url from the non-visited set (if any is left)
        remaining = self.links.difference(self.visited_links)
        if remaining:
            self.current_page = random.choice(list(remaining))
        self.counter += 1

    def run(self):

        # Crawl 3 webpages (or stop early if every url found so far has been visited)
        while len(self.visited_links) < 3:
            self.open()
            if self.links.issubset(self.visited_links):
                break

        for link in self.links:
            print(link)


if __name__ == '__main__':

    C = Crawler()
    C.run()

Output:

In [48]: run BFScrawler.py
0 : http://www.python.org/
1 : http://twistedmatrix.com/trac/
2 : http://www.flowroute.com/
http://www.egenix.com/files/python/mxODBC.html
http://wiki.python.org/moin/PyQt
http://wiki.python.org/moin/DatabaseProgramming/
http://wiki.python.org/moin/CgiScripts
http://wiki.python.org/moin/WebProgramming
http://trac.edgewall.org/
http://www.facebook.com/flowroute
http://www.flowroute.com/
http://www.opensource.org/licenses/mit-license.php
http://roundup.sourceforge.net/
http://www.zope.org/
http://www.linkedin.com/company/flowroute
http://wiki.python.org/moin/TkInter
http://pypi.python.org/pypi
http://pycon.org/#calendar
http://dyn.com/
http://www.google.com/calendar/ical/j7gov1cmnqr9tvg14k621j7t5c%40group.calendar.google.com/public/basic.ics
http://www.pygame.org/news.html
http://www.turbogears.org/
http://www.openbookproject.net/pybiblio/
http://wiki.python.org/moin/IntegratedDevelopmentEnvironments
http://support.flowroute.com/forums
http://www.pentangle.net/python/handbook/
http://dreamhost.com/?q=twisted
http://www.vrplumber.com/py3d.py
http://sourceforge.net/projects/mysql-python
http://wiki.python.org/moin/GuiProgramming
http://software-carpentry.org/
http://www.google.com/calendar/ical/3haig2m9msslkpf2tn1h56nn9g%40group.calendar.google.com/public/basic.ics
http://wiki.python.org/moin/WxPython
http://wiki.python.org/moin/PythonXml
http://www.pytennessee.org/
http://labs.twistedmatrix.com/
http://www.found.no/
http://www.prnewswire.com/news-releases/voip-innovator-flowroute-relocates-to-seattle-190011751.html
http://www.timparkin.co.uk/
http://docs.python.org/howto/sockets.html
http://blog.python.org/
http://docs.python.org/devguide/
http://www.djangoproject.com/
http://buildbot.net/trac
http://docs.python.org/3/
http://www.prnewswire.com/news-releases/flowroute-joins-voxbones-inum-network-for-global-voip-calling-197319371.html
http://www.psfmember.org
http://docs.python.org/2/
http://wiki.python.org/moin/Languages
http://sip-trunking.tmcnet.com/topics/enterprise-voip/articles/341902-grandstream-ip-voice-solutions-receive-flowroute-certification.htm
http://www.twitter.com/flowroute
http://wiki.python.org/moin/NumericAndScientific
http://www.google.com/calendar/ical/b6v58qvojllt0i6ql654r1vh00%40group.calendar.google.com/public/basic.ics
http://freecode.com/projects/pykyra
http://www.xs4all.com/
http://blog.flowroute.com
http://wiki.python.org/moin/PyGtk
http://twistedmatrix.com/trac/
http://wiki.python.org/moin/
http://wiki.python.org/moin/Python2orPython3
http://stackoverflow.com/questions/tagged/twisted
http://www.pycon.org/
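
As noted above, the crawler is abysmally slow because it downloads one page at a time. A rough sketch of how the multiprocessing module could help, assuming the download step is pulled out into a standalone function (the helper names and worker count below are illustrative, not part of the answer's code):

from multiprocessing.dummy import Pool    # Thread-based drop-in for multiprocessing.Pool
import urllib.request


def fetch(url):
    """Download one page; return (url, html) or (url, None) on failure."""
    try:
        with urllib.request.urlopen(url, timeout=10) as res:
            return url, res.read()
    except Exception:
        return url, None


def fetch_batch(urls, workers=8):
    """Fetch a whole batch of urls concurrently instead of one after another."""
    with Pool(workers) as pool:
        return pool.map(fetch, urls)


if __name__ == "__main__":
    pages = fetch_batch(["http://www.python.org/", "http://twistedmatrix.com/trac/"])
    for url, html in pages:
        print(url, "->", "failed" if html is None else "%d bytes" % len(html))

The crawler would then take a batch of unvisited urls from the repository on every round, fetch them in parallel with fetch_batch, and parse each downloaded page as before.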

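For completeness, since the question asks about Scrapy specifically: the same kind of bounded recursive crawl can be written as a Scrapy CrawlSpider with a LinkExtractor rule. The sketch below is only illustrative; the spider name and the settings values are assumptions, not something given in the original answer.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class LinkSpider(CrawlSpider):
    name = "links"                                # Hypothetical spider name
    start_urls = ["http://www.python.org/"]

    custom_settings = {
        "DEPTH_LIMIT": 2,                         # How many levels of links to follow
        "CLOSESPIDER_PAGECOUNT": 50,              # Stop after this many pages
    }

    # Follow every extracted link and hand each fetched page to parse_item
    rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

    def parse_item(self, response):
        yield {"url": response.url}

It could be run with, for example, scrapy runspider link_spider.py -o links.json (the file name is illustrative); DEPTH_LIMIT and CLOSESPIDER_PAGECOUNT bound how deep and how long the crawl runs.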