How to use Scrapy to crawl all items in a website


Problem Description

I want to use recursion to crawl all the links on a website and parse all the linked pages, extracting every detail link found in them. If a page link conforms to a rule, that link is an item whose details I want to parse. I use the code below:

# -*- coding: utf-8 -*-
import re

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

# DmovieItem, Movie, cookies, redis_conn and select_first are defined
# elsewhere in the project (item class, database model and helpers).


class DmovieSpider(BaseSpider):
    name = "dmovie"
    allowed_domains = ["movie.douban.com"]
    start_urls = ['http://movie.douban.com/']

    def parse(self, response):
        item = DmovieItem()
        hxl = HtmlXPathSelector(response)
        urls = hxl.select("//a/@href").extract()
        all_this_urls = []
        for url in urls:
            if re.search(r"movie\.douban\.com/subject/\d+/$", url):
                # Detail pages go to parse_detail.
                yield Request(url=url, cookies=cookies, callback=self.parse_detail)
            elif ("movie.douban.com" in url) and ("movie.douban.com/people" not in url) and ("movie.douban.com/celebrity" not in url) and ("comment" not in url):
                if ("update" not in url) and ("add" not in url) and ("trailer" not in url) and ("cinema" not in url) and (not redis_conn.sismember("crawledurls", url)):
                    # Every other in-domain page is crawled recursively.
                    all_this_urls.append(Request(url=url, cookies=cookies, callback=self.parse))
        # Remember the current page so it is not scheduled again.
        redis_conn.sadd("crawledurls", response.url)
        for i in all_this_urls:
            yield i

    def parse_detail(self, response):
        hxl = HtmlXPathSelector(response)
        title = hxl.select("//span[@property='v:itemreviewed']/text()").extract()
        title = select_first(title)
        img = hxl.select("//div[@class='grid-16-8 clearfix']//a[@class='nbgnbg']/img/@src").extract()
        img = select_first(img)
        info = hxl.select("//div[@class='grid-16-8 clearfix']//div[@id='info']")
        director = info.select("//a[@rel='v:directedBy']/text()").extract()
        director = select_first(director)
        actors = info.select("//a[@rel='v:starring']/text()").extract()
        m_type = info.select("//span[@property='v:genre']/text()").extract()
        release_date = info.select("//span[@property='v:initialReleaseDate']/text()").extract()
        release_date = select_first(release_date)

        d_rate = info.select("//strong[@class='ll rating_num']/text()").extract()
        d_rate = select_first(d_rate)

        info = select_first(info)
        post = hxl.select("//div[@class='grid-16-8 clearfix']//div[@class='related-info']/div[@id='link-report']").extract()
        post = select_first(post)

        # Store the movie only if this URL has not been saved before.
        movie_db = Movie()
        movie_db.name = title.encode("utf-8")
        movie_db.dis_time = release_date.encode("utf-8")
        movie_db.description = post.encode("utf-8")
        movie_db.actors = "::".join(actors).encode("utf-8")
        movie_db.director = director.encode("utf-8")
        movie_db.mtype = "::".join(m_type).encode("utf-8")
        movie_db.origin = "movie.douban.com"
        movie_db.d_rate = d_rate.encode("utf-8")
        exist_item = Movie.where(origin_url=response.url).select().fetchone()
        if not exist_item:
            movie_db.origin_url = response.url
            movie_db.save()
            print "succeeded!"

urls contains all the links on the page. If one of the urls is a detail page I want to parse, I yield a Request whose callback is parse_detail; otherwise I yield a Request whose callback is parse.

In this way I crawled some pages, but the result does not seem complete: it looks like some pages were never visited. Could you tell me why? Is there a way to crawl all the pages correctly?

Recommended Answer

Try CrawlSpider.

Use crawl rules to filter the URLs (a sketch of this approach is shown below).
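Below is a minimal sketch of what the spider above could look like as a CrawlSpider. The import paths assume Scrapy 1.x or later (older releases used scrapy.contrib.spiders and SgmlLinkExtractor); DmovieCrawlSpider and parse_movie are illustrative names, and the Rule patterns simply mirror the filters in the original parse method, so the real extraction logic from parse_detail would go in the callback.

# A sketch of the CrawlSpider approach (assumes Scrapy >= 1.0 import paths).
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class DmovieCrawlSpider(CrawlSpider):
    name = "dmovie_crawl"
    allowed_domains = ["movie.douban.com"]
    start_urls = ["http://movie.douban.com/"]

    rules = (
        # Links matching the detail-page pattern are parsed as items.
        Rule(LinkExtractor(allow=r"movie\.douban\.com/subject/\d+/$"),
             callback="parse_movie"),
        # Other in-domain links are followed (Scrapy's built-in dupefilter
        # drops URLs that have already been requested) but not parsed.
        Rule(LinkExtractor(allow=r"movie\.douban\.com",
                           deny=(r"/people", r"/celebrity", r"comment",
                                 r"update", r"add", r"trailer", r"cinema")),
             follow=True),
    )

    def parse_movie(self, response):
        # The extraction logic from parse_detail above would go here;
        # this only shows the shape of the callback.
        title = response.xpath(
            "//span[@property='v:itemreviewed']/text()").extract_first()
        yield {"title": title, "origin_url": response.url}

Note that a CrawlSpider must not override parse itself, because CrawlSpider uses it internally to apply the rules; the Redis-based deduplication in the original spider is also unnecessary here, since Scrapy's scheduler already filters duplicate requests within a run.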

Then set DEPTH_LIMIT = 0 in settings.py to make sure the spider crawls all the pages in the website.
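For reference, this is a single line in settings.py; 0 is also Scrapy's default value and simply means the crawl depth is unlimited:

# settings.py
DEPTH_LIMIT = 0  # 0 disables the depth limit, so following is bounded only by the Rules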

