How to use Scrapy to crawl all items in a website


Problem Description

I want to use recursion to crawl all the links on a website and parse all the linked pages, extracting every detail link found in them. If a page link conforms to a rule, that link is an item whose details I want to parse. I use the code below:

# -*- coding: utf-8 -*-
import re

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

# DmovieItem, Movie, cookies, redis_conn and select_first are defined
# elsewhere in the project (item class, database model and helpers).


class DmovieSpider(BaseSpider):
    name = "dmovie"
    allowed_domains = ["movie.douban.com"]
    start_urls = ['http://movie.douban.com/']

    def parse(self, response):
        item = DmovieItem()
        hxl = HtmlXPathSelector(response)
        urls = hxl.select("//a/@href").extract()
        all_this_urls = []
        for url in urls:
            if re.search(r"movie\.douban\.com/subject/\d+/$", url):
                # Detail pages go to parse_detail.
                yield Request(url=url, cookies=cookies, callback=self.parse_detail)
            elif ("movie.douban.com" in url) and ("movie.douban.com/people" not in url) and ("movie.douban.com/celebrity" not in url) and ("comment" not in url):
                if ("update" not in url) and ("add" not in url) and ("trailer" not in url) and ("cinema" not in url) and (not redis_conn.sismember("crawledurls", url)):
                    # Every other in-domain page is crawled recursively.
                    all_this_urls.append(Request(url=url, cookies=cookies, callback=self.parse))
        # Remember the current page so it is not scheduled again.
        redis_conn.sadd("crawledurls", response.url)
        for i in all_this_urls:
            yield i

    def parse_detail(self, response):
        hxl = HtmlXPathSelector(response)
        title = hxl.select("//span[@property='v:itemreviewed']/text()").extract()
        title = select_first(title)
        img = hxl.select("//div[@class='grid-16-8 clearfix']//a[@class='nbgnbg']/img/@src").extract()
        img = select_first(img)
        info = hxl.select("//div[@class='grid-16-8 clearfix']//div[@id='info']")
        director = info.select("//a[@rel='v:directedBy']/text()").extract()
        director = select_first(director)
        actors = info.select("//a[@rel='v:starring']/text()").extract()
        m_type = info.select("//span[@property='v:genre']/text()").extract()
        release_date = info.select("//span[@property='v:initialReleaseDate']/text()").extract()
        release_date = select_first(release_date)

        d_rate = info.select("//strong[@class='ll rating_num']/text()").extract()
        d_rate = select_first(d_rate)

        info = select_first(info)
        post = hxl.select("//div[@class='grid-16-8 clearfix']//div[@class='related-info']/div[@id='link-report']").extract()
        post = select_first(post)

        # Store the movie only if this URL has not been saved before.
        movie_db = Movie()
        movie_db.name = title.encode("utf-8")
        movie_db.dis_time = release_date.encode("utf-8")
        movie_db.description = post.encode("utf-8")
        movie_db.actors = "::".join(actors).encode("utf-8")
        movie_db.director = director.encode("utf-8")
        movie_db.mtype = "::".join(m_type).encode("utf-8")
        movie_db.origin = "movie.douban.com"
        movie_db.d_rate = d_rate.encode("utf-8")
        exist_item = Movie.where(origin_url=response.url).select().fetchone()
        if not exist_item:
            movie_db.origin_url = response.url
            movie_db.save()
            print "succeeded!"

urls contains all the links on the page. If one of the urls is a detail page I want to parse, I yield a Request whose callback is parse_detail; otherwise I yield a Request whose callback is parse.

In this way I crawled some pages, but the result does not seem complete: it looks like some pages were never visited. Could you tell me why? Is there a way to crawl all the pages correctly?

Recommended Answer

Try CrawlSpider.

Use crawl rules to filter the URLs (a sketch of this approach is shown below).
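Below is a minimal sketch of what the spider above could look like as a CrawlSpider. The import paths assume Scrapy 1.x or later (older releases used scrapy.contrib.spiders and SgmlLinkExtractor); DmovieCrawlSpider and parse_movie are illustrative names, and the Rule patterns simply mirror the filters in the original parse method, so the real extraction logic from parse_detail would go in the callback.

# A sketch of the CrawlSpider approach (assumes Scrapy >= 1.0 import paths).
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class DmovieCrawlSpider(CrawlSpider):
    name = "dmovie_crawl"
    allowed_domains = ["movie.douban.com"]
    start_urls = ["http://movie.douban.com/"]

    rules = (
        # Links matching the detail-page pattern are parsed as items.
        Rule(LinkExtractor(allow=r"movie\.douban\.com/subject/\d+/$"),
             callback="parse_movie"),
        # Other in-domain links are followed (Scrapy's built-in dupefilter
        # drops URLs that have already been requested) but not parsed.
        Rule(LinkExtractor(allow=r"movie\.douban\.com",
                           deny=(r"/people", r"/celebrity", r"comment",
                                 r"update", r"add", r"trailer", r"cinema")),
             follow=True),
    )

    def parse_movie(self, response):
        # The extraction logic from parse_detail above would go here;
        # this only shows the shape of the callback.
        title = response.xpath(
            "//span[@property='v:itemreviewed']/text()").extract_first()
        yield {"title": title, "origin_url": response.url}

Note that a CrawlSpider must not override parse itself, because CrawlSpider uses it internally to apply the rules; the Redis-based deduplication in the original spider is also unnecessary here, since Scrapy's scheduler already filters duplicate requests within a run.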

Then set DEPTH_LIMIT = 0 in settings.py to make sure the spider crawls all the pages in the website.
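For reference, this is a single line in settings.py; 0 is also Scrapy's default value and simply means the crawl depth is unlimited:

# settings.py
DEPTH_LIMIT = 0  # 0 disables the depth limit, so following is bounded only by the Rules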

