Web crawler: seeking help with a couple of Scrapy questions in a project


Problem description

Question

Spider

import scrapy
from scrapy.http import Request
from urllib import parse

from ..items import JobBboleArticleItem  # module path is project-specific


class JobboleSpider(scrapy.Spider):
    name = "jobbole"
    allowed_domains = ["blog.jobbole.com"]
    start_urls = ['http://blog.jobbole.com/all-posts/',]

    def parse(self, response):
        try:
            # grab the highest page number from the pagination links;
            # the hard-coded index [2] depends on this page's exact markup
            max_page = int(response.css("a.page-numbers::text").extract()[2])
        except IndexError:
            print("max_page_error")
            return  # without a page count there is nothing to paginate

       

Question 1: the page_url built below for pg=1 is the same URL as start_urls[0], and Scrapy then reports a redirect. How should I optimize this?

        for pg in range(1, max_page + 1):
            # build the URL of each listing page
            page_url = "http://blog.jobbole.com/all-posts/page/" + str(pg) + "/"
            yield Request(url=page_url, callback=self.href_parse)  # was: self.href_parsen (typo)

    def href_parse(self, response):
        # collect each post's detail-page URL and cover image from the listing
        sub_node = response.css("div.post.floated-thumb > div.post-thumb")
        for sub in sub_node:
            image_url = sub.css("img::attr(src)").extract_first()
            post_href = sub.css("a::attr(href)").extract_first()
            # yield, not return: return would stop after the first post
            yield Request(url=post_href,
                          meta={"image_url": parse.urljoin(response.url, image_url)},
                          callback=self.detail_parse)

    def detail_parse(self, response):
        # parse the detail page
        item = JobBboleArticleItem()
        item["images_url"] = [response.meta.get("image_url", "")]
        # ... remaining field extraction omitted in the question ...
        yield item

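The item class used in detail_parse isn't shown in the question. A minimal sketch of what the code above assumes (the class name and the images_url field are taken from the spider code; any further fields would live alongside it in the project's items.py):

import scrapy


class JobBboleArticleItem(scrapy.Item):
    # list of cover-image URLs, filled from the meta dict in detail_parse
    images_url = scrapy.Field()
    # further article fields (title, date, tags, ...) are not shown in the question
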

Pipelines

import codecs
import json


class JobboleJsonPipeline(object):
    # hand-rolled JSON export
    def __init__(self):
        self.file = codecs.open("jobbole1.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def spider_closed(self, spider):
        self.file.close()

from scrapy.exporters import JsonItemExporter


class JobboleJsonExporterPinpeline(object):
    # export via Scrapy's built-in JsonItemExporter
    def __init__(self):
        self.file = open("jobbole2.json", "wb")
        self.exporter = JsonItemExporter(self.file, encoding="utf-8", ensure_ascii=False)
        self.exporter.start_exporting()  # writes the opening bracket of the JSON array

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()
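Neither pipeline runs unless it is registered in the project's settings.py. A minimal sketch (the ArticleSpider package name is an assumption; substitute your own project module):

# settings.py
ITEM_PIPELINES = {
    "ArticleSpider.pipelines.JobboleJsonPipeline": 300,
    # or the exporter variant:
    # "ArticleSpider.pipelines.JobboleJsonExporterPinpeline": 300,
}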
        

Question 2: when I delete the JSON file just exported from inside PyCharm, a dialog pops up (the screenshot did not make it into the post).

Looking forward to a solution!

Solution

class JobboleSpider(scrapy.Spider):
    name = "jobbole"
    allowed_domains = ["blog.jobbole.com"]
    start_urls = ['http://blog.jobbole.com/all-posts/',]

    def parse(self, response):
        # For Question 1: requesting .../page/1/ just redirects back to the
        # start URL, so reuse the response already in hand as page 1 and only
        # build explicit page URLs from page 2 onwards. dont_filter=True is
        # needed because this URL was already crawled and the dupefilter
        # would otherwise drop the request.
        yield Request(url=response.url, callback=self.href_parse, dont_filter=True)
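That covers Question 1. For Question 2, the dialog appears because jobbole1.json is still held open by the Python process: Scrapy automatically calls a pipeline method named close_spider, not one named spider_closed, so JobboleJsonPipeline never releases its file handle. The simplest fix is to rename the method to close_spider, as the exporter pipeline already does. If you want to keep the spider_closed name, connect it to the spider_closed signal yourself; a minimal sketch, assuming the same fields as the pipeline above:

import codecs
import json

from scrapy import signals


class JobboleJsonPipeline(object):
    def __init__(self):
        self.file = codecs.open("jobbole1.json", "w", encoding="utf-8")

    @classmethod
    def from_crawler(cls, crawler):
        # build the pipeline and hook its method onto the spider_closed
        # signal so Scrapy invokes it when the spider shuts down
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_closed, signal=signals.spider_closed)
        return pipeline

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def spider_closed(self, spider):
        self.file.close()  # releases the handle; the file can then be deleted

Once the handle is released (or the crawl process exits), PyCharm deletes the file without complaining.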

