Scrapy, custom method not called


Problem Description

I ran into a problem while parsing a web page with Scrapy: my custom method is never called. The URL is http://www.duilian360.com/chunjie/117.html, and the code is:

import scrapy
from shufa.items import DuilianItem

class DuilianSpiderSpider(scrapy.Spider):
    name = 'duilian_spider'
    start_urls = [
        {"url": "http://www.duilian360.com/chunjie/117.html", "category_name": "春联", "group_name": "鼠年春联"},
    ]
    base_url = 'http://www.duilian360.com'

    def start_requests(self):
        for topic in self.start_urls:
            url = topic['url']
            yield scrapy.Request(url=url)

    def parse(self, response):
        div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
        self.parse_paragraph(div_list)

    def parse_paragraph(self, div_list):
        for div in div_list:
            duilian_text_list = div.xpath('./text()').extract()
            for duilian_text in duilian_text_list:
                duilian_item = DuilianItem()
                duilian_item['category_id'] = 1
                duilian = duilian_text
                duilian_item['name'] = duilian
                duilian_item['desc'] = ''
                print('I reach here...')
                yield duilian_item

In the code above, the method parse_paragraph is never called: the print statement produces no output, and even with a breakpoint set on the print line, execution never steps into the method.

But if I move all the code from parse_paragraph into the calling method parse, as below, everything works. Why?

# -*- coding: utf-8 -*-
import scrapy
from shufa.items import DuilianItem

class DuilianSpiderSpider(scrapy.Spider):
    name = 'duilian_spider'
    start_urls = [
        {"url": "http://www.duilian360.com/chunjie/117.html", "category_name": "春联", "group_name": "鼠年春联"},
    ]
    base_url = 'http://www.duilian360.com'

    def start_requests(self):
        for topic in self.start_urls:
            url = topic['url']
            yield scrapy.Request(url=url)

    def parse(self, response):
        div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
        for div in div_list:
            duilian_text_list = div.xpath('./text()').extract()
            for duilian_text in duilian_text_list:
                duilian_item = DuilianItem()
                duilian_item['category_id'] = 1
                duilian = duilian_text
                duilian_item['name'] = duilian
                duilian_item['desc'] = ''
                print('I reach here...')
                yield duilian_item

    # def parse_paragraph(self, div_list):
    #     for div in div_list:
    #         duilian_text_list = div.xpath('./text()').extract()
    #         for duilian_text in duilian_text_list:
    #             duilian_item = DuilianItem()
    #             duilian_item['category_id'] = 1
    #             duilian = duilian_text
    #             duilian_item['name'] = duilian
    #             duilian_item['desc'] = ''
    #             print('I reach here...')
    #             yield duilian_item

My spider has many custom methods, and I don't want to move all of their code into the calling methods; that is not good practice.

Answer

Use yield from instead of calling parse_paragraph directly. Because parse_paragraph contains yield statements, it is a generator function: calling it merely creates a generator object, and none of its body runs until that generator is iterated, which is why the print never fires. yield from delegates iteration to the generator, so the items it yields are passed back to Scrapy:

    def parse(self, response):
        div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
        yield from self.parse_paragraph(div_list)
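
To see why the direct call did nothing, here is a minimal standalone sketch of the generator behavior (plain Python, independent of Scrapy; make_items and delegate are hypothetical names used only for illustration):

def make_items():
    # Generator function: this body runs only when the generator is iterated.
    print('I reach here...')
    yield {'name': 'demo'}

make_items()  # creates a generator object and discards it; prints nothing

def delegate():
    # yield from iterates make_items() and re-yields everything it produces
    yield from make_items()

for item in delegate():
    print(item)  # prints 'I reach here...' followed by {'name': 'demo'}

An equivalent fix is to loop explicitly, i.e. for item in self.parse_paragraph(div_list): yield item, but yield from (available since Python 3.3) is the idiomatic shorthand and forwards any yielded items or requests the same way.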
