Scrapy - LinkExtractor in control flow and why it doesn't work


Problem description

I'm trying to understand why my LinkExtractor doesn't work, and when it actually runs in the crawl loop.

This is the page I'm crawling.

  • There are 25 listings on each page, and their links are parsed in parse_page.
  • Each extracted link is then parsed in parse_items.

This script crawls the first page and the items on it without any problem. The problem is that it doesn't follow https://www.yenibiris.com/is-ilanlari?q=yazilim&sayfa=2 (sayfa means "page" in Turkish) or any of the next pages.

I think my Rule and LinkExtractor are correct, because when I tried allowing all links it didn't work either.
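
One way to sanity-check the pattern on its own, outside of the crawl loop, is to run the extractor by hand in scrapy shell. The following is only a minimal sketch: the URL and the allow pattern are taken from the question, the rest is plain LinkExtractor usage.

# First open the page in the shell: scrapy shell "https://www.yenibiris.com/is-ilanlari?q=yazilim"
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(allow=(r'.*&sayfa=\d+',))
for link in le.extract_links(response):  # `response` is provided by scrapy shell
    print(link.url)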

My questions are:

  • When are the LinkExtractors supposed to run in this script, and why are they not running?
  • How can I make the spider follow the next pages, parse them, and parse the items on them with LinkExtractors?
  • How can I implement parse_page with the LinkExtractor?

These are the relevant parts of my spider.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class YenibirisSpider(CrawlSpider):

    name = 'yenibirisspider'

    rules = (
        Rule(LinkExtractor(allow=(r'.*&sayfa=\d+',)),
             callback='parse_page',
             follow=True),
    )


    def __init__(self):
        super().__init__()
        self.allowed_domains = ['yenibiris.com']

        self.start_urls = [
            'https://www.yenibiris.com/is-ilanlari?q=yazilim',
        ]


    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                method='GET',
                callback=self.parse_page
            )

    def parse_page(self, response):
        items = response.css('div.listViewRowsContainer div div div.jobTitleLnk a::attr(href)').getall()
        for item in items:
            yield scrapy.Request(
                url=item,
                method='GET',
                callback=self.parse_items
            )

    def parse_items(self, response):

        # crawling the item without any problem here

        yield item

Recommended answer

I hate to answer my own question, but I think I figured it out. When I define the start_requests function with callback=self.parse_page, I am overriding the rules behavior: CrawlSpider only applies its rules to responses handled by its own default callback, so with a custom callback the LinkExtractor never gets to see those pages. When I remove the __init__ and start_requests functions, the spider works as intended.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class YenibirisSpider(CrawlSpider):

    name = 'yenibirisspider'

    start_urls = [
        'https://www.yenibiris.com/is-ilanlari?q=yazilim&sayfa=1',
    ]

    rules = (
        Rule(LinkExtractor(allow=(r'.*&sayfa=\d+',)), callback='parse_page', follow=True),
    )


    def parse_page(self, response):
        items = response.css('div.listViewRowsContainer div div div.jobTitleLnk a::attr(href)').getall()
        for item in items:
            yield scrapy.Request(
                url=item,
                method='GET',
                callback=self.parse_items
            )

    def parse_items(self, response):

        # crawling the item without any problem here

        yield item
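
If start_requests is actually needed (for example to send custom headers or cookies with the first request), the rules can still fire as long as the request is yielded without a custom callback, because the response then goes through CrawlSpider's default callback, which is the code path that applies the rules. Below is only a sketch of that variant: the Rule, URL and CSS selector come from the question, while the spider name and the headers value are illustrative placeholders.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class YenibirisWithStartRequestsSpider(CrawlSpider):

    name = 'yenibirisspider_with_start_requests'  # placeholder name for this variant

    rules = (
        Rule(LinkExtractor(allow=(r'.*&sayfa=\d+',)), callback='parse_page', follow=True),
    )

    def start_requests(self):
        # No callback here: the response is handled by CrawlSpider's default
        # callback, which applies the rules above and schedules the pagination links.
        yield scrapy.Request(
            url='https://www.yenibiris.com/is-ilanlari?q=yazilim&sayfa=1',
            headers={'User-Agent': 'Mozilla/5.0'},  # illustrative only
        )

    def parse_page(self, response):
        # Same listing extraction as in the question.
        for href in response.css('div.listViewRowsContainer div div div.jobTitleLnk a::attr(href)').getall():
            yield scrapy.Request(url=href, callback=self.parse_items)

    def parse_items(self, response):
        # Item parsing as in the question.
        pass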
