Scrapy - LinkExtractor in control flow and why it doesn't work
Problem description
I'm trying to understand why my LinkExtractor doesn't work and when it actually runs in the crawl loop.
This is the page I'm crawling.
- There are 25 listings on each page, and their links are parsed in parse_page.
- Each crawled link is then parsed in parse_items.
This script crawls the first page and the items on it without any problem. The problem is that it doesn't follow https://www.yenibiris.com/is-ilanlari?q=yazilim&sayfa=2 (sayfa means page in Turkish) or any of the following pages.
I think my Rule and LinkExtractor are correct, because when I tried allowing all links it didn't work either.
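As a rough sanity check outside Scrapy: the allow pattern is matched against the absolute URL, so a plain re.search against the pagination address from above suggests the pattern itself is fine (a minimal sketch, assuming the URLs shown in this question):

import re

# Rough standalone check that the rule's allow pattern matches the
# pagination URLs the spider is expected to follow.
pattern = r'.*&sayfa=\d+'
urls = [
    'https://www.yenibiris.com/is-ilanlari?q=yazilim',          # first page, no sayfa parameter
    'https://www.yenibiris.com/is-ilanlari?q=yazilim&sayfa=2',  # second page
]
for url in urls:
    print(url, bool(re.search(pattern, url)))
# The second URL matches, so the regular expression is not the problem.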
My questions are:
- When are the LinkExtractors supposed to run in this script, and why are they not running?
- How can I make the spider follow the next pages, parse them, and parse the items in them with LinkExtractors?
- How can I implement parse_page with a LinkExtractor?
These are the relevant parts of my spider.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class YenibirisSpider(CrawlSpider):
    name = 'yenibirisspider'

    # Follow pagination links such as ...&sayfa=2 and parse each page.
    rules = (
        Rule(LinkExtractor(allow=(r'.*&sayfa=\d+',)),
             callback='parse_page',
             follow=True),
    )

    def __init__(self):
        super().__init__()
        self.allowed_domains = ['yenibiris.com']
        self.start_urls = [
            'https://www.yenibiris.com/is-ilanlari?q=yazilim',
        ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                method='GET',
                callback=self.parse_page
            )

    def parse_page(self, response):
        # Collect the 25 listing links on the page and request each one.
        items = response.css('div.listViewRowsContainer div div div.jobTitleLnk a::attr(href)').getall()
        for item in items:
            yield scrapy.Request(
                url=item,
                method='GET',
                callback=self.parse_items
            )

    def parse_items(self, response):
        # crawling the item without any problem here (item construction omitted)
        yield item
Recommended answer
I hate to answer my own question, but I think I figured it out. When I define the start_requests function, I am probably overriding the rules behavior, so it didn't work. When I remove the __init__ and start_requests functions, the spider works as intended.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class YenibirisSpider(CrawlSpider):
    name = 'yenibirisspider'

    start_urls = [
        'https://www.yenibiris.com/is-ilanlari?q=yazilim&sayfa=1',
    ]

    # With no start_requests() override, CrawlSpider's built-in parse() handles
    # the start URL responses, so the rule below is applied to them.
    rules = (
        Rule(LinkExtractor(allow=(r'.*&sayfa=\d+',)), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        items = response.css('div.listViewRowsContainer div div div.jobTitleLnk a::attr(href)').getall()
        for item in items:
            yield scrapy.Request(
                url=item,
                method='GET',
                callback=self.parse_items
            )

    def parse_items(self, response):
        # crawling the item without any problem here (item construction omitted)
        yield item
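For completeness, here is a minimal sketch of how start_requests could be kept (for example, to build the query dynamically) without losing the rules: as long as the request is yielded without a custom callback, the response goes through CrawlSpider's default parse() and the LinkExtractor rule still fires. The spider name and the placeholder item below are assumptions for illustration, not part of the original code.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class YenibirisStartRequestsSpider(CrawlSpider):
    # Hypothetical variant: start_requests is kept, but no callback is passed,
    # so the response is handled by CrawlSpider's default parse() and the
    # pagination rule below is still applied.
    name = 'yenibiris_startrequests_spider'
    allowed_domains = ['yenibiris.com']

    rules = (
        Rule(LinkExtractor(allow=(r'.*&sayfa=\d+',)), callback='parse_page', follow=True),
    )

    def start_requests(self):
        # No callback argument here - that is what keeps the rules active.
        yield scrapy.Request('https://www.yenibiris.com/is-ilanlari?q=yazilim&sayfa=1')

    def parse_page(self, response):
        # Same selector as in the question.
        links = response.css('div.listViewRowsContainer div div div.jobTitleLnk a::attr(href)').getall()
        for link in links:
            yield scrapy.Request(url=link, callback=self.parse_items)

    def parse_items(self, response):
        yield {'url': response.url}  # placeholder item, real fields omitted

Note that rule callbacks only run for links extracted by the rules, not for the start URL response itself; if the listings on the very first page also need parsing, CrawlSpider's parse_start_url() hook can be overridden to delegate to parse_page.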