Scrapy: How to stop requesting in case of 302?
Question
I am using Scrapy 2.4 to crawl specific pages from a start_urls list. Each of those URLs presumably has 6 result pages, so I request them all.
In some cases, however, there is only one result page, and all the other paginated pages return a 302 to pn=1. In that case I do not want to follow the 302, nor do I want to keep requesting pages 3, 4, 5 and 6; I want to move on to the next URL in the list.
How can I exit (continue) this for loop on a 302/301, and how can I avoid following the 302?
def start_requests(self):
    for url in self.start_urls:
        for i in range(1, 7):  # 6 pages
            yield scrapy.Request(
                url=f'{url}&pn={i}'
            )

def parse(self, response):
    # parse page
    ...
    # recognize no pagination and somehow exit the for loop
    if not response.xpath('//regex'):
        # ... continue somehow instead of going to page 2
Answer
The main problem with your approach is that in start_requests we can't know ahead of time how many valid pages exist.
The common approach for this type of case is to schedule the requests one by one, chaining them instead of looping:
class somespider(scrapy.Spider):
    ...

    def start_requests(self):
        ...
        for u in self.start_urls:
            # schedule only the first page of each "query"
            yield scrapy.Request(url=f'{u}&pn=1', callback=self.parse)

    def parse(self, response):
        r_url, page_number = response.url.split("&pn=")
        page_number = int(page_number)
        ...
        if next_page_exists:
            yield scrapy.Request(
                url=f'{r_url}&pn={page_number + 1}',
                callback=self.parse)
        else:
            # something else
            ...
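With this chained approach the crawl tends to stop on its own: when page 2 returns a 302 back to pn=1, the redirected request targets a URL that was already crawled, so (with default settings) Scrapy's duplicate filter drops it and parse is never called again for that query. The URL-splitting step from the answer can also be sketched as a small standalone helper; the function names here are illustrative, not part of the original answer:

```python
def split_paged_url(url: str):
    """Split a '...&pn=N' URL into its base and the page number."""
    base, page = url.rsplit("&pn=", 1)
    return base, int(page)


def next_page_url(url: str) -> str:
    """Build the URL of the following result page."""
    base, page = split_paged_url(url)
    return f"{base}&pn={page + 1}"
```

In parse you would then yield `scrapy.Request(url=next_page_url(response.url), callback=self.parse)` only while the pagination marker is present. If you want the redirect not to be followed at all, Scrapy's RedirectMiddleware can be bypassed per request by passing `meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]}` when building the request; the 301/302 response is then delivered to the callback instead of being followed.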