How to yield in Scrapy without a request?


Question

I am trying to crawl a defined list of URLs with Scrapy 2.4, where each of those URLs can have up to 5 paginated URLs that I want to follow.

Now, although the system works, I do have one extra request that I want to get rid of:

Those pages are exactly the same but have different URLs:

example.html
example.html?pn=1

Somewhere in my code I make this extra request, and I cannot figure out how to suppress it.

Here is the working code:

Define a bunch of URLs to scrape:

start_urls = [
    'https://example...',
    'https://example2...',
]

Start requesting all of the start URLs:

def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(
            url=url,
            callback=self.parse,
        )

Parse the start URLs:

def parse(self, response):
    # request the first paginated page for this start URL
    url = response.url + '&pn=' + str(1)
    yield scrapy.Request(url, self.parse_item, cb_kwargs=dict(pn=1, base_url=response.url))

Go get all paginated URLs from the start URLs:

def parse_item(self, response, pn, base_url):
    self.logger.info('Parsing %s', response.url)
    if pn < 6:  # maximum level 5
        url = base_url + '&pn=' + str(pn + 1)
        yield scrapy.Request(url, self.parse_item, cb_kwargs=dict(base_url=base_url, pn=pn + 1))
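
Traced for a single start URL, the three methods above issue the following chain of requests. The first two responses are the same page, which is the extra request in question (a sketch using the post's placeholder URL; note that the post mixes ?pn= and &pn=, and that the pn < 6 check actually lets the chain run through &pn=6):

https://example...          <- start_requests
https://example...&pn=1     <- parse (same page as the bare URL: the duplicate)
https://example...&pn=2     <- parse_item (pn=1)
...
https://example...&pn=6     <- parse_item (pn=5)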

Answer

If I understand your question correctly, you just need to change the spider to start at ?pn=1 and ignore the URL without a pn parameter. Here's one option for how I would do it, which also only requires a single parse method.

start_urls = [
    'https://example...',
    'https://example2...',
]

def start_requests(self):
    for url in self.start_urls:
        # how many pages to crawl
        for i in range(1, 6):
            yield scrapy.Request(url=url + f'&pn={i}')

def parse(self, response):
    self.logger.info('Parsing %s', response.url)
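
For completeness, here is a minimal sketch of how the answer's pieces fit into a full spider. The class name and spider name are hypothetical, not from the original post; since no callback is passed to scrapy.Request, Scrapy routes each response to the spider's default parse method:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'  # hypothetical spider name

    start_urls = [
        'https://example...',
        'https://example2...',
    ]

    def start_requests(self):
        for url in self.start_urls:
            # request pn=1 .. pn=5 directly, so the bare URL
            # (the duplicate page) is never fetched
            for i in range(1, 6):
                yield scrapy.Request(url=url + f'&pn={i}')

    def parse(self, response):
        self.logger.info('Parsing %s', response.url)

Because every request now carries an explicit pn value, the scheduler sees five distinct URLs per start URL and the bare page is never requested a second time under a different URL.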
